Category Archives: Normalization

Normalization social skills

Security data normalization, like any other standardization effort, has a very human aspect to it. If you have ever discussed a schema with someone, you know the discussion can get emotional. You would think that deep technical issues are at stake, but it is usually the very basics of normalization that tend to start the commotion: field naming.

Why argue about field names?

I actually find it quite natural. While machine analytics can cope with any name, human users benefit from just the right field name. That is, if there were just one right field name… it is a subjective topic, after all. And this is why good normalization has to develop good social skills.

In ASIM, the Azure Sentinel Information Model, we introduce two techniques to make field names serve analysts better, whatever their taste: descriptive scenarios and aliases. Neither is groundbreaking technology; after all, they simply add some social skills to normalization.

Descriptive scenarios address normalizing the role each entity plays in an event. Most of the information conveyed in an event is about the entities: users, devices, files, processes, and more. But events often include more than one entity of the same type, and those are usually distinguished by a prefix: Src, Dst, and the like. Being ubiquitous, this is probably a good solution, but there are so many of them that it gets quite confusing. Destination or Target? Source, Actor, or Initiator?

In ASIM, we try to normalize the prefixes but still keep them intuitive. A user would be an Actor, but a host is a Source. As always with social skills, talking about it is important. Therefore, the documentation provides descriptive scenarios that make it easier for analysts to internalize the prefixes we selected. These are the scenarios for the User entity:

  • Create user – An Actor created or modified a Target User.
  • Modify user – An Actor renamed a Target User to an Updated User.
  • Sign-in – An Actor signed in to a system as a Target User.
  • E-mail – An Actor sent an email to a Target User.
  • Network connection – A process running as an Actor on the source host communicated with a process running as a Target User on the destination host.
  • DNS request – An Actor initiated a DNS query.
  • Process creation – An Actor (the user associated with the initiating process) initiated process creation. The created process runs under the credentials of a Target User (the user associated with the target process).

We hope that such scenarios will help analysts better understand who is who, especially in the more complex scenarios such as modifying a user or process creation.

Another intuitive concept we added – trivial, for that matter – is aliases. If we cannot agree on the best name, why not have two or even more?

We find that aliases are handy in several situations:

  • Getting rid of prefixes (and suffixes, while at it) – It is much easier to use “User”, “IpAddr”, “Dvc”, or “CommandLine” than the convoluted versions, say “DvcHostname” or “ActingProcessCommandLine”. Obviously, as discussed above, prefixes are important. However, a short name designating the most useful entity or entity attribute is very handy.
  • Not making a choice – sometimes a value means one thing to one group and another thing to others. For example, the DNS protocol field Query most often holds a domain name. It would be a Query for a DNS expert, while for a typical analyst it would be a Domain. So we allow both.
  • Backward compatibility – version management is not glamorous, but it is important. Sometimes you want to update a field name; keeping the old name available can be done with an alias.

Obviously, the underlying technology has to support aliases efficiently, without data duplication. Query-time normalization usually has an easier time supporting aliases than ingest-time normalization. This is a good reason to support query-time normalization, even alongside ingest-time capabilities.
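To make the idea concrete, here is a minimal, language-agnostic sketch of query-time aliasing over a dict-based event record. The field names and alias pairs are illustrative, not the actual ASIM alias table; the point is that every alias resolves to the same stored value, so nothing is duplicated.

```python
# Alias table: each alias maps to its canonical field name.
# The specific names below are illustrative examples, not the ASIM spec.
ALIASES = {
    "User": "TargetUsername",   # drop the prefix for the most useful entity
    "Domain": "DnsQuery",       # two names for the same value...
    "Query": "DnsQuery",        # ...so both communities are happy
    "IpAddr": "SrcIpAddr",
}

def get_field(event: dict, name: str):
    """Resolve a field by its canonical name or any of its aliases."""
    return event.get(ALIASES.get(name, name))

event = {"TargetUsername": "alice", "DnsQuery": "contoso.com"}
```

Because resolution happens at lookup time, adding a backward-compatibility alias after a rename is a one-line change to the table rather than a reprocessing of stored data.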

I would love to hear your thoughts about those and other areas in which normalization can be more social!

SIEM Normalization Dirty Secret: Values

When posting my first post on normalization on LinkedIn, I was pleasantly surprised that the ensuing discussion got to my favorite normalization topic: value normalization. Mehmet Ergene even linked to his interesting article on the topic.

That is because, to a large extent, I think that the missing piece in SIEM data normalization so far is value normalization. I was going to write a 40-page post covering everything about value normalization, but hey… you would not read it. So I will start with an example: our recent Azure Sentinel Registry schema.

Normalizing Registry events is one of the simpler normalization exercises. The registry is a Windows concept, and the events reported are always the same; just the reporting system changes. Compare that to, say, authentication events, which might inherently behave differently in different systems. Moreover, to start with, we created parsers only for Microsoft solutions that report on Registry activity.

As you will see, even in this simple exercise, value normalization is important and far from trivial.

Does the key fit the lock?

The most important field in a registry event is the key name. Keys in the registry are like folders in file systems. To understand what the event is about, you need the key.

However, the exact same key has different values when reported by different systems. For example:

  • Windows: \REGISTRY\MACHINE\SOFTWARE\Microsoft\Windows Defender\Signature Updates
  • Sysmon: HKLM\SOFTWARE\Microsoft\Windows Defender\Signature Updates\LastEmergencySigCheck
  • Defender: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Defender\Signature Updates

Each one is different! This certainly affects queries such as this, of which the relevant snippet is:

| where RegistryKey has_all ("HKEY_LOCAL_MACHINE", "Image File Execution Options")

The value “HKEY_LOCAL_MACHINE” would have to be different for each event source, as each system logs the key prefix differently. If we normalize only the key field name and not its value, queries would still have to account for the differences, and analysts would still have to understand the peculiarities of each source.

In the Azure Sentinel Information Model (ASIM), we require normalizing the key value, enabling the query above to work. However, the list of normalized options is not comprehensive, and this is exactly where the community can work together and help extend it.
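The mechanics of such key normalization amount to a prefix-rewriting table. Here is a minimal sketch, assuming we standardize on the HKEY_LOCAL_MACHINE-style prefixes shown above; the table lists only a few illustrative prefixes, not the full ASIM mapping.

```python
# Source-specific registry hive prefixes mapped to one canonical form.
# The prefix list is illustrative, not the complete ASIM table.
KEY_PREFIXES = {
    r"\REGISTRY\MACHINE": "HKEY_LOCAL_MACHINE",  # Windows Security events
    "HKLM": "HKEY_LOCAL_MACHINE",                # Sysmon
    r"\REGISTRY\USER": "HKEY_USERS",
    "HKU": "HKEY_USERS",
}

def normalize_key(key: str) -> str:
    """Rewrite a source-specific hive prefix to its canonical form."""
    for prefix, canonical in KEY_PREFIXES.items():
        if key.upper().startswith(prefix.upper() + "\\"):
            return canonical + key[len(prefix):]
    return key  # already canonical, or an unknown hive
```

With every parser applying this rewrite, a single `has_all ("HKEY_LOCAL_MACHINE", …)` filter works regardless of which system reported the event.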

You probably noticed that the Sysmon value has another difference: an additional part at the end. The reason is that Sysmon reports the key and the value (which is similar to a filename in file systems) together. This is not a value normalization challenge but rather an example of how field name mapping, usually considered the easy part of normalization, has complexities of its own. In this case, the Sysmon field has to be split and mapped to two different fields in the target schema.
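The split itself is simple: everything after the last backslash is the value name. A sketch, assuming illustrative target field names (RegistryKey and RegistryValue; the exact ASIM names may differ):

```python
def split_sysmon_target(target_object: str) -> dict:
    """Split Sysmon's combined key-plus-value string into the two
    target schema fields. Field names here are illustrative."""
    key, _, value_name = target_object.rpartition("\\")
    return {"RegistryKey": key, "RegistryValue": value_name}
```

Note that in a real parser this split would be combined with the key-prefix normalization discussed above, since Sysmon also uses the HKLM-style prefix.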

It’s a bird, it’s a plane, it’s a DWORD…

While not as central as the registry key, another value reported differently by different sources is the type of a registry value. I found different solutions reporting the same type as “Reg_DWord”, “Dword”, or “%%1876”.

The first two are easy to address if (and only if) one is aware of the issue: an analyst will surely get it, and analytics can search for “dword” as a substring.

The last option, “%%1876”, demonstrates a common value normalization challenge: the use of codes in events. “%%1876” is the Windows code for “Dword”. However, this is not something an analyst should have to know. In ASIM, we require normalizing the value to the first option (“Reg_DWord”), and as a byproduct, also ensure that the value is clear to analysts.

Another example of codes vs. labels is DNS logs. Most analytics based on DNS events use the reported error code. Here is the start of a typical Azure Sentinel DNS detection:

| where isnotempty(ResponseCodeName)
| where ResponseCodeName =~ "NXDOMAIN"

However, some DNS sources report the error code in numerical format, while others use a label. As IANA’s mapping shows, the code for NXDOMAIN is 3.
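A parser can bridge the two by translating numeric codes to their IANA names, so that the single `ResponseCodeName =~ "NXDOMAIN"` filter works for every source. A sketch, showing only a few of the IANA-assigned codes:

```python
# A small subset of the IANA DNS RCODE registry.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}

def normalize_response_code(raw) -> str:
    """Accept either a numeric code or an already-labeled value and
    return the uppercase IANA name."""
    try:
        return RCODE_NAMES.get(int(raw), str(raw))
    except (TypeError, ValueError):
        return str(raw).upper()
```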

But what is it all about anyway?

The previous examples demonstrate well what value normalization is. However, the most fundamental value normalization challenge, which is relevant to every schema, is the core fields that tell us what the event actually was:

  • Type: what activity was actually reported?
  • Result: was the activity successful or not?
  • Result Details: what was the reason for failure?
  • Action: the action performed by the reporting device. While not universal and typical only of security systems, it is common enough and important enough to include here.

Since most sources report only success for registry events, only the first is relevant in our example. But even there, value normalization is needed. The activity of deleting a value is represented as “DeleteValue”, “RegistryValueDeleted”, or “%%1906”, depending on the source.
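Normalizing the event type follows the same table-lookup pattern as the value-type example above. A sketch, assuming for illustration that “RegistryValueDeleted” is chosen as the canonical form; the inputs are the three representations just mentioned:

```python
# Source-specific representations of "a registry value was deleted",
# mapped to one canonical EventType. The canonical choice is illustrative.
EVENT_TYPE_MAP = {
    "deletevalue": "RegistryValueDeleted",
    "registryvaluedeleted": "RegistryValueDeleted",
    "%%1906": "RegistryValueDeleted",  # Windows code form
}

def normalize_event_type(raw: str) -> str:
    """Return the canonical event type, passing unknown values through."""
    return EVENT_TYPE_MAP.get(raw.strip().lower(), raw)
```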

Of this list, I find Result Details and Action to be important and mostly overlooked. We tried to tackle the former in the ASIM authentication schema by specifying allowed values for EventResultDetails. However, this is an area in which source devices differ widely, making it a real challenge.

Why should you care?

The topics presented above should help you better understand what value normalization is. There are other value normalization challenges, for example, ensuring a consistent format for time fields. There is also the adjacent problem of normalizing identifiers. All will be discussed in upcoming posts.
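To give one taste of the time-field challenge: sources commonly report timestamps as epoch numbers or as ISO 8601 strings. A minimal sketch, assuming only those two input shapes, converting everything to timezone-aware UTC:

```python
from datetime import datetime, timezone

def normalize_time(raw) -> datetime:
    """Convert an epoch number or ISO 8601 string to UTC.
    Only these two input shapes are assumed; real sources have more."""
    if isinstance(raw, (int, float)):
        return datetime.fromtimestamp(raw, tz=timezone.utc)
    return datetime.fromisoformat(str(raw)).astimezone(timezone.utc)
```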

But is this important to you?

Yes, it is. If values are not normalized, you cannot create source-agnostic analytics, and each query will have to handle each source’s peculiarities. Consequently, you will have to understand each source intimately. Obviously, this defeats one of the central goals of normalization.

You may think that the challenge lies in the schema definition. While no schema provides comprehensive support for value normalization, most pay at least lip service to it. What you should be checking is whether your parser, technical adapter, or app, even if marked as schema compliant, actually normalizes values.

Should we normalize security data?

I haven’t blogged for quite a while. Recently, I started spending my time on security research again, and the blogging itch is back. My current research focus is security data normalization, and in the next few posts, I will expand on this topic.

The first question that comes to mind is: why normalize? Should we normalize at all? That is, obviously, once we agree on what normalization is. So let’s start there.

At its core, normalization means that data collected from different sources should be converted to a uniform presentation or schema. Such a uniform schema enables analytics to be source agnostic. It also reduces the learning curve for analysts and enables them to be more productive. The article “SIEM Event Normalization Makes Raw Data Relevant to Both Humans and Machines” provides a good starting point for the rationale.

To deliver on the promise, SIEMs have tried to implement normalization since day one. ArcSight CEF and categorization, Splunk CIM, and QRadar LEEF are all normalization schemes.

Were they successful?

In his seminal blog post, “Security Correlation Then and Now: A Sad Truth About SIEM”, Anton Chuvakin claims that they were not. And I tend to agree. Want proof? If you are a serious security analyst, the number 4624 means something to you. Obviously, it is the Windows logon event. More precisely, a successful logon (4625 logs failures). You might also know that Logon Type 2 is “interactive”, or at least you know you need to consult Randy Franklin Smith’s excellent Ultimate Windows Security, which I have certainly used a lot. Or just Google for 4624.

In a perfect world, an analyst would not need to know about event 4624. The ArcSight categorization whitepaper mapped it to normalized categories already in the first decade of the millennium.

Now, how many people are conversant with 4624, and how many know the ArcSight categorization? How many systems analyze 4624 events, and how many support ArcSight’s, or an alternative, categorization scheme?

So Anton has a point.

Now back to 2021. My current research at Microsoft, leading the Azure Sentinel Information Model (ASIM) initiative, enables me to get back to the challenge ArcSight started tackling more than 20 years ago. And I hope this time to move the needle. Let there be a generation of security analysts who don’t know what 4624 is (and not because Windows will die).

As a starting point, we recently released the ASIM Authentication model, which includes a parser that normalizes 4624 events. I am sure it is not perfect, and we are already getting ideas for improvement.

In the upcoming blog posts, I will discuss how we try to make normalization work this time. I will address areas such as:

  • Value normalization
  • Entities, entity IDs, and entity descriptors
  • Aliasing

So let’s start the journey.