When I published my first post on normalization on LinkedIn, I was pleasantly surprised that the ensuing discussion turned to my favorite normalization topic: value normalization. Mehmet Ergene even linked to his interesting article on the topic.
To a large extent, I think that value normalization is the missing piece in SIEM data normalization so far. I was going to write a 40-page post covering everything about value normalization, but hey… you would not read it. So I will start with an example instead: our recent Azure Sentinel Registry schema.
Normalizing Registry events is one of the simpler normalization exercises. The registry is a Windows concept, and the events reported are always the same. Just the reporting system changes. Compare that to, say, authentication events, which might inherently behave differently in different systems. Moreover, to start with, we created parsers only for Microsoft solutions that report on Registry activity:
- Windows itself using event 4657,
- Sysmon events 12, 13, and 14,
- Microsoft Defender for Endpoint (Defender for short), using the advanced hunting DeviceRegistryEvents table.
As you will see, even in this simple exercise, value normalization is important and far from trivial.
Does the key fit the lock?
The most important field in a registry event is the key name. Keys in the registry are like folders in file systems. To understand what the event is about, you need the key.
However, the exact same key has different values when reported by different systems. For example:
| Source | Reported key |
| --- | --- |
| Windows | \REGISTRY\MACHINE\SOFTWARE\Microsoft\Windows Defender\Signature Updates |
| Sysmon | HKLM\SOFTWARE\Microsoft\Windows Defender\Signature Updates\LastEmergencySigCheck |
| Defender | HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Defender\Signature Updates |
Each one is different! This affects queries such as the following, of which the relevant snippet is:
imRegistry | where RegistryKey has_all ("HKEY_LOCAL_MACHINE", "Image File Execution Options")
The value “HKEY_LOCAL_MACHINE” would have to be different for each event source, as each system logs the key prefix differently. If we normalize only the key field name and not its value, queries still have to account for the differences, and analysts still have to understand the peculiarities of each source.
In the Azure Sentinel Information Model (ASIM), we require normalizing the key value, enabling the query above to work across sources. However, the list of supported options is not comprehensive, and this is exactly where the community can work together and help extend it.
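To make the idea concrete, here is a minimal Python sketch of key-prefix normalization. The prefix map and function name are illustrative assumptions, not the actual ASIM parser logic:

```python
# Map the prefix each source emits to a single canonical form.
# Illustrative only -- real sources emit more variants than listed here.
PREFIX_MAP = {
    "\\REGISTRY\\MACHINE": "HKEY_LOCAL_MACHINE",  # Windows event 4657 style
    "HKLM": "HKEY_LOCAL_MACHINE",                 # Sysmon style
    "\\REGISTRY\\USER": "HKEY_USERS",
    "HKU": "HKEY_USERS",
}

def normalize_key(key: str) -> str:
    """Rewrite the key prefix so every source uses the same canonical form."""
    upper = key.upper()
    for prefix, canonical in PREFIX_MAP.items():
        if upper.startswith(prefix):
            return canonical + key[len(prefix):]
    return key  # already canonical (Defender style) or unrecognized
```

With a transformation like this applied in each parser, the `has_all ("HKEY_LOCAL_MACHINE", ...)` query above works regardless of which source generated the event.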
You probably noticed that the Sysmon value has another difference: there is an additional part at the end. The reason is that Sysmon reports the key and the value (which is similar to a filename in file systems) together. This is not a value normalization challenge but rather an example of how field name mapping, usually considered the easy part of normalization, has its complexities. In this case, the Sysmon field has to be split and mapped to two different fields in the target schema.
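The split itself can be sketched in a few lines of Python. This assumes a value-set or value-delete event, where the last path component is the value name; key-only events (such as Sysmon event 12 for key creation) need different handling:

```python
def split_sysmon_target(target_object: str) -> tuple[str, str]:
    """Split Sysmon's combined key-plus-value string on the last backslash.

    A sketch: assumes the last component is the value name, which holds
    for value-set/delete events but not for key-only events.
    """
    key, _, value_name = target_object.rpartition("\\")
    return key, value_name
```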
It’s a bird, it’s a plane, it’s a DWORD…
While not as central as the registry key, another value reported differently by different sources is the type of a registry value. I found different solutions reporting the same type as “Reg_DWord“, “Dword“, or “%%1876“.
The first two are easy to address if (and only if) one is aware of the issue: an analyst will surely get it, and analytics can search for “dword” as a substring.
The last option, “%%1876“, demonstrates a common value normalization challenge: the use of codes in events. “%%1876” is the Windows code for “Dword“. However, this is not something an analyst should have to know. In ASIM, we require normalizing the value to the first option (“Reg_DWord“), which, as a byproduct, also ensures that the value is clear to analysts.
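A lookup-table sketch in Python shows how small this fix is once you know it is needed. The map covers only the three variants mentioned above; real sources emit more:

```python
# The three DWORD representations mentioned above, mapped to the
# canonical ASIM form. Illustrative -- not an exhaustive variant list.
VALUE_TYPE_MAP = {
    "reg_dword": "Reg_DWord",
    "dword": "Reg_DWord",
    "%%1876": "Reg_DWord",  # Windows message code for a DWORD type
}

def normalize_value_type(raw: str) -> str:
    """Return the canonical type name, or the input when unrecognized."""
    return VALUE_TYPE_MAP.get(raw.strip().lower(), raw)
```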
Another example of codes vs. labels is DNS logs. Most analytics based on DNS events use the reported error code. Here is the start of a typical Azure Sentinel DNS detection:
imDns | where isnotempty(ResponseCodeName) | where ResponseCodeName =~ "NXDOMAIN" ...
However, some DNS sources report the error code in numerical format, while others use a label. As IANA’s mapping shows, the code for NXDOMAIN is 3.
But what is it all about anyway?
The previous examples demonstrate well what value normalization is. However, the most fundamental value normalization challenge, which is relevant to every schema, is the core fields that tell us what the event actually was:
- Type: what activity was actually reported?
- Result: was the activity successful or not?
- Result Details: what was the reason for failure?
- Action: the action performed by the reporting device. While not universal and only typical to security systems, it is common enough and important enough to include it here.
Since most sources report only success for registry events, only the first is relevant in our example. But even there, value normalization is needed: the activity of deleting a value is represented as “DeleteValue”, “RegistryValueDeleted”, or “%%1906”.
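Normalizing the event type follows the same lookup pattern as the value-type example. The canonical name below is chosen for illustration, not quoted from the ASIM specification:

```python
# The three representations of "a registry value was deleted" mentioned
# above, mapped to one canonical EventType. Canonical name is assumed.
EVENT_TYPE_MAP = {
    "deletevalue": "RegistryValueDeleted",
    "registryvaluedeleted": "RegistryValueDeleted",
    "%%1906": "RegistryValueDeleted",
}

def normalize_event_type(raw: str) -> str:
    """Return the canonical event type, or the input when unrecognized."""
    return EVENT_TYPE_MAP.get(raw.strip().lower(), raw)
```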
Of the list, I find that Result Details and Action are important and mostly overlooked. We tried to tackle the former in the ASIM authentication schema by specifying values for EventResultDetails. However, this is an area in which source devices differ widely, making it a real challenge.
Why should you care?
The topics presented above help you understand better what value normalization is. There are other value normalization challenges, for example, ensuring the time fields format is consistent. There is also an adjacent problem of normalizing identifiers. All will be discussed in upcoming posts.
But is this important to you?
Yes, it is. If values are not normalized, you cannot create source-agnostic analytics, and each query has to handle each source’s peculiarities. Consequently, you have to understand each source intimately, which defeats one of the central goals of normalization.
You may think that the challenge lies in the schema definition. While no schema provides comprehensive support for value normalization, most pay at least lip service to it. What you should check is whether your parser, technical adapter, or app, even if marked as schema compliant, actually normalizes values.