Normalization social skills

Security data normalization, like any other standardization effort, has a very human aspect to it. If you ever discussed a schema with someone, you know the discussion can get emotional. You would think that deep technical issues are at stake, but it is usually the very basics of normalization that tend to start the commotion: field naming.

Why argue about field names?

I actually find it quite natural. While machine analytics can cope with any name, human users benefit from just the right field name. That is, if there were just one right field name… it is a subjective topic, after all. And this is why good normalization has to develop good social skills.

In ASIM, the Azure Sentinel Information Model, we introduce two techniques to try to make field names serve analysts better, whatever their taste is: Descriptive Scenarios and Aliases. Neither is groundbreaking technology. After all, they try to add some social skills to normalization.

Descriptive scenarios address normalizing the role each entity plays in an event. Most of the information conveyed in an event is about the entities: users, devices, files, processes, and more. But events often include more than one entity of the same type, and those are usually designated by a prefix: Src, Dst, and the like. Being ubiquitous, prefixes are probably a good solution, but there are so many of them that they become quite confusing. Destination or Target? Source, Actor, or Initiator?

In ASIM, we try to normalize but still keep the prefix intuitive. A user would be an Actor, but a host is a Source. As always with social skills, talking about it is important. Therefore, we provide descriptive scenarios in the documentation that make it easier for analysts to internalize the prefixes we selected. These are the scenarios for the User entity:

  • Create User – An Actor created or modified a Target User.
  • Modify User – An Actor renamed a Target User to an Updated User.
  • Sign-in – An Actor signed in to a system as a Target User.
  • E-Mail – An Actor sent an email to a Target User.
  • Network connection – A process running as an Actor on the source host communicated with a process running as a Target User on the destination host.
  • DNS request – An Actor initiated a DNS query.
  • Process creation – An Actor (the user associated with the initiating process) initiated process creation. The created process runs under the credentials of a Target User (the user related to the target process).

We hope that such scenarios will help analysts better understand who is who, especially in the more complex scenarios such as modifying a user or process creation.
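To give a flavor of how these prefixes surface in practice, here is a hypothetical hunting query over the ASIM Authentication schema. The `imAuthentication` parser name and the threshold are illustrative, but the query reads naturally once you know that the Actor signs in as the Target User:

```kql
// Find accounts used to sign in as many different target users in an hour -
// a possible credential-abuse pattern. A sketch, not a shipped detection.
imAuthentication
| where EventResult == "Failure"
| summarize TargetCount = dcount(TargetUsername) by ActorUsername, bin(TimeGenerated, 1h)
| where TargetCount > 10
```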

Another intuitive – indeed, almost trivial – concept we added is Aliases. If we cannot agree on the best name, why not have two or even more?

We find that aliases are handy in several situations:

  • Getting rid of prefixes (and suffixes, while we are at it) – It is much easier to use “User”, “IpAddr”, “Dvc”, or “CommandLine” than the convoluted versions, say “DvcHostname” or “ActingProcessCommandLine”. Obviously, as discussed above, prefixes are important. However, a short name designating the most useful entity or entity attribute is very handy.
  • Not making a choice – sometimes a value means one thing to one group and another thing to others. For example, the DNS protocol field Query most often holds a domain name. It would be a Query for a DNS expert, while for a typical analyst it would be a Domain. So we allow both.
  • Backward compatibility – version management is not glamorous, but it is important. Sometimes you need to update a field name; maintaining the old name can be done using an alias.

Obviously, the underlying technology has to support aliases efficiently and not require data duplication. Query time normalization usually has an easier time than ingest time normalization in supporting aliases. This is a good reason to support query time normalization, even if alongside ingest time capabilities.
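To illustrate, here is a minimal sketch of how a query-time parser can expose aliases with no data duplication: each alias is just another projection of the same column. The source table and field names here are illustrative, not an actual ASIM parser.

```kql
// A query-time "parser" as a KQL function: the normalized name and the
// aliases all refer to the same underlying column - no data is duplicated.
let MyDnsParser = () {
    MyDnsSource
    | project-rename DnsQuery = QueryName
    | extend Domain = DnsQuery,   // alias for analysts who think "Domain"
             Query  = DnsQuery    // alias for the DNS expert
};
MyDnsParser
| where Domain endswith "contoso.com"
```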

I would love to hear your thoughts about those and other areas in which normalization can be more social!

SIEM Normalization Dirty Secret: Values

When I posted my first article on normalization to LinkedIn, I was pleasantly surprised that the ensuing discussion turned to my favorite normalization topic: value normalization. Mehmet Ergene even linked to his interesting article on the topic.


To a large extent, I think that the missing piece in SIEM data normalization so far is value normalization. I was going to write a 40-page post covering everything about value normalization, but hey… you would not read it. So I will start with an example instead: our recent Azure Sentinel Registry schema.

Normalizing Registry events is one of the simpler normalization exercises. The registry is a Windows concept, and the events reported are always the same; just the reporting system changes. Compare that to, say, authentication events, which might inherently behave differently in different systems. Moreover, to start with, we created parsers only for Microsoft solutions that report on Registry activity.

As you will see, even in this simple exercise, value normalization is important and far from trivial.

Does the key fit the lock?

The most important field in a registry event is the key name. Keys in the registry are like folders in file systems. To understand what the event is about, you need the key.

However, the exact same key has different values when reported by different systems. For example:

  • Windows: \REGISTRY\MACHINE\SOFTWARE\Microsoft\Windows Defender\Signature Updates
  • Sysmon: HKLM\SOFTWARE\Microsoft\Windows Defender\Signature Updates\LastEmergencySigCheck
  • Defender: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Defender\Signature Updates

Each one is different! This certainly affects queries such as this, of which the relevant snippet is:

| where RegistryKey has_all ("HKEY_LOCAL_MACHINE", "Image File Execution Options")

The value “HKEY_LOCAL_MACHINE” would have to be different for different event sources, as each system logs the key prefix differently. If we normalize only the key field name and not the value, queries would still have to account for the difference, and analysts would still need to understand the peculiarities of each source.

In the Azure Sentinel Information Model (ASIM), we require normalizing the key value, enabling the query above to work. However, the list of options is not comprehensive, and this is exactly where the community can work together and help extend it.
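A sketch of what this key normalization can look like in a query-time parser. The prefix list covers only the variants shown above, and the source table name is illustrative:

```kql
// Normalize the hive prefix to a single canonical form so that queries
// using "HKEY_LOCAL_MACHINE" match events from any source.
MyRegistrySource
| extend RegistryKey = case(
    RegistryKey startswith @"\REGISTRY\MACHINE",
        strcat("HKEY_LOCAL_MACHINE", substring(RegistryKey, strlen(@"\REGISTRY\MACHINE"))),
    RegistryKey startswith "HKLM",
        strcat("HKEY_LOCAL_MACHINE", substring(RegistryKey, strlen("HKLM"))),
    RegistryKey)
```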

You probably noticed that the Sysmon value has another difference: there is an additional part at the end. The reason is that Sysmon reports the key and the value (which is similar to a filename in file systems) together. This is not a value normalization challenge but rather an example of how field name mapping, usually considered the easy part of normalization, has its complexities. In this case, the Sysmon field has to be split and mapped to two different fields in the target schema.
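The Sysmon split can be sketched with a regular expression that separates the value name after the last backslash. `TargetObject` is the Sysmon field; the target field names follow the ASIM registry schema:

```kql
// Split the combined Sysmon key-plus-value-name into two schema fields.
MySysmonRegistryEvents
| extend RegistryKey   = extract(@"^(.*)\\[^\\]+$", 1, TargetObject),
         RegistryValue = extract(@"\\([^\\]+)$", 1, TargetObject)
```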

It’s a bird, it’s a plane, it’s a DWORD…

While not as central as the registry key, another value reported differently by different sources is the type of registry values. I found different solutions reporting the same type as “Reg_DWord”, “Dword”, or “%%1876”.

The first two are easy to address if (and only if) one is aware of the issue: an analyst will surely get it, and analytics can search for “dword” as a substring.

The last option, “%%1876”, demonstrates a common value normalization challenge: the use of codes in events. “%%1876” is the Windows code for “Dword”. However, this is not something an analyst should have to know. In ASIM, we require normalizing the value to the first option (“Reg_DWord”) and, as a byproduct, also ensure that the value is clear to analysts.
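The mapping itself is simple once the variants are known. A sketch, covering only the representations mentioned above, with an illustrative source table name:

```kql
// Map the source-specific type representations to the normalized form.
MyRegistrySource
| extend RegistryValueType = case(
    RegistryValueType in~ ("Dword", "%%1876"), "Reg_DWord",
    RegistryValueType)
```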

Another example of codes vs. labels is DNS logs. Most analytics based on DNS events use the reported error code. Here is the start of a typical Azure Sentinel DNS detection:

| where isnotempty(ResponseCodeName)
| where ResponseCodeName =~ "NXDOMAIN"

However, some DNS sources report the error code in numerical format, while others use a label. As IANA’s mapping suggests, the code for NXDOMAIN is 3.
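A normalizing parser for such a source therefore has to translate the numeric RCODE to its IANA label, so the detection above works unchanged. A sketch with just a few codes, over an illustrative source table:

```kql
// Translate numeric DNS response codes to their IANA labels.
MyDnsSource
| extend ResponseCodeName = case(
    ResponseCode == 0, "NOERROR",
    ResponseCode == 2, "SERVFAIL",
    ResponseCode == 3, "NXDOMAIN",
    strcat("Code", tostring(ResponseCode)))  // fallback for unmapped codes
```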

But what is it all about anyway?

The previous examples demonstrate well what value normalization is. However, the most fundamental value normalization challenge, which is relevant to every schema, is the core fields that tell us what the event actually was:

  • Type: what activity was actually reported?
  • Result: was the activity successful or not?
  • Result Details: what was the reason for failure?
  • Action: the action performed by the reporting device. While not universal and typical only of security systems, it is common enough and important enough to include here.

Since most sources report only success for registry events, only the first is relevant in our example. But even there, value normalization is needed: the activity of deleting a value is represented as “DeleteValue”, “RegistryValueDeleted”, or “%%1906”.
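Folding those variants into one normalized event type follows the same pattern as the earlier mappings. The normalized value, source field, and table names here are illustrative:

```kql
// Map the different "value deleted" representations to one EventType.
MyRegistrySource
| extend EventType = case(
    OriginalEventType in~ ("DeleteValue", "RegistryValueDeleted", "%%1906"),
        "RegistryValueDeleted",
    OriginalEventType)
```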

Of the list, I find that Result Details and Action are important yet mostly overlooked. We tried to tackle the former in the ASIM authentication schema by specifying values for EventResultDetails. However, this is an area in which source devices differ widely, making it a real challenge.

Why should you care?

The topics presented above help you better understand what value normalization is. There are other value normalization challenges, for example, ensuring a consistent format for time fields. There is also the adjacent problem of normalizing identifiers. All will be discussed in upcoming posts.

But is this important to you?

Yes, it is. If values are not normalized, you cannot create source-agnostic analytics, and each query will have to handle each source’s peculiarities. Consequently, you will have to understand each source intimately. Obviously, this defeats one of the central goals of normalization.

You may think that the challenge lies in the schema definition. While no schema provides comprehensive support for value normalization, most pay at least lip service to it. What you should be checking is whether your parser, technical adapter, or app, even if marked as schema compliant, actually normalizes values.

Should we normalize security data at all?

I haven’t blogged for quite a while. Recently I started spending my time again on security research, and the blogging itch is back. My current research focus is security data normalization, and in the next few posts, I will expand on this topic.

The first question that comes to mind is: why normalize? Should we normalize at all? That is, obviously, once we agree on what normalization is. So let’s start there.

At its core, normalization means that data collected from different sources should be converted to a uniform presentation or schema. Such a uniform schema enables analytics to be source agnostic. It also reduces the learning curve for analysts and enables them to be more productive. The article “SIEM Event Normalization Makes Raw Data Relevant to Both Humans and Machines” provides a good starting point for the rationale.

To deliver on the promise, SIEMs have tried to implement normalization since day one. ArcSight CEF and categorization, Splunk CIM, and QRadar LEEF are all normalization schemes.  

Were they successful?

In his seminal blog post, “Security Correlation Then and Now: A Sad Truth About SIEM”, Anton Chuvakin claims that they were not. And I tend to agree. Want proof? If you are a serious security analyst, the number 4624 means something to you. Obviously, it is the Windows logon event. More precisely, a successful logon (4625 logs failures). You might also know that Logon Type 2 is “interactive”, or at least you know that you need to consult Randy Franklin Smith’s excellent Ultimate Windows Security. I have certainly used it a lot. Or just Google for 4624.

In a perfect world, an analyst would not need to know about event 4624. The ArcSight categorization whitepaper already mapped it to a generic categorization in the first decade of the millennium.

Now, how many people converse in 4624, and how many know the ArcSight categorization? How many systems analyze 4624 events, and how many support ArcSight’s, or an alternative, categorization scheme?

So Anton has a point.

Now back to 2021. My current research at Microsoft, leading the Azure Sentinel Information Model (ASIM) initiative, enables me to get back to the challenge ArcSight started tackling more than 20 years ago. And I hope this time to move the needle. Let there be a generation of security analysts who don’t know what 4624 is (and not because Windows will die).

As a starting point, we recently released the ASIM Authentication model, which includes a normalizing 4624 parser. I am sure it is not perfect, and we are already getting ideas for improvement. 
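To give a flavor of what such a parser does, here is a much-simplified sketch of the mapping. The real parser handles many more fields and event IDs; this fragment only shows the idea:

```kql
// Normalize Windows logon events so analysts query results, not numbers.
SecurityEvent
| where EventID in (4624, 4625)
| extend
    EventType   = "Logon",
    EventResult = iff(EventID == 4624, "Success", "Failure")
| project-away EventID   // the analyst no longer needs to know 4624
```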

In the upcoming blog posts, I will discuss how we try to make normalization work this time. I will address areas such as:

  • Value normalization
  • Entities, entity IDs, and entity descriptors
  • Aliasing

So let’s start the journey.

Keeping Ahead of the Hackers

While my posts are typically more focused, Andy Green, who manages digital content at Varonis, thought it would be a good idea to share some thoughts on the evolution of the threat landscape over the years: how attack techniques evolved, the changes brought by the dark web and the economics of hacking, what we, the defenders, are doing – wrong or right – and what we should do better.

So if you are into some techno-philosophical thoughts about cybersecurity, here it is:

Brute Force: Anatomy of an Attack

I am back to blogging, but my blog posts now appear on the Varonis blog. I will keep publishing links to those posts here for my loyal followers.

This time:

The media coverage of NotPetya has hidden what might have been a more significant attack: a brute force attack on the UK Parliament. While for many it was simply fertile ground for Twitter Brexit jokes, an attack like this, targeting a significant government body, is a reminder that brute force remains a common threat to be addressed.

It also raises important questions as to how such an attack could have happened in the first place.  These issues do suggest that we need to look deeper into this important, but often misunderstood type of attack.

Read more…

Bobby Tables’ real-life counterpart

If you are a member of the application security community, you are bound to know this hilarious xkcd cartoon. It is so good that it found its way to non-expert circles. I once got it physically framed as a birthday present from friends.


Like most of you, I thought that this is a great way to explain SQL injection. For most of us, that is what it is. For a few, it is a real-life problem.

My dear friend Or Katz published an even more hilarious blog post outlining the challenges of someone who happens to have a first name which is an SQL keyword. His post is also a very good discussion of the use, or rather abuse, of signatures for web application security. A great and worthwhile read.

Anniversary to the ModSecurity Core Rule Set celebrated with a new major release

I have a very warm place reserved for the ModSecurity Core Rule Set (CRS). I created it a decade ago. Actually, the first release in the readme file, labeled 1.1, is dated October 2006, so this is an anniversary. And what a great anniversary present I got from Chaim Sanders, Walter Hop, and my dear friend Christian Folini: a brand new major release!

If you don’t know what the CRS is, a short introduction is due. Continue reading

The WAF Guidebook: What is a Web Application Firewall?

Simply put, Web Application Firewalls are security controls designed to provide the best automated operational protection for HTTP applications, whether web-based or mobile. What “the best” protection, or even “sufficient” protection, means is not a simple question. As a result, there is a spectrum of solutions for protecting web applications, with varying quality and functionality. Which of them can call itself a web application firewall is not an easy question to answer.

Probably the only way to define a web application firewall is to list the key features common to web application firewalls that are uniquely suited for protecting web and mobile applications and that differentiate them from other operational security controls such as intrusion prevention systems and network firewalls. The following sections touch on those key features of WAFs. A fuller discussion of the features will follow in later posts. Continue reading

Sorting out web automation attacks

If anything makes web application security different from, and more interesting than, traditional information security, it is threats to the application logic, i.e. attacks that abuse legitimate functionality. Such attacks often raise legal and ethical questions: if this is legitimate functionality, can it be an attack? Ethical questions aside, there is no question that click fraud, scraping, and comment spam cause real pain and financial damage to web site owners.

The new OWASP automated threat handbook tries to sort out this field and define an ontology for web automation attacks and for countermeasures.

My own presentation on the topic takes a different approach: there is no real dividing line between valid and malicious automation. It is a continuum. I scored each automation technique for “obviousness”, i.e. how clear it is that the activity is automated rather than manual, and for maliciousness. Based on the scores, I split the techniques into obviously malicious, accepted, and borderline. For example, given a 1-5 scale (1 being not obvious/not malicious, 5 being obvious/malicious):

  • “Auction sniping” gets 2 for obviousness and 3 for maliciousness – which makes it borderline.
  • “Web spam” gets 3 for obviousness and 4 for maliciousness – the extra points put it in the malicious category.
  • At the edges, blind SQL injection gets 5 and 5 (so it is extra malicious), while comparative shopping gets 1 for maliciousness, as it has become a standard in the industry – which does not imply the “attacked” web site is not negatively impacted.

Read the presentation to get the scores for all of them and learn what “Queue Jumping”, “auction sniping”, and “web spam” are!