Machine learning by ESET: The road to Augur

Machine learning (ML) in eight blogposts?! In truth, we’ve only just scratched the surface when it comes to the potential of ML in cybersecurity. That said, readers should now be better able to separate fact from fiction, and marketing from actual function. So, last but not least, let’s take a peek under the hood of ESET’s cybersecurity engine and its ML gears.

Our experts have been playing with machine learning for more than 20 years – with neural networks making their first appearance in our products in 1997. Since then there have been numerous internal projects aimed at automating security analysis, helping us categorize the virtual world into the good, the bad and the ugly (or even grey areas containing potentially unwanted applications or PUAs, if you will).

One of our early efforts was an automated expert system, designed for mass processing. In 2006, it was quite simple and helped us process part of the growing number of samples and cut the immense workload of our detection engineers. Over the years, we have perfected its abilities and made it a crucial part of the technology responsible for the initial sorting and classification of the hundreds of thousands of items we receive every day from sources such as our worldwide network ESET LiveGrid®, security feeds and our ongoing exchange with other security vendors.

Another ML project has been running under ESET’s hood since 2012, placing all the analyzed items on “the cybersecurity map” and flagging those that require more attention. Interestingly, it was exactly this system that did a great job in the recent WannaCryptor case, alerting us in the earliest phases about the soon to be wildly spreading ransomware file. Despite already having a network detection for the EternalBlue exploit, this system helped ESET make additional detections that further improved the protection of our users.

However, machine learning is a tricky beast and not all our efforts have gone according to plan. Older projects focused on automating the creation of broader DNA detections from previously known detections, determining URL reputations, or finding the “nearest neighbors” of samples. Eventually, these were outperformed by other, more effective means, or replaced thanks to further development.

However, all of this helped us gain experience, and, piece-by-piece, has paved the way for what we have today – a mature, real-world application of machine learning technology in the cloud, as well as on clients' endpoints.

Meet Augur, our ML beast

At ESET we love ancient history – the company is named after an Egyptian goddess, after all – so that was naturally where we looked when it came to naming our machine learning engine. In Ancient Rome, augur was a term used for religious officials who observed natural signs and interpreted these as an indication of divine approval or disapproval of a proposed action. The analogy with cybersecurity is not hard to draw, but in contrast with the alchemy-natured augurs back then, our Augur bases its decisions on science, mathematics and previous experience.

Now for the technical part. ESET’s Augur ML engine couldn’t have materialized without three main factors:

With the arrival of big data and cheaper hardware, machine learning was made more affordable – be it for medical purposes, autonomous cars or detections in cybersecurity.
The growing popularity of ML algorithms and the science behind them led to their broader technical application and made them available to anyone willing to implement them.
After three decades of fighting black-hats and their “products”, we have built a latter-day “Library of Alexandria” equivalent – of malware. This vast and highly organized database contains millions of extracted features and DNA genes of everything we’ve analyzed in the past. A great foundation to create a carefully chosen mix for Augur’s training set.

However, the boom of the above-named factors has also brought challenges. We have had to pick the best performing algorithms and approaches, as not all machine learning is applicable to the highly specific security universe.

After much testing, we have settled on combining two methodologies that have proven effective so far:

Neural networks, specifically deep learning and long short-term memory.
Consolidated output of six precisely chosen classification algorithms.

Not clear enough? Imagine you have a suspicious executable file. Augur will first emulate its behavior and run a basic DNA analysis. Then it will use the gathered information to extract numeric features from the file, look at which processes it wants to run and look at the DNA mosaic in order to decide which category it fits best – clean, potentially unwanted or malicious. At this point, it is important to state that unlike some vendors who claim they do not need unpacking, behavioral analysis or emulation, we find these steps crucial for extracting data properly for machine learning. Otherwise – when data is compressed or encrypted – it’s just an attempt to classify noise.
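To make the "classifying noise" point a little more concrete, here is a minimal Python sketch. It is not Augur's code and says nothing about how ESET extracts features. It only compares the byte-level Shannon entropy of some structured data with that of the same data after zlib compression, which stands in here for a packer. The helper name byte_entropy and the toy word list are assumptions made up for this illustration.

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical stand-in for "structured" content: a stream of API-like words.
rng = random.Random(0)
words = ["open", "read", "write", "close", "socket", "connect", "send", "recv",
         "alloc", "free", "exec", "sleep", "mutex", "thread", "crypt", "hash"]
plain = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

# zlib compression stands in for packing; real packers and encryptors differ.
packed = zlib.compress(plain, 9)

print(f"plain : {len(plain)} bytes, {byte_entropy(plain):.2f} bits/byte")
print(f"packed: {len(packed)} bytes, {byte_entropy(packed):.2f} bits/byte")
```

The packed bytes sit much closer to the 8 bits per byte of random data than the original does, so statistics computed directly on them carry little of the structure a model could learn from. Unpacking or emulating first restores that structure.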

The group of classification algorithms has two possible setups:

The more aggressive one will label a sample as malicious if most of the six algorithms vote it as such. This is useful mainly for IT staff using ESET Enterprise Inspector, as it can flag anything suspicious and leave the final evaluation of the outputs to a competent admin.

The milder, or more conservative, approach declares a sample clean if at least one of the six algorithms comes to such a conclusion. This is useful for general purpose systems with a less expert overview.
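The two setups can be sketched in a few lines of Python. This is only an illustration of the voting rules as described above, not ESET's implementation. It assumes each of the six algorithms casts a simple clean-or-malicious vote (the potentially unwanted category is left out), it reads "most" as a strict majority so that a 3-3 tie is not flagged, and the names Verdict, aggressive_verdict and conservative_verdict are invented for the example.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    CLEAN = "clean"
    MALICIOUS = "malicious"


def aggressive_verdict(votes: Sequence[Verdict]) -> Verdict:
    """Malicious if a strict majority of the classifiers vote malicious.

    Assumption: a tie does not count as "most", so it falls back to clean.
    """
    malicious = sum(1 for v in votes if v is Verdict.MALICIOUS)
    return Verdict.MALICIOUS if malicious > len(votes) / 2 else Verdict.CLEAN


def conservative_verdict(votes: Sequence[Verdict]) -> Verdict:
    """Clean if at least one classifier votes clean, otherwise malicious."""
    if any(v is Verdict.CLEAN for v in votes):
        return Verdict.CLEAN
    return Verdict.MALICIOUS


# Hypothetical votes from six classifiers for a single sample.
votes = [Verdict.MALICIOUS] * 4 + [Verdict.CLEAN] * 2
print("aggressive  :", aggressive_verdict(votes).value)
print("conservative:", conservative_verdict(votes).value)
```

With four of six votes for malicious, the aggressive rule flags the sample while the conservative rule lets it pass. The conservative setup only raises an alarm when all six algorithms agree, which is the trade-off between catching more and keeping false positives low.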

We know visuals are everything today, so if the previous explanations weren’t clear enough, this chart might help:

Just to top it all off, we came across a presentation by Facebook describing their machine learning solution and it very much resembles Augur’s architecture – aiming to combine the best of the classification algorithms and neural networks.

Okay, so let’s move away from theory and look at the real-world results of ESET’s machine learning approach as applied to the recent malware attacks misusing the EternalBlue exploit and pushing both the WannaCryptor ransomware and CoinMiner malware families. Apart from our network detection and effective flagging by our other ML system, the Augur model also immediately identified samples of both families as malicious.

What’s more interesting, we also ran this test with a month-old Augur model that couldn’t have encountered these malware families anywhere before. This means that the detections were based solely on the information learned from the training set. And guess what? They were both correctly labeled as malicious.

Thirty years of progress and innovation in IT security have taught us that some things do not have an easy solution, especially in cyberspace, where change comes rapidly and the playing field can shift in a matter of minutes. Machine learning, even when wrapped up in shiny marketing-speak, won’t change that anytime soon. Therefore, we believe that even the best ML cannot replace skilled and experienced researchers: the people who built its foundations and who will continue to develop it. We're proud to say that many of these talented individuals work at ESET, helping protect users from future threats.

The whole series:

Editorial: Fighting post-truth with reality in cybersecurity

What is machine learning and artificial intelligence?

Most frequent misconceptions about ML and AI

Why ML-based security doesn’t scare intelligent adversaries

Why one line of cyberdefense is not enough, even if it’s machine learning

Chasing ghosts: The real costs of high false positive rates in cybersecurity

How updates make your security solution stronger

We know ML, we’ve been using it for over a decade