It is generally well-understood that antimalware programs—the software which detects computer viruses, worms, trojan horses and other threats to your system—work by scanning files using signatures they already have. A signature could be as simple as a string[i] (like using the "find" command in your word processor to locate a particular piece of text) or as complex as a tiny macro or subroutine which tells the scanning engine what to look for and where to find it.

Signature scanning works very well for detecting threats which have already been identified but how do antimalware programs detect new, previously unseen threats?  One of the methods used is heuristics. But what are heuristics, and how do they work? Randy Abrams, ESET's Director of Technical Education, finds the following definitions helpful in explaining heuristics:

Heuristic (from the Greek "Ε?ρ?σκω" for "find" or "discover") is an adjective for experience-based techniques that help in problem solving, learning and discovery.
SourceWikipedia
heuristic - a commonsense rule (or set of rules) intended to increase the probability of solving some problem
Source: Princeton University Wordnet
And for computer science: Heuristic - In computer science, a heuristic algorithm, or simply a heuristic, is an algorithm that is able to produce an acceptable solution to a problem in many practical scenarios, in the fashion of a general heuristic, but for which there is no formal proof of its correctness.
Source: Wikipedia

The science of heuristics studies how information is discovered and learned. It explains how one looks at problems and finds solutions to them by induction (as opposed to deduction). Often, a heuristic is a "rule of thumb" one might have learned.

In computer science, a heuristic is an algorithm which consistently performs quickly and/or provides good results. But for antimalware software, heuristics can also have a more specialized meaning: Heuristics refers to a set of rules—as opposed to a specific set of program instructions—used to detect malicious behavior without having to uniquely identify the program responsible for it, which is how a classic signature-based "virus scanner" works, i.e. identifying the specific computer virus or other program.

The heuristic engine used by an antimalware program might include rules for the following:

  • a program which tries to copy itself into other programs (in other words, a classic computer virus)
  • a program which tries to write directly to the disk
  • a program which tries to remain resident in memory after it has finished executing
  • a program which decrypts itself when run (a method often used by malware to avoid signature scanners)
  • a program which binds to a TCP/IP port and listens for instructions over a network connection (this is pretty much what a bot—also sometimes called drones or zombies—do)
  • a program which attempts to manipulate (copy, delete, modify, rename, replace  and so forth) files which are required by the operating system
  • a program which is similar to programs already known to be malicious

Some heuristic rules may have a heavier weight (and thus, score higher) than others, meaning that a match with one particular rule is more likely to indicate the presence of malicious software, as are multiple matches based on different rules.

Even more advanced heuristics might trace through the instructions in a program’s code before passing it to the computer’s processor for execution, allow the program to run in a virtual environment or "sandbox" to examine the behavior performed by and changes made to the virtual environment and so forth. In effect, antimalware software can contain specialized emulators that allow it to "trick" a program into thinking it is actually running on the computer, instead of being examined by the antimalware software for potential threats.

Keep in mind while the term "program" was used above, it does not necessarily mean executable programs such as .COM files or .EXE files. A heuristic engine could be examining processes and structures in memory, the data portion (or payload) of packets travelling over a network and so forth.

Likewise, a heuristic engine does not simply scan through files like a classic antivirus program looking for known patterns. It might trace through the instructions in a program before passing the code to the processor for execution, allow the program to run in a virtual environment or "sandbox" and examine the behavior performed in and changes made to the virtual environment and so forth.

The advantage of heuristic analysis of code is it can detect not just variants (modified forms) of existing malicious programs but new, previously-unknown malicious programs, as well.  Combined with other ways of looking for malware, such as signature detection, behavioral monitoring and reputation analysis, heuristics can offer impressive accuracy. That is, correctly detecting a high proportion of real malware yet exhibiting a low false positive alarm rate as well, since misdiagnosing innocent files as malicious can cause severe problems.

Understanding how heuristics work can be something of a specialty in the antimalware field. If you are interested and would like to know more about this field, I would suggest the Heuristic Analysis—Detecting Unknown Viruses white paper written by David Harley and Andrew Lee. For more technical examination of anti-malware technology, Peter Szor’s book The Art of Computer Virus Research and Defense, though several years old, is still worth reading.

Aryeh Goretsky, MVP, ZCSE
Distinguished Researcher

 


[i] In fact, the process is a lot more sophisticated than that, nowadays, but that is a starting point for understanding the first principles of signature scanning.