We have just come across a Buyer’s Guide published in the March 2010 issue of PC Pro Magazine, authored by Darien Graham-Smith, PC Pro’s Technical Editor. The article aims to advise consumers on which anti-malware product is best for them, and we acknowledge that it contains some good thoughts and advice. However, it also contains several significant methodological flaws; in fact, we were a little taken aback at some of the testing methodologies used. All the testing appears to have been performed exclusively in-house, and we believe that if it had been conducted by a specialist testing organization with years of experience focused primarily on objective anti-malware testing, the results might well have been very different, and certainly more convincing. We would like to respectfully point out some of the problematic assumptions and methods used in the March issue.

The methodology used to test the products’ detection capabilities, that is, their ability to protect against threats, was flawed. As an example, consider this quote:

“Every file has been positively identified as dangerous by at least four packages, so a good suite should detect most of them.”

This seems reasonable at first glance. But wait: the samples were never directly validated, i.e. nobody independently checked whether each file was actually malicious or innocent.

There are at least two false assumptions here. The first is that you can validate samples accurately simply by running them past one or more scanners and seeing whether they are detected. Mr. Graham-Smith is correct in thinking that requiring at least four scanners to identify each sample as malicious reduces the risk of false positives; however, it doesn’t eliminate it. It’s by no means unknown for an incorrect detection to cascade from one vendor to others if those vendors don’t re-validate it. As more vendors move towards an “in the cloud” model of detection by reputation, driven by the need to accelerate processing, it’s easy for a false positive to spread, at least in the short term. At least some of the files could have found their way into the test set from a database provided by one or more of the vendors, and subsequently been falsely detected as a virus by another product’s heuristics.
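To make the point concrete, here is a minimal sketch (our own illustration, not PC Pro’s actual tooling; the scanner names and function are hypothetical) of the kind of consensus rule the quoted sentence describes: a file is treated as malware if at least four products flag it. The sketch shows why such a rule can only count agreeing verdicts; it cannot tell whether those verdicts are independent or have simply been cascaded from one vendor’s database to another.

```python
# Sketch of a "detected by at least N scanners" consensus rule.
# Scanner names and verdicts are hypothetical illustrations.

from typing import Dict

CONSENSUS_THRESHOLD = 4  # "positively identified ... by at least four packages"

def is_treated_as_malware(verdicts: Dict[str, bool],
                          threshold: int = CONSENSUS_THRESHOLD) -> bool:
    """Return True if at least `threshold` scanners flagged the sample.

    Note: this only counts agreeing verdicts. If a false positive has
    cascaded from one vendor's database to others, every one of those
    verdicts is wrong in the same way, yet the threshold is still met.
    """
    return sum(1 for flagged in verdicts.values() if flagged) >= threshold

# Example: four scanners agree, so the sample passes the consensus rule,
# whether or not the underlying classification is actually correct.
sample_verdicts = {
    "scanner_a": True,
    "scanner_b": True,
    "scanner_c": True,
    "scanner_d": True,
    "scanner_e": False,
}
print(is_treated_as_malware(sample_verdicts))  # True
```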

However, there’s an even greater problem.

When a detection test uses default installation and configuration options, as this test did, it’s particularly important that samples are not only correctly identified but also correctly classified, because not all scanners treat all classifications of malware in the same way. While scanners take similar approaches to out-and-out malicious programs such as worms, viruses, bots, banking Trojans and so on, there are other types of application, such as some examples of adware, that can’t be described as unequivocally malicious.

Similarly, some legitimate programs use utilities such as packers and obfuscators, and it isn’t safe to assume that all anti-malware products treat such programs in the same way. Some products flag every packed program as malicious, while others discriminate on the basis of the code that’s actually present, not merely the presence of a packer. These “grey” applications and ambiguous cases may be classified as “Possibly Unwanted”, “Potentially Unsafe”, or even “Suspicious”.

Unlike many professional testing organizations, PC Pro does not consult with vendors about issues such as configuration before a test, and it does not give “missed” samples from its tests to the publishers of the products it tests. However, and to his credit, Darien Graham-Smith quickly responded to a request for further information with a list of file hash values for the samples he says we missed (18 out of 233) and, in all cases but one, the detection name given to each sample by one of our competitors. (A file hash such as an MD5 uses a cryptographic function to compute a value for a file that is unique to that file, at least in principle. In fact, it is possible, though very rare, for two different files to have the same hash value; this is known as a hash collision.) This enabled us to check our own collection for files corresponding to the sample set used by PC Pro.
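For readers unfamiliar with the mechanics, the sketch below shows roughly what such a check looks like, assuming a local directory of sample files and a list of hashes supplied by the tester. The directory path and hash value here are placeholders, not the actual PC Pro data, and the helper functions are our own illustration.

```python
# Sketch: matching local sample files against a list of hashes supplied
# by a tester. Paths and hash values are placeholders, not real data.

import hashlib
from pathlib import Path

def file_hash(path: Path, algorithm: str = "md5") -> str:
    """Compute the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def find_matches(sample_dir: Path, reported_hashes: set) -> dict:
    """Map each reported hash to a local file with the same digest, if any."""
    matches = {}
    for path in sample_dir.rglob("*"):
        if path.is_file():
            digest = file_hash(path)
            if digest in reported_hashes:
                matches[digest] = path
    return matches

# Usage (placeholder values):
# reported = {"d41d8cd98f00b204e9800998ecf8427e"}
# print(find_matches(Path("./sample_collection"), reported))
```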

When checking the samples the magazine claims we missed, we found some anomalies in the sample set. The random nature of the sample selection (including such oddities as a Symbian Trojan, an anomalous file version of a 1989 boot sector virus, packer detections, a detection of a damaged sample, and a commercial keylogger) gives serious cause for concern. We even found samples detected by some of our competitors under names like “not-a-virus: RemoteAdmin.PoisonIvy”. With fuzzy classifications like these, it’s unsurprising that many of these cases are not detected by default by all scanners. But where such samples are used, as was the case here, the accuracy of the test is compromised, since they introduce a bias in favour of products that don’t discriminate between possibly malicious and unequivocally malicious applications.

The Anti-Malware Testing Standards Organization (AMTSO - http://www.amtso.org) was established in May 2008 precisely to reduce unprofessional testing, skewed methodologies and the flawed results they produce. It is an international non-profit association focused on addressing the universal need for improvement in the objectivity, quality and relevance of anti-malware testing. Principle 5 of the AMTSO document “Fundamental Principles of Testing” (http://www.amtso.org/amtso---download---amtso-fundamental-principles-of-testing.html) states:

Testers must take reasonable care to validate whether test samples or test cases have been accurately classified as malicious, innocent or invalid.

It has often been the case in the world of Antivirus testing that seemingly reliable testing results were, in fact, not valid, because the samples used in the tests were misclassified. For example, if a tester determines that a product has a high rate of false positives, that result could be inaccurate if some samples were wrongly classified as innocent. Thus, it is our position that reasonable care must be taken to properly categorize test samples or test cases, and we especially encourage testers to revalidate test samples or test cases that appear to have caused false negative or false positive results.

Similarly, care should be taken to identify samples that are corrupted, non-viable or that may only be malicious in certain environments and conditions.

Yet another question about PC Pro’s testing methodology concerns the small sample size used in the test, just 233 files, and how those files were obtained. Since PC Pro’s validation of the test samples did not meet professional standards, no authoritative conclusions about the products’ detection capabilities can be drawn from this test.

The other detection testing method used by PC Pro was a dynamic test of web threats. Dynamic testing of infected websites is, to say the least, methodologically difficult (http://www.amtso.org/amtso---download---amtso-best-practices-for-dynamic-testing.html). We quote PC Pro again to illustrate this:

“For this month’s web-based test, we visited several hundred dodgy-looking websites. We identified 54 of them as potentially malicious, because those pages caused at least one security product to throw up an alert.”

This is problematic in itself: it introduces an immediate bias, because the validity of an alert from a single product is assumed without question.

It also has to be said that the web changes constantly, which means that web-hosted threats also change. So a question arises: did the tester use 15 parallel computers to test all of the solutions against a single site, serving the same malware, at exactly the same moment? Only if this principle is upheld can consistent results be ensured for each tested product.
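As a rough illustration of what such synchronisation involves, the sketch below (our own simplification; in a real lab each product would run on its own fully instrumented machine) uses a barrier so that every test client requests the same URL at the same instant and then compares what each client received. Without this kind of coordination, two products can easily be served different content from the same malicious site. The URL and client count are placeholders.

```python
# Sketch: fetching the same URL from several test clients at (nearly) the
# same instant. In a real dynamic test each client would be a separate
# machine running one product; this only illustrates the synchronisation
# requirement.

import threading
import urllib.request

NUM_CLIENTS = 15  # one per tested product (placeholder)
start_barrier = threading.Barrier(NUM_CLIENTS)
results = {}

def visit(client_id: int, url: str) -> None:
    start_barrier.wait()  # all clients are released at the same moment
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read()
        results[client_id] = hash(body)  # crude fingerprint of the content served
    except Exception as exc:
        results[client_id] = repr(exc)

# Usage (placeholder URL):
# threads = [threading.Thread(target=visit, args=(i, "http://example.test/suspect-page"))
#            for i in range(NUM_CLIENTS)]
# for t in threads: t.start()
# for t in threads: t.join()
# print(len(set(results.values())) == 1)  # did every client see identical content?
```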

The method used here seems very questionable: malware hosted on the web may change at very short intervals, and so may be different every time it is accessed. Moreover, the tester failed to validate the websites as genuinely malicious, yet he goes ahead and draws conclusions about the performance of the tested products based on these questionable parameters. In the methodology used, the author makes no attempt to establish which of the sites were dangerous, which were harmless, and which were simply offline.

We will shortly address, in another blog post, problems with the test’s methodology as regards product performance beyond raw detection. We have also asked Pierre-Marc Bureau and David Harley for more information on their expert analyses of the sample set used.

Ján Vrabec
Security Technology Analyst, ESET