HREF="http://nuclearelephant.com/papers/cormack.html"> HREF="http://www.nuclearelephant.com/papers/cormack.html"> Nuclear Elephant: Review of Gordon Cormack's Study of Spam Detection HREF="http://web.archive.org/web/20041012031359/http://nuclearelephant.com/base.css"> HREF="http://web.archive.org/web/20050216224823/http://www.nuclearelephant.com/base.css">
Papers  
 

Review of Gordon Cormack's Study of Spam Detection
Last Update: Saturday, June 26 2004, 22:00


HREF="http://www.nuclearelephant.com/projects/dspam/"> HREF="http://dspam.nuclearelephant.com"> BORDER=0 SRC="http://www.nuclearelephant.com/projects/dspam/dspam-button.gif" SRC="/images/dspam-button.gif" ALIGN=LEFT HSPACE=5> Jonathan A. Zdziarski
jonathan@nuclearelephant.com


Introduction

Many misled CS students, Ph.D.s, and professionals have jumped on the spam filtering bandwagon with the uncontrollable urge to perform misguided tests in order to grab a piece of the interest surrounding this area of technology. As quickly as Bayesian filtering (and other statistical approaches) popped up, equal levels of interest arose among many major groups (academic, press, and sysadmins, to name a few). With regret, much of the testing published on the Internet thus far has been little more than zeal without wisdom. This is due to the technology still being held fairly close to the vest by those implementing it (a result of its freshness and artistic complexity). This is not necessarily the fault of the testers; statistical filtering has grown to become much more than "A Plan for Spam", but unfortunately there is little useful documentation on the actual implementation and theory behind the latest improvements (one of the reasons I have written a book on the subject, scheduled for December 2004).

This article is a response to a research paper by Gordon Cormack and Thomas Lynam entitled "A Study of Supervised Spam Detection applied to Eight Months of Personal E-Mail". This paper was recently featured on Slashdot, which has unfortunately resulted in a large swarm of geeks with wrong information drawing wrong conclusions about statistical filtering. It is not my desire to flame the test or the testers, but there are many errors I believe need to be brought to light. The testing fails in many ways, and has unfortunately marred many superior statistical filters, such as CRM114 (which has proved its superiority time and time again). CRM114 isn't my puppy, but I do believe it to be one of, if not the, most accurate filters in the world. I haven't tried to get into very deep philosophical problems with the testing, although there are some, but have instead tried to provide a list of reasons the testing was performed incorrectly. Perhaps after reading this, the researchers might make another attempt to run a successful test. Until then, I'm afraid these results are less than credible.

Don't get me wrong, I'm glad to see that all the spam filters tested did very well. When we're measuring hundredths of a percent of accuracy, though, good enough doesn't cut it. The intricacies of testing can easily throw the results off by a point or two, which has a considerable impact. All of these filters do an excellent job at filtering spam, as proved by this test and others, but the test failed to conduct an adequate comparison due to many flaws. Paul Graham has spoken of conducting a bake-off at the next MIT Spam Conference. Hopefully, these tests provided some "gotchas" to watch for.

You may be wondering why I've taken the time to write an article about a failed test. Well, as of today the only successful way to beat a Bayesian filter is not to run one. Spammers can't get around the technology except by means of bad press, and if tests like this cause people to draw the incorrect conclusion that statistical filtering is ineffective, nobody's going to run one. Statistical filtering has shown itself, if anything, to be both adaptive and highly accurate. Unfortunately, people don't know how to test it yet in a way that will simulate real-world results.

The Challenge of Testing

Statistical filtering is considered by most as the next generation technology of spam filtering. Statistical filtering is dynamic, in that it learns from its mistakes and performs better with each correction. It has the unique ability to learn to detect new types of spams without any human intervention. This has all but obsoleted the many heuristic filters once considered mainstream, and even tools such as SpamAssassin have incorporated statistical components into their filters.

Statistical language analysis is unlike any other spam filtering approach we've seen, and because of this it tests differently from any other beast. The testing approaches used to measure heuristic spam filters are frequently and erroneously applied to statistical filters, resulting in poor results. The intricacies of machine learning require a more scientific approach than simply throwing mail at the filter, and even the most detailed approaches to testing such a tool only barely succeed in accomplishing a real-world simulation.

Modern day language classifiers face a very unique situation - they learn based on the environment around them. The problem therefore becomes one of maintaining a delicately controlled environment. When heuristic filtering was popular, there were many different ways to test it. Since the filter didn't base its decisions on previous results, just about any testing approach would produce an accurate set of results. The state of a statistical language classifier is similar to that of a sequential circuit in that the output is a combination of both the inputs and the previous state of the filter. The previous state of the filter is based on a previous set of inputs, which are based on a previous set of results, and so on.
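To make that state dependence concrete, here is a minimal toy sketch in Python (my own illustration, not the implementation of any filter discussed here) showing that an identical message can receive a different verdict depending entirely on what was trained before it:

    from collections import defaultdict
    from math import log

    class ToyIncrementalFilter:
        """Toy stateful classifier: its output depends on the inputs AND on
        every message it has previously learned (its 'previous state')."""

        def __init__(self):
            self.spam_tokens = defaultdict(int)   # token -> occurrences in spam
            self.ham_tokens = defaultdict(int)    # token -> occurrences in ham
            self.spam_msgs = 0
            self.ham_msgs = 0

        def train(self, tokens, is_spam):
            # Learning one message mutates the state every later verdict depends on.
            counts = self.spam_tokens if is_spam else self.ham_tokens
            for t in tokens:
                counts[t] += 1
            if is_spam:
                self.spam_msgs += 1
            else:
                self.ham_msgs += 1

        def spamminess(self, tokens):
            # Log-likelihood ratio with +1 smoothing; above 0 means "leans spam".
            score = 0.0
            for t in tokens:
                p_spam = (self.spam_tokens[t] + 1) / (self.spam_msgs + 2)
                p_ham = (self.ham_tokens[t] + 1) / (self.ham_msgs + 2)
                score += log(p_spam) - log(p_ham)
            return score

    f = ToyIncrementalFilter()
    f.train(["lunch", "tomorrow", "meeting"], is_spam=False)
    f.train(["free", "mortgage", "rates"], is_spam=True)
    print(f.spamminess(["mortgage", "rates"]))   # ~1.39: leans spam

    f.train(["mortgage", "rates", "closing"], is_spam=False)
    print(f.spamminess(["mortgage", "rates"]))   # ~0.58: same input, much weaker verdict

Reorder or omit a single training message and every subsequent score shifts, which is exactly why the ordering and completeness of the test corpus matter so much.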

Think of a controlled environment in terms of going to the supermarket every week; what you buy from visit to visit is based on what you have in your refrigerator. A single change in the environment, such as a snowstorm or milk going sour, can easily snowball and affect your grocery purchases for a few weeks. In the same way, a single change in a filter's environment can affect its results across many messages. With this in mind, the challenge of testing is to create an environment that simulates real-world behavior as closely as possible - after all, the accuracy we are trying to measure is how the filter will work in the real world. Suffice it to say that testing a statistical filter is no longer a matter of testing, but one of simulation. Simulating real-world behavior takes many factors into consideration that obsolete heuristic testing doesn't.

Of course, when you're not testing to measure accuracy, this type of simulation isn't always necessary. Chaos in message ordering and content may be appropriate when testing to compare features of a particular filter, or in any other kind of blind test where accuracy isn't as important as deviation, but that's not what this paper was testing. Some of these requirements were relayed to Cormack, and he did a fair job of implementing some of them. Others, he wasn't so lucky with.

Message Continuity

Message continuity refers to uninterrupted threads and their message content, dealing specifically with the set of test messages used. Threading is important to a statistical filter, as is message ordering. Statistical filters learn new data incrementally as it is mixed with already known data. As email evolves (spam and legitimate mail alike), characteristics slowly change. If the messages are presented out of order, incremental learning breaks, and a corpus where random messages are hand-picked will lead to degradation of the filter. This does not appear to have been a problem with the tests, as Cormack claimed to track the original ordering.

Consider this simplified example. Let's say message A is a spam about Rolex watches. Your filter learns that words like "Rolex" are spam. Message B then comes in, which is an email from a friend who is in the jewelry business. He talks about a few different types of jewelry he's ordered for the week. The message is accepted by the filter as legitimate, but only by the skin of its teeth, because he mentioned Rolex watches. In message C, you've decided to talk a little bit about Rolex watches, as you've always thought about getting one. Now, if message B was learned by the filter, then by now Rolex watches will be considered relatively neutral (could be spam, could be ham), because you've received legitimate communication about them. On the other hand, if message B was plucked out of the corpus so that it was never trained, then message C is going to be bit-bucketed, because the filter never learned that Rolex emails might not be spam.

In real life, there are many messages that make it in by the skin of their teeth (i.e. with a low confidence). A single token could, in some cases, affect the final result for a message. If the testers fail to train all of the messages from a user's corpus, they're going to be omitting data which could play a crucial role in evaluating "borderline" messages correctly. It's unclear whether Cormack maintained message continuity, as the corpus isn't available for analysis. It is implied that he did, but not stated.
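To put rough numbers on the Rolex example, here is a simplified Graham-style per-token probability (a sketch only; real filters clamp the value, weight ham hits, and combine many tokens) showing what happens to the token "rolex" when message B is or isn't trained:

    def token_probability(spam_hits, ham_hits, spam_msgs, ham_msgs):
        """Rough Graham-style per-token spam probability (illustrative shape only)."""
        if spam_hits + ham_hits == 0:
            return 0.5                       # never seen: neutral
        spam_freq = spam_hits / max(spam_msgs, 1)
        ham_freq = ham_hits / max(ham_msgs, 1)
        return spam_freq / (spam_freq + ham_freq)

    # Filter has seen 100 spams and 100 hams; "rolex" appeared in message A (spam).
    print(token_probability(1, 0, 100, 100))   # 1.0 -> message C gets bit-bucketed

    # Same filter, but message B (the jeweler friend's ham) was also trained:
    print(token_probability(1, 1, 100, 100))   # 0.5 -> "rolex" is now neutral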

Archive Window

As we've learned through Terry Sullivan's research at the MIT Spam Conference in 2004, spam evolves on the order of months. Other tests confirm this and show us that the seas change every 4 to 6 months on average. It's important to have a concurrent archive of mail if we're going to use one archive for training. Many tests out there today fail to use an adequate archive window, and try to test on two weeks of mail. What many fail to realize is that adaptive learning isn't really associated with the quantity of mail, but rather with the time span across which the filter is able to learn permutations. Repeating the same data over and over again won't teach a statistical classifier how to identify spam any better than putting beets on the table every night will teach a child to like beets more. Only after the filter is able to see the many different lexical permutations in a corpus spanning several months will it be able to filter effectively at high levels of accuracy.

Cormack's test used incremental learning without a training corpus, and so this too did not appear to be a significant issue - although I do believe his method of training in general was flawed by not using an archive window for pretraining (discussed later).

Purge Simulation

An area frequently disregarded in a statistical learning simulation is the purging of stale data. When the training corpus is learned, each message is trained within the same short period of time (usually a period of several minutes or hours). The usual method of purging that a particular filter might employ doesn't take place because all of the data trained is considered new. Purging is important because older data can affect the polarity of newer data. For example, if tokens that haven't been seen in four months reflect a spammy polarity, then are purged, and a couple new messages come in using those tokens in a legitimate way, then purging will allow the tokens to take on their most recent polarity. Without purging, the tokens would become fairly neutral and be eliminated from computation, ultimately affecting accuracy.

Another area purging affects (which directly affects accuracy) is the amount of data required to shift the polarity of tokens in the dataset during training or retraining. If an old record exists for a particular token with 100 data points, that record will take much more time to change polarity than a fresh record or one with few data points. By not purging the old data, it not only lingers but also causes the tokens to "stick" much more.
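Here is a minimal sketch of what a purge pass might look like (the record layout and the 90-day window are hypothetical, not DSPAM's actual schema), showing how a stale, heavily-seeded record makes a token "stick" to its old polarity until it is purged:

    import time

    DAY = 86400
    # Hypothetical token records: token -> [spam_hits, ham_hits, last_seen_epoch]
    tokens = {
        "refinance": [100, 2, time.time() - 120 * DAY],   # stale, strongly spammy
        "meeting":   [1, 40, time.time() - 2 * DAY],      # fresh, strongly hammy
    }

    def purge(records, max_age_days=90):
        """Drop records not seen within the window so newer data defines polarity."""
        cutoff = time.time() - max_age_days * DAY
        for tok in [t for t, rec in records.items() if rec[2] < cutoff]:
            del records[tok]

    def spam_polarity(spam_hits, ham_hits):
        return spam_hits / (spam_hits + ham_hits) if spam_hits + ham_hits else 0.5

    # Two new legitimate messages mention "refinance".
    # Without purging, the 100 stale spam hits make the token "stick":
    print(spam_polarity(100, 2 + 2))          # ~0.96 -- still looks like spam

    # With purging, the stale record is gone and the fresh data takes over:
    purge(tokens)
    old = tokens.get("refinance", [0, 0, 0.0])
    print(spam_polarity(old[0], old[1] + 2))  # 0.0 -- reflects current usage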

This test did not include any purge simulation at all, leaving some 49,000 messages trained as a composite in the wordlist.

Interleave

The interleave at which messages from the corpus are trained, corrected, and classified can play a dramatic role in the results of the test. Many tests are erroneously performed by feeding in two separate corpora - one of legitimate mail and one of spam. Some tests use a 1:1 interleave, while others try their best to simulate a real-world scenario.

Going back to our original example on message continuity, think about the effect of training only one nonspam about jewelry before focusing on Rolex watches, as opposed to training ten. If you discussed Rolex watches at length before receiving more Rolex spam, then chances are the messages will be classified correctly. On the other hand, interrupting this flow will, at the very least, cause the filter to respond in an artificial way rather than how it would in real-world scenarios.

The original ordering of the messages in the corpus will generally yield the most realistic results. Cormack claims to have preserved the message ordering, and Lynam confirmed recently that the spam and ham were kept in the same file, so it looks like interleave was preserved in this test.

Corrective Training Delay

The delay in retraining classification errors is probably one of the most difficult characteristics to simulate. When a misclassification occurs, the user doesn't immediately report it - several other messages are likely to come in before the user checks their email and corrects the error. What's more, submitting an error changes data - which could cause more errors in some cases. Delay creates either a snowballing effect or a delay in mistraining the database. The result can be good or bad, but nevertheless, it's critical to an accurate simulation. The test simulation retrained immediately when an error occurred, not allowing the error to propagate or affect any other decisions. This most likely resulted in inaccurate results - effects that are especially difficult to spot where heuristic functions and statistical functions were used together.
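A simulator could model that delay with something like the following sketch (a hypothetical train-on-error harness; the filter is passed in as two callables, for example the spamminess and train methods of the toy filter sketched earlier), where corrections only land when the simulated user next checks mail:

    from collections import deque

    def simulate_with_delay(classify, train, messages, check_every=10):
        """Replay `messages` (a list of (tokens, is_spam) pairs) against a filter
        exposed as two callables. Misclassifications are NOT retrained on the
        spot; they wait in a queue until the simulated user 'checks mail' every
        `check_every` messages, so each error keeps influencing the verdicts in
        between -- the snowball that immediate retraining hides."""
        pending = deque()
        errors = 0
        for i, (tokens, is_spam) in enumerate(messages, 1):
            if classify(tokens) != is_spam:
                errors += 1
                pending.append((tokens, is_spam))   # user hasn't noticed yet
            if i % check_every == 0:                # user reads mail, fixes mistakes
                while pending:
                    train(*pending.popleft())       # late corrective training
        return errors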

The tests performed in Cormack and Lynam's paper weren't a true simulation. Only some of the criteria I listed above were fulfilled. While the original message ordering and possibly the interleave were preserved, no purging was performed and no training delays were used. What's more, there was no archive window, because no initial training was performed before taking measurements, which grossly threw off the results of the testing.

A Closed Test

The scientific process demands peer review. Cormack has refused to make his test code (even without the mail corpus), his configuration log, or other notes available. This makes the test very hard to trust, as nobody is able to really look inside and validate his work or find bugs in his code. His tests assume that he hasn't made any errors in implementation, which is most likely incorrect. In order for any scientific test to be valid, it must be reviewed by an independent party (or many parties). If these tests had been made public, it wouldn't surprise me to see much more public support and possibly even some contributions by developers (including myself) to make the tests better.

I am very suspicious of any closed test, regardless of the results. Since the filter authors were not directly involved in these tests, their reliability is limited to the extent that they were implemented correctly. I'm afraid I can't give any credibility to a test that cannot be reviewed.

Old Versions of Software Were Used

If you're going to conduct tests, for the sake of science use recent versions of the software. It was reported that version 2.8.3 of DSPAM was used - v2.8 is well over two major production releases behind and six months old! In fact, version 2.10 had been released for a month when I was originally contacted by Cormack in April, but he appears to have still been using a version from January (I'm only finding this out now). 3.0.0 had also been under development for about three months prior to its release, with public betas available. As I understand it, older versions of other software were also used, such as Bogofilter (0.17).

At the very best, this test shows us the state of spam filtering from early releases of these tools, which are more than six months old - even had this test been conducted without errors, it was already obsolete as of its publication. At least, this is based on the fact that he was using software six months old (from January '04 - March '04). His article claims the corpus started in August 2003, but that doesn't make much sense, as that would put the date of his test at around March or April '04, which would of course mean the software he was using was already outdated. So either he used software much older than what was available (even in production) or his numbers are wrong.

If this is the case, then the software pre-dated the spam that was being used to measure the filter. This is a big no-no in testing. If you test 2004 safety features, you don't test with 2003 vehicles, unless you are specifically testing for the effectiveness of _older_ models in comparison to newer ones (and they weren't). In this test, the versions of software used should be just as recent as the mail archived for testing. While statistical filters are excellent at learning new types of spam, many unrelated tactics also affect the filter software, such as new encoding tricks and such - things that require minor tweaking of the software.

The Test Subject was Anything but Typical

The introduction to the research paper makes the following statement:

"While our study is limited to the extent that X's email is typical"

Yet later on in the paper we read that X's email consisted of over 49,000 emails over a period of eight months. The paper also makes this statement about X:

X has had the same userid and domain name for 20 years. Variants of X's email addresses have appeared on the Web, and in newsgroups. X has accounts on several machines which are forwarded to a common spool file, where they are stored permanently in the order received.

This seems very atypical, which greatly limits the usefulness of this study. The test subject does not represent typical email behavior, except among the most hardcore geeks. Even then, typical hardcore geeks will adjust this behavior in an attempt to curb spam. The typical technical user (someone who makes his living online) will have the same email address for perhaps five or more years, and the typical non-technical user (a majority of the users on the Internet, lest we forget) will change email addresses every couple of years. In either case, most users use one or two variants at the most. A good test should have included independent tests with corpora from 10-15 different test subjects, from all walks of life - geek, doctor, etc. Since X's email isn't available for examination, we can only draw some assumptions, which make a strong case that the test subject was not typical and may have helped produce skewed results:

  • Due to X's high volume of traffic and the fact that X's email addresses were available to harvest bots on the Web and in newsgroups for 20 years, it is no surprise that X has an abnormally high spam ratio, 81.6%. The typical user has a spam ratio of perhaps 60%, with 80% being very high (including geeks who have had the same email address for years). With an abnormally high spam ratio, many spam filters are likely to perform at less-than-optimal levels without basic tuning. This is for two reasons. First, an overabundance of spam can seed the filter's wordlist with tokens that would otherwise be considered legitimate, but because the user only receives a small percentage of legitimate mail, these words become "flooded", even in a Train-on-Error situation. This can leave a wordlist with an underabundance of legitimate tokens due to this flooding. Secondly, the algorithm used to calculate token value in most statistical filtering approaches relies to some degree on compensating for an unbalanced corpus (see the sketch following this list). Having an unusual ratio of spam (+80%) can cause some filters to overcompensate (unless tuned properly) and result in less than optimal levels of accuracy. The same is true in the other direction as well - an overabundance of legitimate mail with very few spams will result in a significant number of spam misses due to these algorithms overcompensating. In practice, users with a massively unbalanced corpus of mail, such as the test subject, would need to perform some additional tuning and possibly corpusfeeding in order to achieve optimal results.

  • The test subject used many different variants of email addresses, which provided many different variants of header information to analyze, and possibly a very unbalanced set of data. For example, if X had 20 email addresses but only used 4 of them for day-to-day legitimate mail, then a statistical spam filter would learn that any mail addressed to the other 16 is most likely spam - providing an unbalanced set of header information to analyze. For the occasional (low-traffic) legitimate messages filtering into one of these 16 boxes, this is a death sentence. Typical users do not have more than a few variants, and typically only ones they are using to receive legitimate mail. In fact, if a user is experiencing poor levels of accuracy, one of the first things I ask them is whether they have a bunch of unused email addresses active - whenever they do, and we turn those addresses off, accuracy improves.

  • With over 49,000 messages in eight months, it is reasonable to say that this user was very active on email. The fact that he/she was on newsgroups and the web suggests an overly diverse email behavior, which requires at least a few different options to be selected for that user. TOE mode is terrible at identifying new kinds of email behavior - they should have used Train-Everything or Train-until-Mature for all filters. Cormack makes the claim that DSPAM and CRM114 don't support Train-Everything mode, but it is in fact the default for DSPAM, and is supported by CRM114. For DSPAM v3.0.0, I would have recommended trying both Train-Everything and Train-until-Mature.
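To illustrate the compensation point above with purely made-up numbers (not X's actual data), here is the same simplified Graham-style token probability from the earlier sketch, applied to an 81.6%-spam mailbox. The per-corpus normalization compensates for the imbalance, but with so little ham each individual ham hit carries a lot of weight, which is where tuning comes in:

    def token_probability(spam_hits, ham_hits, spam_msgs, ham_msgs):
        """Simplified Graham-style formula: per-corpus frequencies compensate
        for an unbalanced training corpus (illustrative only)."""
        if spam_hits + ham_hits == 0:
            return 0.5
        spam_freq = spam_hits / max(spam_msgs, 1)
        ham_freq = ham_hits / max(ham_msgs, 1)
        return spam_freq / (spam_freq + ham_freq)

    # Raw counts alone would call this token 80% spammy (8 spam hits vs 2 ham hits):
    print(8 / (8 + 2))                                   # 0.80

    # Normalized for an 81.6% spam ratio (816 spams, 184 hams), it is roughly neutral:
    print(round(token_probability(8, 2, 816, 184), 2))   # ~0.47

    # But with so little ham, one hit either way swings the value hard --
    # which is why a heavily unbalanced corpus usually needs extra tuning:
    print(round(token_probability(8, 1, 816, 184), 2))   # ~0.64
    print(round(token_probability(8, 3, 816, 184), 2))   # ~0.38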

The Accuracy of the Test Subject's Corpus is Questionable

The research paper claims that the mail corpus was run against SpamAssassin for classification and compared to existing results. Any discrepancies were resolved by a human. This fails to account for two issues, which call into question just how accurate the corpus was in the first place:
  • It appears from the wording of the paper that messages which SpamAssassin believed were correct in both runs were accepted without human review. The fact that the same program was used to determine the results of the corpus suggests it is extremely likely that both versions of the software could make the same mistakes repeatedly, as is the problem with monocultural spam filtering. This would go unchecked. SpamAssassin, with learning turned off, is advertised to be only as accurate as 95% (making 1 error for every 20 messages). Instead, what the testers should have done is run a different spam filter, or perhaps two or three other filters on the corpus to determine conflicts requiring human attention. You can't use a less accurate tool to prepare a test for a more accurate tool!

    NOTE: The paper does claim that discrepancies noted during the tests were examined, but this appears only to be an afterthought. The gold standard itself was described as being initially set between two copies of SpamAssassin and the user. Lynam claims that tests were re-run if an error was found, but it doesn't seem as though the testers would have been looking for errors during this phase, as the gold standard had already been established. Ideally, all of this should have been a part of defining the gold standard in the first place.

  • All conflicts were resolved by a human (presumably the test subject) after the fact; i.e. they were not resolved during the eight-month period in which mail was collected. The fact that there were any conflicts in the first place proves that the test subject was not very accurate at manually classifying their own mail - if they were, there would not have been any conflicts. The accuracy of the test subject would obviously diminish eight months after the fact, so if they weren't very accurate to begin with, there were most definitely many human errors made when conflicts were resolved. Bill Yerazunis did a study of human accuracy and classified his mail several times by hand. He came to the conclusion that humans were typically only 99.84% accurate. If the corpus was, at the most, 99.84% accurate (and this is generous, given that it was classified eight months later), then any tools which were more accurate would spot these errors and appear to be making errors themselves.

The Corpus was Classified by SpamAssassin, for SpamAssassin

SpamAssassin is immediately eliminated from the credibility of these results because the test corpus was classified by SpamAssassin (twice) and the test was ultimately a product of SpamAssassin's decisions. Everyone knows that computers do exactly what they're told to do. Even if SpamAssassin made 1,000 errors, it would most likely make some of those errors again, even with a learning piece enabled. Regardless of the accuracy of the corpus, SpamAssassin was tailored specifically to act as the referee for the mail corpus being used, and therefore will obviously provide the desired results (or similar, with learning enabled). If you use a tool that is only 95% accurate to prepare a test for tools that are 99.5% accurate, then wherever the gold standard is wrong and the better tools are correct, the better tools will be charged with errors - making the lesser tool appear to outperform them. This could have been avoided had many filters been used to classify the corpus, or had the test been limited to a manageable number of emails.
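A back-of-the-envelope sketch of that effect (illustrative numbers only; it generously assumes the two tools' mistakes are independent and that human review catches none of them):

    messages = 49_000            # roughly the size of the corpus in question
    referee_error_rate = 0.05    # a 95%-accurate tool prepared the gold standard
    filter_error_rate = 0.005    # a 99.5%-accurate filter being measured

    # Wherever the gold standard is wrong and the filter under test is right,
    # the filter's correct verdict gets scored against it as an error.
    phantom_errors = messages * referee_error_rate * (1 - filter_error_rate)
    real_errors = messages * filter_error_rate

    print(round(phantom_errors))  # ~2438 errors charged to the filter unfairly
    print(round(real_errors))     # ~245 errors the filter actually made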

Furthermore, heuristic functions were designed to specifically detect the spams used. The emails being 8 months old, heuristic rules were clearly updated during this time to detect spams from the past eight months. The tests perform no analysis of how well SpamAssassin would do up against emails received the next day, or the next eight months. Essentially, by the time the tests were performed, SpamAssassin had already been told (by a programmer) to watch for these spams. A fair test should have used a set of heuristic rules eight months older than the corpus of mail or more (note, I didn't say software 8 months old). In all likelihood, many of the spams collected at the beginning of the test had already had rules specifically coded for them, so it's possible going back 12 months might be necessary to remove the human programming. What good is a test to detect spam filter accuracy when the filter has clearly been programmed to detect its test set?

Pretraining Existed for Some Tests, Not Others

Of course, this raises the issue that SpamAssassin was the equivalent of a pre-trained filter, while all the other filters were not trained. A significant amount of pre-intelligence was embedded into SpamAssassin prior to testing (rules written by a human specifically to detect these spams). If we are to measure pre-trained filters, they should be pitted against other pre-trained filters. The testers argue that the entire process was graphed. Unfortunately, this is not sufficient. All of the filters were measured from their starting point - of which SpamAssassin was given the obvious advantage by being pre-trained. Results should not have been measured until each filter was sufficiently trained as well. Also, pre-training is very different from learning. When filters learn, they learn differently than they do when they pre-train. For example, SpamProbe and DSPAM perform test-conditional (or iterative) training, which re-trains certain tokens until the erroneous condition is no longer met. When these filters are pretrained, however, the tokens are trained only once for each message. This leaves the dataset in a very different state.

This seems to be part of the failure of this test. Many filters have an initial training cycle in which many features are disabled. DSPAM specifically disables many advanced features, such as Bayesian Noise Reduction, until 2,500 innocent messages are learned - it likes to play it very safe unless told otherwise by the user (who will usually wait, turn the knob, or train a corpus of mail). More important to the DSPAM results is an algorithm called statistical sedation, a tunable feature that waters down filtering until the training cycle is complete - in order to prevent false positives. Users who would like better accuracy on day 1 can turn this knob in one direction. It doesn't appear this feature was disabled in the tests (which would obviously explain his weird regression curve for DSPAM), nor does it appear that any acceptable level of training was performed before measurements were taken. This is probably what resulted in the mediocre results of many filters.

Bill Yerazunis had the excellent idea of performing two tests: the first test measures the ability to detect spam, but the second test flips the corpus around (what was spam is now ham, what was ham is now spam), and the filter is then instructed to detect spam again. The _worse_ of the two results is the one to be used. This would remove any pre-intelligence programmed into any filter and measure filters purely on their ability to detect what the user tells them is spam. Unfortunately, it doesn't look like this will be incorporated into the testing.
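A sketch of how such a flip-test harness might look (hypothetical code reusing the toy filter interface from the first sketch; Yerazunis' actual procedure may differ in its details):

    def run_trial(make_filter, messages):
        """Train-on-error pass over (tokens, is_spam) pairs; returns accuracy."""
        f = make_filter()
        correct = 0
        for tokens, is_spam in messages:
            if (f.spamminess(tokens) > 0.0) == is_spam:
                correct += 1
            else:
                f.train(tokens, is_spam)
        return correct / len(messages)

    def flip_test(make_filter, messages):
        """Yerazunis-style symmetry check: score the corpus as-is, then with
        every label inverted, and report the worse of the two accuracies, so
        built-in knowledge of what 'spam' looks like can't carry a filter."""
        normal = run_trial(make_filter, messages)
        flipped = run_trial(make_filter, [(t, not s) for t, s in messages])
        return min(normal, flipped)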

Closed Test + No Filter Author Involvement = Many Potential Misconfigurations

Mr. Cormack approached me some months ago wanting to perform some of these tests for a paper he was writing. Understandably, Mr. Cormack was very frustrated about not being able to achieve any reasonable levels of filter accuracy from his filters, including DSPAM. It turned out that he was using the wrong flags, didn't understand how to train correctly, and seemed very reluctant to fully read the documentation. I don't mean to ride on Cormack, but proper testing requires a significant amount of research, and research seems to be the one thing lacking from this research paper.

Another thing that concerns me is the level of experience Mr. Cormack has with statistical filters. Cormack argues that he's used SpamAssassin and Mozilla, with a little bit of experience on some others...but this doesn't seem like sufficient experience with a command-line, server-side _pure statistical_ filter to me - at least not with ones like those he measured, which require a much different level of experience than those tools. It does concern me that it may have been out of frustration that Cormack decided to test with a six-month-old version of DSPAM (instead of the version available when he contacted me, only one month old) and used several flaming words such as "inferior" in his paper to describe the software.

I'm not trying to slander Mr. Cormack, but I think it's important to note that a closed test without enough experience isn't going to yield the desired results. This is why an open test is so important, as well as review. In my opinion, I do not believe Cormack has made the effort to become as knowledgeable in this area as he needs to be to run tests.

As with many rushed tests, the testers don't always find the time to become intimate with the tools they're testing. In this case, it was very obvious to me when originally speaking with Cormack that he was using the software incorrectly, and from his research paper it appears that the documentation still had not been adequately consulted by the time of the test.

Page 30 of the research paper states that DSPAM doesn't support Train-Everything mode, and that training was performed using Train-on-Error. Train-Everything mode was the first mode available in DSPAM, and Train-on-Error was only coded into the software as of version 2.10, so the test had to be using Train-Everything without knowing it, and treating it like TOE. (NOTE: I've been informed that Cormack will be correcting his paper to reflect this.) Many other incorrect statements about the different filters used suggest to me that the testers still didn't understand the filters they'd been testing.

One of my last exchanges with Cormack before his testing involved his approach to training. It appeared as though errors were not being retrained correctly, which I am confident made a significant contribution to his poor results with DSPAM. Instead of presenting a message as an error, it was submitted as a corpusfed spam. This would have learned the tokens as spam, but not un-learned the erroneous innocent hits on each token - so the message became learned as both ham and spam. The tests also failed to present the outputted message, presenting the original message for retraining instead. Unless specifically configured to do so (his copy was not), DSPAM looks for an embedded "watermark" it has added to each email it processes. This watermark provides a serial number referencing the original training data. When it cannot be found, only the message body is retrained (i.e. it tries to do its best assuming the user forwarded in the spam, and so you'd have their headers instead). By providing the original message for retraining, and not the DSPAM-processed message (with watermark), DSPAM was (at the time at least) training only half of each message - leaving the headers without retraining. I provided him with this information, but I'm not entirely certain that sufficient corrections were made prior to testing. This is understandably confusing, and was ambiguous enough of a "feature" that it was removed in v3.0.0, but because the documentation was not followed, it most likely caused unpredictable results in testing and practice.

In fact, we really don't know how these tools were configured or what backends they used. DSPAM supports six different possible back-ends (some of which are beta, and some of which are unsupported), as well as three different training modes (TOE, TEFT, and TUM). DSPAM also supports Graham-Bayesian, Burton-Bayesian, Geometric Mean, and Robinson's Chi-Square algorithms. On top of this, there are two different algorithms for computing token value, and plenty of other knobs. We have no idea how each tool was specifically configured nor did anyone involved in testing appear to post configurations or specific details about their testing approach.

The Tests Lack Real-World Validation

The tests don't come close to reflecting the real-world levels of accuracy experienced by users of many of these filters. CRM114 users experience typical levels of accuracy surpassing 99.95%, yet the tests show otherwise. The same is true for many other filters. Even the ones that were rated fairly did not reflect the real-world accuracy I hear about. When the results of a test don't even come close to human experience, the tests are possibly erroneous and need to be analyzed, not published. If the tests are reviewed by one or two independent parties, retried, and nothing can be found wrong with them, THEN publish them - even if they go against the accepted performance...but that's not the case here. No retrials were performed, no independent party confirmed the validity of these tests, and as a result we ended up with some very oddball results. Please note again, CRM114 is not my tool - I'm not affiliated with it in any way except that I hold it in very high regard, having seen how it functions mathematically, and am quite certain of its mathematical superiority in both theory and practice. I suspect these tests may at one point have been measured against Cormack's own experience with statistical filtering, which, as he informed me, was very poor - and, having personally analyzed his configuration, I believe that was due primarily to poor implementation on his part.

Conclusions

Many technical errors have been made in this test, so many in fact that the test is beyond recovery in my opinion. I believe a new test is in order - one that corrects the deficiencies outlined in this article, and more importantly an open test that the filter authors can be involved in. As a result of the many technical deficiencies and the general mystery behind how these tests were really performed, I do not believe these tests to be credible - especially not credible enough to appear in any journal.

Mr. Lynam seems much more interested in the scientific process and far less argumentative. I would be interested in seeing him pair up with a different party to conduct a newer, better test.

I sincerely hope these errors are considered and corrected in their testing. I am confident that they will then find tools such as CRM114 and DSPAM to be as extremely accurate as their loyal users are finding them. I'm also confident that every last one of the statistical filters measured will prove superior.
