Response to Gordon Cormack's Study of Spam Detection
Last Update: Saturday, June 26 2004, 22:00
Jonathan A. Zdziarski
jonathan@nuclearelephant.com
Introduction
Many misled CS students, Ph.Ds, and professionals have jumped on the
spam filtering bandwagon with the uncontrollable urge to
perform misguided tests in order to grab a piece of the interest surrounding
this area of technology. As quickly as Bayesian
filtering (and other statistical approaches) have popped up, equal levels of
interest arose among many major groups (academic, press, and sysadmins to name a few). With regret, much of the testing published on the Internet
thus far has been not much more than zeal without wisdom. This is due to
the fact that
the fact that
the technology is still being held fairly close to the vests of those
implementing it (a result of its freshness and artistic complexity). This is
not necessarily the fault of the testers; statistical filtering has grown to
become much more than "A Plan for Spam", but unfortunately there is little
useful documentation on the actual implementation and theory behind the
latest improvements (one of the reasons I have written a book on the
subject, scheduled for December 2004).
This article is a response to a research paper by Gordon Cormack and
Thomas Lynam entitled
"A Study of Supervised Spam Detection applied to Eight Months of
Personal E-Mail". This paper was recently featured on Slashdot,
which has unfortunately resulted in a large swarm of geeks with wrong
information drawing wrong conclusions about statistical filtering.
It is not my desire to flame the test or the testers, but there are many
errors I believe need to be brought to light.
The testing fails in many ways, and has unfortunately marred many superior
statistical filters, such as CRM114 (which has proved statistical
superiority time and time again). CRM114 isn't my puppy, but I
do believe it to be one of, if not the, most accurate filters in the
world. I
haven't tried to get into very deep philosophical problems with the testing,
although there are some, but I've tried instead to provide a list of reasons
the testing was performed incorrectly. Perhaps after reading this,
the researchers might make another attempt to run a successful test. Until then, I'm afraid
these results are less than credible.
Don't get me wrong, I'm glad to see that all the spam filters tested did
very well. When we're measuring hundredths of a percent of accuracy, though,
good enough doesn't cut it. The intricacies of testing can easily throw
the results off a point or two which makes a considerable impact on the
results. All of these filters do an excellent job at filtering spam, as proved
by this test and others,
but the test failed to conduct an adequate comparison due to many flaws.
Paul Graham has spoken of conducting a bake-off at the next MIT Spam
Conference. Hopefully, these tests provided some "gotchas" to watch for.
The Challenge of Testing
Statistical filtering is considered by most as the next generation technology
of spam filtering. Statistical filtering is dynamic, in that it learns from
its mistakes and performs better with each correction. It has the unique
ability to learn to detect new types of spams without any human intervention.
This has all but obsoleted the many heuristic filters once considered
mainstream, and even tools such as SpamAssassin have incorporated statistical
components into their filters.
Statistical language analysis is unlike any other spam filtering approach we've seen, and because of this it tests differently than any other beast. The testing approaches used to measure heuristic spam filters are frequently and erroneously applied to statistical filters, resulting in poor testing results. The intricacies of machine learning require a more scientific approach than simply throwing mail at the filter, and even the most detailed approaches to testing such a tool only barely succeed in accomplishing a real-world simulation.
Modern day language classifiers face a very unique situation - they learn based on the environment around them. The problem is therefore one of extremely controlled environment. When heuristic filtering was popular, there were many different ways to test it. Since the filter didn't base its decisions on the previous results, an accurate set of results would accommodate just about any type of testing approach used.
The state of a statistical language classifier is similar to that of a sequential circuit in that the output is a combination of both the inputs and the previous state of the filter. The previous state of the filter is based on a previous set of inputs, which are based on a previous set of results, and so on.
Think of controlled environment in terms of going to the supermarket every week; what you buy from visit to visit is based on what you have in your refrigerator. A single change in the environment (milk going sour) can easily snowball to affect the results of a filter by many messages, and change your milk
buying patterns for many weeks. With this in mind, the challenge of testing is to create an environment that simulates real-world behavior as closely as possible - after all, the accuracy we are trying to measure is how the filter will work in the real world. It suffices to say that testing a statistical filter is no longer a matter of testing, but one of simulation. Simulating real-world behavior takes many factors into consideration that obsolete heuristic testing doesn't.
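To make this concrete, here is a minimal toy sketch (not any particular filter's implementation) of a token-count classifier whose every verdict depends on the state left behind by earlier training decisions:

    # A toy classifier: the "state" is the accumulated token counts, so each
    # verdict is a function of the current message plus every prior decision.
    from collections import defaultdict

    class TinyFilter:
        def __init__(self):
            self.spam_counts = defaultdict(int)   # previous "state"
            self.ham_counts = defaultdict(int)

        def train(self, tokens, is_spam):
            counts = self.spam_counts if is_spam else self.ham_counts
            for t in tokens:
                counts[t] += 1

        def score(self, tokens):
            # Crude spamminess: fraction of tokens seen more often in spam.
            spammy = sum(1 for t in tokens
                         if self.spam_counts[t] > self.ham_counts[t])
            return spammy / max(len(tokens), 1)

    f = TinyFilter()
    f.train("cheap meds online".split(), is_spam=True)
    f.train("meeting notes attached".split(), is_spam=False)
    print(f.score("cheap meds attached".split()))  # depends on all prior training

Shuffle the training order or skip a correction and the same message can score differently, which is exactly why the test environment has to be controlled.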
Of course, when you're not testing to measure accuracy, this type of simulation isn't always necessary. Chaos in message ordering and content may be appropriate when testing to compare features for a particular filter, or any other kind of blind test where accuracy isn't as important as deviation, but that's not
what this paper was testing. Some of these requirements were relayed to Cormack,
and he did a fair job of implementing some of them. Others, he wasn't
so lucky with.
Message Continuity
Message continuity refers to the uninterrupted threads and their message content, dealing specifically with the set of test messages used. Threading is
important to a statistical filter as is message ordering. Statistical filters
learn new data incrementally as it is mixed with already known data. As
email evolves (spam and legitimate mail alike), characteristics slowly change.
If the messages are presented out of order, incremental learning breaks. This
does not appear to have been a problem with the tests, as Cormack claimed to
track the original ordering.
Archive Window
As we've learned through Terry Sullivan's research at the MIT Spam Conference
in 2004, spam evolves on the order of months. Other tests confirm this and show us that the seas change every 4 to 6 months on average. It's important to have
a concurrent archive of mail if we're going to use one archive for training.
This test used incremental learning without a training corpus, and so this too
was not a significant issue - although I do believe his method of training
in general was flawed by not using an archive window for pretraining
(discussed later).
Purge Simulation
An area frequently disregarded in a statistical learning simulation is the purging of stale data. When the training corpus is learned, each message is trained within the same short period of time (usually a period of several minutes or hours). The usual method of purging that a particular filter might employ doesn't take place because all of the data trained is considered new. Purging is important
because older data can affect the polarity of newer data. For example, if
tokens that haven't been seen in four months reflect a spammy polarity,
then are purged, and a couple new messages come in using those tokens in
a legitimate way, then purging will allow the tokens to take on their
most recent polarity. Without purging, the tokens would become fairly
neutral and be eliminated from computation, ultimately affecting accuracy.
Another area purging affects (which directly affects accuracy) is the amount
of data required to migrate the polarity of tokens in the dataset for
training or retraining. If an old record exists for a particular token with
100 data points, that record will take much more time to change polarity than
a fresh record or one with few datapoints. By not purging the old data,
it not only lingers but it causes the tokens to "stick" much more.
This test did not include any purge simulation at all, leaving some 49,000
messages trained as a composite in the wordlist.
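As an illustration only, here is a rough sketch of the kind of staleness purge a realistic simulation would have to model; the record layout and the 120-day window are hypothetical, not any particular filter's policy:

    import time

    SECONDS_PER_DAY = 86400

    def purge_stale_tokens(token_db, now, max_age_days=120):
        # Drop records not hit within the window, so a token that reappears
        # later takes on its most recent polarity instead of being diluted
        # by months-old counts.
        cutoff = now - max_age_days * SECONDS_PER_DAY
        return {tok: rec for tok, rec in token_db.items()
                if rec["last_hit"] >= cutoff}

    now = time.time()
    db = {
        "mortgage": {"spam_hits": 40, "ham_hits": 1, "last_hit": now - 150 * SECONDS_PER_DAY},
        "invoice":  {"spam_hits": 3, "ham_hits": 25, "last_hit": now - 2 * SECONDS_PER_DAY},
    }
    print(sorted(purge_stale_tokens(db, now)))  # ['invoice'] - the stale record is gone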
Interleave
The interleave at which messages from the corpus are trained, corrected, and classified can play a dramatic role in the results of the test. Many tests are erroneously performed by feeding in two separate corpora - one of legitimate mail and one of spam. Some tests use a 1:1 interleave, while others try their best to simulate a real-world scenario. The original ordering of the messages in the corpus will generally yield the most realistic results. Cormack claims to have
preserved the message ordering, and Lynam confirmed recently that the spam
and ham was kept in the same file, so it looks like interleave was
preserved in this test.
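For illustration, a small sketch of the difference between a naive 1:1 interleave and replaying the corpus in its original received order (the message tuples are made up):

    from itertools import chain, zip_longest

    ham  = [("2003-08-01 09:00", "ham1"), ("2003-08-01 09:05", "ham2")]
    spam = [("2003-08-01 03:00", "spam1"), ("2003-08-01 21:00", "spam2")]

    # Erroneous: alternate ham/spam regardless of when they actually arrived.
    one_to_one = [m for pair in zip_longest(ham, spam) for m in pair if m]

    # Realistic: replay every message in the order the user received it.
    original_order = sorted(chain(ham, spam), key=lambda m: m[0])

    print([m[1] for m in one_to_one])      # ham1, spam1, ham2, spam2
    print([m[1] for m in original_order])  # spam1, ham1, ham2, spam2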
Corrective Training Delay
The delay in retraining classification errors is probably one of the most difficult characteristics to simulate. When a misclassification occurs, the user doesn't immediately report it - several other messages are likely to come in before the user checks their email and corrects the error. What's more, submitting
an error changes data - which could cause more errors in some cases.
Delay creates either a snowballing effect or a delay in mistraining the database. The result can be good or bad,
but nevertheless, it's critical to an accurate simulation. The test simulation
retrained immediately when
an error occurred, not allowing the error to propagate or affect any other
decisions. This most likely resulted in inaccurate results - especially
difficult to spot where heuristic functions and statistical functions were used
together.
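A rough sketch of how a simulation might model that delay, assuming a filter object with the train()/score() interface of the toy sketch earlier (the five-message delay is an arbitrary assumption):

    def run_simulation(filter_, messages, check_every=5):
        # messages: list of (tokens, is_spam) in original received order.
        pending_errors = []
        mistakes = 0
        for i, (tokens, is_spam) in enumerate(messages, 1):
            verdict = filter_.score(tokens) > 0.5
            if verdict != is_spam:
                mistakes += 1
                pending_errors.append((tokens, is_spam))  # user hasn't seen it yet
            if i % check_every == 0:                      # user finally checks mail
                for tok, label in pending_errors:
                    filter_.train(tok, label)             # corrections arrive late
                pending_errors.clear()
        return mistakes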
General Grievances
The tests performed in Cormack and Lynam's paper weren't a true
simulation. Only some
of the criteria I listed above were fulfilled. While the original message
ordering and possibly the interleave were preserved, no purging was performed
and no training delays were used. What's more, there was no archive window,
because no initial training was performed before taking
measurement.
Statistical filters know nothing until you train them. Therefore, if you're
going to measure their accuracy, you need to train them first. If you start
measuring before you've taught the filter anything, then you're going to
end up with some pretty mediocre results. Many other issues discredit the
findings of this test as well; I've outlined them below.
A Closed Test
The scientific process demands peer review. Cormack has refused to make his
test code (even without the mail corpus), his configuration log, or other
notes available. This makes the test very hard to trust, as nobody is
able to really look inside and validate his work, or find bugs in his
code. His tests assume that he hasn't made any errors in implementation, which
is most likely very incorrect. In order for any scientific test to be valid, it
must be reviewed by an independent party (or many parties). If these tests had
been made public, it wouldn't surprise me to see much more public support
and possibly even some contributions by developers (including myself)
to make the tests better.
I am very suspicious of any closed test, regardless of the results. Since the
filter authors were not directly involved in these tests, their reliability is
limited to the extent that they were implemented correctly. I'm afraid I can't
give any credibility to a test that cannot be reviewed.
Old Versions of Software Were Used
If you're going to conduct tests, for the sake of science use recent versions of the
software. It was reported that version 2.8.3 of DSPAM was used - v2.8 is
well over two major production releases and six months old! In fact, version
2.10 had been released for a month when I was originally contacted by Cormack in
April, but he appeared to still be using a version from January (I'm only
finding this out now).
In fact,
3.0.0 had also been under development
for about three months prior to its release, with public betas available. As I understand it,
older versions of other software were also used, such as Bogofilter (0.17).
At the very best, this test shows us the state of spam filtering from
early releases of these tools, which are more than six months old, meaning
we're dealing with spam as old as 14 months -
even had this test been
conducted without errors, it is already obsolete as of its publication.
At least this is based on the fact that he was using software six months old
(from January '04 - Mar '04). His article claims the corpus started in August
2003,
but that doesn't make much sense as that would put the date of his test at
around March or April '04, which of course outdates the software he was
using. So either he used software much older than what was available
(even in production) or his numbers are wrong.
If this is the case, then the software pre-dated the spam that was being used
to measure the filter. This is a big no-no in testing. If you test 2004
safety features, you don't test with 2003 vehicles, unless you are specifically
testing for the effectiveness of _older_ models in comparison to newer ones
(and they weren't). In this
test, the versions of software used should be just as recent as the mail
archived for testing. While statistical filters are excellent at learning new
types of spam, many unrelated tactics also affect the filter software, such as
new encoding tricks and such - things that require minor tweaking of the
software.
The Test Subject was Anything but Typical
The introduction to the research paper makes the following statement:
"While our study is limited to the extent that X's email is typical"
Yet later on in the paper we read that X's email consisted of over 49,000
emails over a period of eight months. The paper also makes this statement
about X:
X has had the same userid and domain name for 20 years. Variants of X's
email addresses have appeared on the Web, and in newsgroups. X has
accounts on several machines which are forwarded to a common spool file,
where they are stored permanently in the order received.
This seems very atypical, which greatly limits the usefulness of this
study. The test subject does not represent typical
email behavior, except among the most hardcore
geeks. Even still, typical hardcore geeks will adjust this behavior in an
attempt to curb spam. The typical technical user (someone who makes his
living online) will have the same email address for perhaps five or more years,
and the typical non-technical user (a majority of the users on the Internet,
lest we forget) will change email addresses every couple of years.
In either case, most sane users use one or two variants at the most. 49,000
emails in eight months is
also absurd. A good test should have included independent tests with
corpora from 10-15 different
test subjects, from all walks of life - geek, doctor, etc. Since X's email isn't available for examination, we can only draw
some assumptions, which make for a strong case that the test subject was
not typical and may have helped provide skewed results:
- Due to X's
extremely high volume of traffic and the fact that X's email addresses were available to harvest bots on the Web and in
newsgroups for 20 years, it is no surprise that
X has an abnormally high spam ratio, 81.6%. The
typical user has a spam ratio of perhaps 60% with 80% being very high
(including geeks who have
had the same email address for years). Having an abnormally high
spam ratio, many spam filters are likely to perform at less-than-optimal
levels without basic tuning. This is for two reasons. First, an overabundance of spam can seed
the filter's wordlist with tokens that would otherwise be considered legitimate,
but because the user only receives a small percentage of legitimate mail,
these words become "flooded", even in a Train-on-Error situation.
This can leave a wordlist with an
underabundance of legitimate tokens due to this flooding. Secondly, the algorithm used to calculate
token value in most statistical filtering approaches relies to some degree on
compensating for an unbalanced corpus (see the sketch after this list). Having an unusual ratio of spam
(+80%) can cause some filters to overcompensate (unless tuned properly) and
result in less than optimal levels of accuracy. The same is true in the
other direction as well - an overabundance of legitimate mail with very
few spams will result in a significant number of spam misses due to these
algorithms overcompensating. In practice, users with a massively unbalanced
corpus of mail, such as the test subject, would need to perform some
additional tuning and possibly corpusfeeding in order to achieve optimal
results.
- The test subject used many different variants of email addresses, which provided many
different variants of header information to analyze, and possibly a very
unbalanced set of data. For example, if X had 20 email
addresses, but only used 4 of them for day-to-day legitimate mail, then that means
a statistical spam filter would learn that any mail addressed to the other 16
would most likely be spam - providing an unbalanced set of header information
to analyze. For the occasional (low-traffic) messages filtering into one
of these 16 boxes, this is a death sentence. Typical users do not have more
than a few variants, and typically only ones that they are using to receive
legitimate mail. In fact, if a user is experiencing poor levels of accuracy
one of the first things I ask them is if they have a bunch of unused email
addresses active - whenever one does, and we turn it off, accuracy
improves.
- With over 49,000 messages in eight months, it is reasonable to say that
this user was very active on email. The fact that he/she was on newsgroups
and the web suggests an overly diverse email behavior, which requires at least
a few different options to be selected for that user. TOE mode is terrible at
identifying new kinds of email behavior - the testers should have used
Train-Everything or Train-until-Mature for all filters. Cormack makes the
claim that DSPAM and CRM114 don't support Train-Everything mode, but it is
in fact the default for DSPAM, and is supported by CRM114. For DSPAM v3.0.0,
I would have recommended trying both Train-Everything and Train-until-Mature.
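As promised above, here is a sketch along the lines of Paul Graham's "A Plan for Spam" token probability, included only to show where the ham/spam message ratio enters the computation; the counts below are hypothetical:

    def token_probability(spam_hits, ham_hits, nspam, nham):
        good = 2 * ham_hits          # Graham doubles ham hits to bias against false positives
        bad = spam_hits
        if good + bad < 5:           # too little data to trust
            return 0.4
        p_bad = min(1.0, bad / nspam)
        p_good = min(1.0, good / nham)
        return max(0.01, min(0.99, p_bad / (p_bad + p_good)))

    # The same raw counts look spammy on a balanced corpus but hammy at an
    # 81.6% spam ratio - the overcompensation risk described above.
    print(token_probability(30, 10, nspam=1000, nham=1000))  # ~0.60
    print(token_probability(30, 10, nspam=8160, nham=1840))  # ~0.25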
The Accuracy of the Test Subject's Corpus is Questionable
The research paper claims that the mail corpus was run against SpamAssassin for
classification and compared to existing results. Any discrepancies were
resolved by a human. This fails to account for two issues, which question
just how accurate the corpus was in the first place:
- It appears from the wording of the paper that messages on which SpamAssassin
agreed with the existing classification in both runs were
accepted without human review. The fact that the same
program was used to determine the results of the corpus suggests it is
extremely likely
that both versions of the software could make the same mistakes repeatedly,
as is the problem with monocultural spam filtering. This would
go unchecked. SpamAssassin, with learning turned off, is advertised to be
only as accurate as 95% (making 1 error for every 20 messages). Instead, what
the testers should have done is run a different
spam filter, or perhaps two or three other filters, on the corpus to
determine conflicts requiring human attention (see the sketch after this
list). You can't use a less accurate tool to prepare a test for a more accurate tool!
NOTE: The paper does make claims that discrepancies noted during the tests
were examined, but this appears only to be an afterthought. The gold standard
itself was described as being initially set between two copies of SpamAssassin
and the user. Lynam claims that tests were re-run if an error was found, but
it doesn't seem as though the testers would have been looking for errors
during this phase, as the gold standard had already been established.
Ideally, all of this should've been a part of defining the gold standard
in the first place.
- All conflicts were resolved by a human (presumably the test subject) after
the fact; e.g. they were not resolved during the eight month period in which
mail was collected. The fact that there were any conflicts in the first
place proves that the test subject was not very accurate at manually
classifying their own mail - if they were, there would not be any conflicts.
The accuracy of the test subject would obviously diminish eight months after
the fact, so if they weren't very accurate to begin with, there were most
definitely many human errors made when conflicts were resolved. Bill Yerazunis
did a study of human accuracy and classified his mail several times by hand.
He came to the conclusion that humans were typically only 99.84% accurate.
If the corpus was, at the most, 99.84% accurate (and this is generous being
that it was classified eight months later) then that means that any tools which
were more accurate would spot these errors, and appear as erroring themselves.
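As promised above, a sketch of the kind of adjudication I'm suggesting: several independent classifiers vote, and anything they disagree on goes to a human rather than being accepted on one filter's say-so. The stand-in classifiers here are toys:

    def build_gold_standard(messages, provisional_labels, classifiers):
        gold, needs_human = [], []
        for msg, label in zip(messages, provisional_labels):
            votes = [classify(msg) for classify in classifiers]
            if all(v == label for v in votes):
                gold.append((msg, label))                # unanimous: accept
            else:
                needs_human.append((msg, label, votes))  # any conflict: review
        return gold, needs_human

    classifiers = [lambda m: "mortgage" in m, lambda m: m.count("!") > 3]
    gold, review = build_gold_standard(
        ["refinance your mortgage now!!!!", "lunch tomorrow?"], [True, True], classifiers)
    print(len(gold), len(review))  # 1 1 - the second label gets human review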
The Corpus was Classified by SpamAssassin, for SpamAssassin
SpamAssassin is immediately eliminated from the credibility of these results
because the test corpus was classified by SpamAssassin (twice) and
the test was ultimately a product of SpamAssassin's decisions. Everyone
knows that computers do
exactly what they're told to do. Even if SpamAssassin made 1,000 errors,
it would most likely make some of those errors again even with a learning
piece enabled. Regardless of the accuracy of the corpus, SpamAssassin was
tailored specifically to act as the referee for the mail corpus being used,
and therefore will obviously provide the desired results (or similar,
with learning enabled). If you use a tool that is only 95% accurate to
prepare a test for tools that are 99.5% accurate, the more accurate tools
will be scored as wrong whenever they correctly catch the lesser tool's
mistakes - making the lesser tool appear to outperform them. This could have been avoided had many filters
been used to classify the corpus, or had the test been limited to a manageable
number of emails.
Furthermore, heuristic functions were designed to specifically detect the
spams used. The emails being 8 months old, heuristic rules were clearly
updated during this time to detect spams from the past eight months. The tests
perform no
analysis of how well SpamAssassin would do up against emails received the next
day, or the next eight months. Essentially, by the time the tests were
performed, SpamAssassin had already been told (by a programmer) to watch for
these spams. A fair test should have used a set of heuristic rules eight
months older than the corpus of mail or more (note, I didn't say software 8 months old). In all likelihood, many of the spams collected at the beginning
of the test had already had rules specifically coded for them, so it's possible
going back 12 months might be necessary to remove the human programming.
What good is a test to
detect spam filter accuracy
when the filter has clearly been programmed to detect its test set?
Pretraining Existed for Some Tests, Not Others
Of course, this raises the issue that SpamAssassin was the equivalent of a
pre-trained filter, while all the other filters were not trained. A significant
amount of pre-intelligence was embedded into SpamAssassin prior to testing
(rules written by a human specifically to detect these spams). If
we are to measure pre-trained filters, they should be pitted against other
pre-trained filters. The testers argue that the entire process was graphed.
Unfortunately, this is not sufficient. All of the filters were measured from
their starting point - of which SpamAssassin was given the obvious advantage
by being pre-trained. Results should not have been measured until each filter
was sufficiently trained as well. Also, pre-training is very different
from learning. When filters learn, they learn differently than they do when
they pre-train. For example, SpamProbe and DSPAM perform test-conditional
(or iterative) training, which re-trains certain tokens until the erroneous
condition is no longer met. When these filters are pretrained, however, the
tokens are trained only once for each message. This leaves the dataset in
a very different state.
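A rough sketch of that difference, reusing the toy train()/score() interface from the earlier sketches (the retry cap is an arbitrary assumption):

    def train_test_conditional(filter_, tokens, is_spam, max_passes=5):
        # Iterative training: reinforce until the filter classifies correctly.
        for _ in range(max_passes):
            if (filter_.score(tokens) > 0.5) == is_spam:
                break                        # condition met, stop reinforcing
            filter_.train(tokens, is_spam)

    def pretrain_corpus(filter_, corpus):
        # Corpus pretraining: each message touches the tokens exactly once.
        for tokens, is_spam in corpus:
            filter_.train(tokens, is_spam)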
This seems to be part of the failure of this test. Many filters
have an initial training cycle in which many features are disabled. DSPAM
specifically disables many advanced features such as Bayesian Noise
Reduction until 2500 innocent messages are learned - it likes to
play it very safe unless told otherwise by the user (who will usually wait,
turn the knob, or train a corpus of mail). More important to the DSPAM results
is an algorithm called statistical sedation, which is a tunable feature that
waters down filtering until the training cycle is complete - in order to
prevent false positives. Users who would like better accuracy on
day 1 can turn this knob in one direction. It doesn't appear this feature
was disabled in the tests (which would obviously explain the weird
regression curve for DSPAM), nor does it appear that any
acceptable levels of training were performed before taking measurement.
This is probably what resulted in the mediocre results of many filters.
Bill Yerazunis had the excellent idea of performing two tests: the first
test measures the ability to detect spam, but then the second test would
flip around the corpus (what was spam is now ham, what was ham is now spam),
and the filter would then be instructed to detect spam again. The _worst_ of
the two tests is the result to be used. This would remove any pre-intelligence
programmed into any filter and measure each one based on its ability to detect
what the user tells it is spam. Unfortunately, it doesn't look like this
will be incorporated into the testing.
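For illustration, a sketch of that flip-corpus evaluation; make_filter and run_test are hypothetical stand-ins for a full test harness:

    def worst_of_two(make_filter, run_test, corpus):
        # Run once normally, then with ham/spam labels swapped on a fresh
        # filter, and report the worse accuracy of the two.
        flipped = [(tokens, not is_spam) for tokens, is_spam in corpus]
        return min(run_test(make_filter(), corpus),
                   run_test(make_filter(), flipped))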
Closed Test + No Filter Author Involvement = Many Potential Misconfigurations
Mr. Cormack approached me some months ago wanting to perform some of
these tests for a paper he was writing.
Understandably, Mr. Cormack was very frustrated about not being able to achieve
any reasonable levels of accuracy from his filters, including
DSPAM. It turned out that he was using the wrong flags, didn't understand how
to train correctly, and seemed very reluctant to fully read the documentation.
I don't mean to ride on Cormack, but proper testing requires a significant
amount of research, and research seems to be the one thing lacking from this
research paper.
Another thing that concerns me is the level of experience Mr. Cormack has
with statistical filters. Cormack argues that he's used SpamAssassin and
Mozilla, with a little bit of experience on some others...but this doesn't
seem like sufficient experience with a command-line, server-side
_pure statistical_ filter, at least ones like those he measured, which
require a much different level of experience than these tools.
I'm not trying to slander Mr. Cormack, but a closed test without enough
experience isn't going to yield the desired results. This is why an open
test is so important, as well as review. In my opinion, I do not believe
Cormack has made the effort to become as knowledgeable in
this area as he needs to be to run these tests.
(Some or all of) The Spam Filters Tested Were Misconfigured
As with many rushed tests, the testers don't always find the time to become
intimate with the tools they're testing. In this case, it was very obvious to
me when originally speaking with Cormack that he was using the software
incorrectly, but in his research paper it appears that the documentation was
not adequately consulted even up to the test.
It states on page 30 of the research paper that DSPAM doesn't support
Train-Everything mode, and that training was performed using Train-on-Error.
Train-Everything mode was the first mode available in DSPAM, and Train-on-Error
was only coded into the software as of version 2.10, so the test had to be
using Train-Everything without knowing it, and treating it like TOE.
(NOTE: I've been informed that Cormack would be correcting his paper to
reflect this). Many other incorrect statements about
the different filters suggest to me that the
testers still didn't understand the filters they'd been testing.
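For reference, a hedged sketch of the three training policies being discussed - Train-Everything (TEFT), Train-on-Error (TOE), and Train-until-Mature (TUM); the maturity threshold is a made-up number, not DSPAM's actual behavior:

    def should_train(mode, verdict_was_correct, messages_learned, maturity=2500):
        if mode == "TEFT":
            return True                          # learn every message
        if mode == "TOE":
            return not verdict_was_correct       # learn only the mistakes
        if mode == "TUM":
            # learn everything until mature, then only the mistakes
            return messages_learned < maturity or not verdict_was_correct
        raise ValueError("unknown training mode: " + mode)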
One of my last exchanges with Cormack before his testing involved his approach
to training. It appeared as though errors were not being retrained correctly
which I am confident made a significant contribution to his poor results
with DSPAM.
Instead of presenting a message as an error, it was submitted
as a corpusfed spam. This would have learned the tokens as spam, but not
un-learned the erroneous innocent hits on each token - so the message became
learned as both ham and spam. The tests also failed
to present the outputted message, but presented the original message for
retraining. Unless specifically configured to do so (his copy was not),
DSPAM looks for an embedded "watermark" it has added to each email it
processes. This watermark provides a serial number referencing the original
training data. When it cannot be found, only the message body is retrained
(e.g. it tries to do its best assuming the user forwarded in the spam, and
so you'd have their headers instead).
By providing the original message for retraining, and not the DSPAM
processed message (with watermark), DSPAM was (at the time at least)
training only half of each message - leaving the headers without retraining.
I provided him with this information, but I'm not entirely certain that the
corrections made were sufficient prior to testing.
This is understandably confusing, and was ambiguous enough of a "feature" that
it was removed in v3.0.0, but because the documentation was not followed, it
most likely caused unpredictable results in testing and practice.
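To make the corpusfeed-versus-error distinction described above concrete, here is a sketch using the toy token-count filter from the earlier sketches; it illustrates the concept only, not DSPAM's code:

    def corpusfeed_spam(filter_, tokens):
        filter_.train(tokens, is_spam=True)      # adds spam hits only

    def retrain_as_error(filter_, tokens, was_classified_spam):
        # Back out the hits recorded under the wrong class before re-learning
        # the message under the correct one.
        wrong = filter_.spam_counts if was_classified_spam else filter_.ham_counts
        for t in tokens:
            if wrong[t] > 0:
                wrong[t] -= 1                    # un-learn the erroneous hits
        filter_.train(tokens, is_spam=not was_classified_spam)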
It's very odd that the paper would reference v3.0.0's peak 99.991%
(which is just the peak - the highest it can go under ideal conditions), but
would be using 2.8 to measure this. One problem I think is evident is that
Mr. Cormack spoke with me in April (at which point v2.10 had been out for
over a month), but was apparently using v2.8 (a version released in January).
This may have been part of the problem with the results, as well as my
efforts to help him out.
In fact, we really don't know how these tools were configured or what backends they
used. DSPAM supports six different possible back-ends (some of which are
beta, and some of which are unsupported), as well as three different training
modes (TOE, TEFT, and TUM). DSPAM also supports Graham-Bayesian, Burton-Bayesian,
Geometric Mean, and Robinson's Chi-Square algorithms. On top of this, there
are two different algorithms for computing token value, and plenty of other
knobs. We have no idea how each tool was specifically
configured nor did anyone involved in testing appear to post configurations or specific
details about their testing approach.
The Tests Invalidate Themselves By Lack of Real-World Validation
The tests don't come close to reflecting real-world levels of accuracy
experienced by many filters. CRM-114 users experience typical levels of
accuracy surpassing 99.95%, yet the tests show otherwise. The same is true
for many other filters. Even the ones that were rated fairly did not reflect
the real-world accuracy I hear about. When the results of a test don't even come close to
human experience, the tests are possibly erroneous and need to be analyzed,
not published. If the tests are reviewed by one or two independent parties,
retried, and nothing can be found wrong with them, THEN publish them - even if
they go against the accepted performance...but that's not the case here. No
retrials were performed, no independent party confirmed the validity of these
tests, and as a result we ended up with some very oddball results.
Please note again, CRM114 is not my tool - I'm not affiliated with it in any
way except that I hold it in very high regard having seen how it functions
mathematically and am quite certain of its mathematical superiority in both
theory and practice.
I suspect these tests could have been compared at one point to Cormack's own
experience, which, as he informed me, was very poor with statistical filtering.
Having analyzed his configuration personally, I believe this was due primarily
to poor implementation on Cormack's part.
Conclusions
Many technical errors have been made in this test, so many in fact that the
test is beyond recovery in my opinion. I believe a new test is in order -
one that corrects the deficiencies outlined in this article, and more
importantly an open test that filter authors can be involved in.
As a result of the many technical deficiencies and the general mystery behind
how these tests were really performed, I do not believe these tests to be
credible - especially not credible enough to appear in any journal.
Mr. Lynam seems much more interested in the scientific process and
far less argumentative. I would be interested in seeing him pair up with a
different party to conduct a newer, better test.
I sincerely hope these errors are considered and improved upon in future
testing. I am confident that the testers will find tools such as CRM114 and
DSPAM to prove as extremely accurate as their loyal users are finding them.
I'm also confident that every last one of the statistical filters measured
will prove superior.
NOTE: I have spoken to Gordon about this article and although we have many
disagreements about his test and my comments, we are working on analyzing
what exactly went wrong (or at least that's my perspective) - not to further
discredit his test, but in what I hope will improve any future testing. It's
difficult though, as Cormack won't release the code used to perform his
testing or configuration logs for the software as of yet.