Review of Gordon Cormack's Study of Spam Detection
Last Update: Saturday, June 26 2004, 22:00
Jonathan A. Zdziarski
jonathan@nuclearelephant.com
Introduction
Many misled CS students, Ph.D.s, and professionals have jumped on the
spam filtering bandwagon with the uncontrollable urge to
perform misguided tests in order to grab a piece of the interest surrounding
this area of technology. As quickly as Bayesian
filtering (and other statistical approaches) has popped up, equal levels of
interest have arisen among many major groups (academia, the press, and
sysadmins, to name a few). Regrettably, much of the testing published on the
Internet thus far has been little more than zeal without wisdom. This is due
to the technology still being held fairly close to the vest by those
implementing it (a result of its freshness and artistic complexity). This is
not necessarily the fault of the testers; statistical filtering has grown to
become much more than "A Plan for Spam", but unfortunately there is little
useful documentation on the actual implementation and theory behind the
latest improvements (one of the reasons I have written a book on the
subject, scheduled for December 2004).
This article is a response to a research paper by Gordon Cormack and
Thomas Lynam entitled
"A Study of Supervised Spam Detection applied to Eight Months of
Personal E-Mail". This paper was recently featured on Slashdot,
which has unfortunately left a large swarm of geeks with wrong
information drawing wrong conclusions about statistical filtering.
It is not my desire to flame the test or the testers, but there are many
errors that I believe need to be brought to light.
The testing fails in many ways, and has unfortunately marred many superior
statistical filters, such as CRM114 (which has proved its statistical
superiority time and time again). CRM114 isn't my puppy, but I
do believe it to be one of, if not the, most accurate filters in the
world. I
haven't tried to get into very deep philosophical problems with the testing,
although there are some; instead I've tried to provide a list of reasons
the testing was performed incorrectly. Perhaps after reading this,
the researchers might make another attempt to run a successful test. Until
then, I'm afraid these results are less than credible.
Don't get me wrong, I'm glad to see that all the spam filters tested did
very well. When we're measuring hundredths of a percent of accuracy, though,
good enough doesn't cut it. The intricacies of testing can easily throw
the results off by a point or two, which makes a considerable impact when
filters are ranked. All of these filters do an excellent job of filtering
spam, as proven by this test and others,
but the test failed to conduct an adequate comparison due to its many flaws.
Paul Graham has spoken of conducting a bake-off at the next MIT Spam
Conference. Hopefully, these tests provided some "gotchas" to watch for.
You may be wondering why I've taken the time to write an article about a
failed test. Well, as of today the only successful way to beat a Bayesian
filter is not to run one. Spammers can't get around the technology except by
means of bad press, and if tests like this cause people to draw the incorrect
conclusion that statistical filtering is ineffective, nobody's going to run
one. Statistical filtering has shown itself, if anything, to be both adaptive
and highly accurate. Unfortunately, people don't yet know how to test it in a
way that will simulate real-world results.
The Challenge of Testing
Statistical filtering is considered by most to be the next generation of
spam filtering technology. Statistical filtering is dynamic, in that it
learns from its mistakes and performs better with each correction. It has
the unique ability to learn to detect new types of spam without any human
intervention. This has all but obsoleted the many heuristic filters once
considered mainstream, and even tools such as SpamAssassin have incorporated
statistical components into their filters.
Statistical language analysis is unlike any other spam filtering approach we've seen, and because of this it must be tested differently from any other beast. The testing approaches used to measure heuristic spam filters are frequently and erroneously applied to statistical filters, with poor results. The intricacies of machine learning require a more scientific approach than simply throwing mail at the filter, and even the most detailed approaches to testing such a tool only barely succeed in accomplishing a real-world simulation.
Modern language classifiers face a unique situation - they learn based on the environment around them. The problem of testing them is therefore one of maintaining a delicately
controlled environment. When heuristic filtering was popular, there were many different ways to test it; since the filter didn't base its decisions on previous results, just about any testing approach would produce an accurate set of results.
The state of a statistical language classifier is similar to that of a sequential circuit in that the output is a combination of both the inputs and the previous state of the filter. The previous state of the filter is based on a previous set of inputs, which are based on a previous set of results, and so on.
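To make that dependence concrete, here is a minimal sketch in Python (a toy token counter of my own, not the algorithm of any particular filter): the same classify call gives different answers over time because the state it runs against is the product of the entire message history.

    # Minimal sketch: the verdict on each message depends on the state built
    # up from every message (and correction) that came before it.
    def learn(state, message, label):
        # Fold one labeled message into the state (a token -> (spam, ham) count map).
        for token in message.split():
            spam, ham = state.get(token, (0, 0))
            state[token] = (spam + 1, ham) if label == "spam" else (spam, ham + 1)

    def classify(state, message):
        # Score a message against the current state; 0.5 means "no opinion".
        seen = [state[t] for t in message.split() if t in state]
        if not seen:
            return 0.5
        return sum(s / (s + h) for s, h in seen) / len(seen)

    state = {}
    history = [("cheap meds online now", "spam"),
               ("meeting notes attached", "ham"),
               ("your online order shipped", "ham")]
    for message, label in history:
        print(classify(state, message))   # verdict depends on all prior training
        learn(state, message, label)      # ...and then becomes part of the state

Remove, reorder, or delay any message in that history and every verdict computed after it changes; that is the previous-state dependence the rest of this article keeps coming back to.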
Think of a controlled environment in terms of going to the supermarket every week; what you buy from visit to visit is based on what you have in your refrigerator. A single change in the environment, such as a snowstorm, can easily
snowball, leaving your refrigerator off by a few Twinkies
and affecting your grocery purchases for weeks. With this in mind, the challenge of testing is to create an environment that simulates real-world behavior as closely as possible - after all, the accuracy we are trying to measure is how the filter will work in the real world. It suffices to say that testing a statistical filter is no longer a matter of testing, but one of simulation. Simulating real-world behavior takes many factors into consideration that obsolete heuristic testing doesn't.
Message Continuity
Message continuity refers to uninterrupted threads and their message content, dealing specifically with the set of test messages used. Threading is
important to a statistical filter, as is message ordering. Statistical filters
learn new data incrementally as it is mixed with already known data. As
email evolves (spam and legitimate mail alike), its characteristics slowly
change. If messages are presented out of order, incremental learning breaks.
A corpus from which random messages are hand-picked will lead to the
degradation of the filter.
Consider this simplified example. Let's say message A is a spam about Rolex
watches. Your filter learns that words like "Rolex" are spammy. Message B then
comes in, which is an email from a friend who is in the jewelry business. He
talks about a few different types of jewelry he's ordered for the week. The
message is accepted by the filter as legitimate, but only by the skin of its
teeth, because he mentioned Rolex watches. In message C, you've decided to
talk a little bit about Rolex watches, as you've always thought about getting
one. Now, if message B was learned by the filter, then by now Rolex watches
will be considered relatively neutral (could be spam, could be ham), because
you've received legitimate communication about them. On the other hand, if
message B was plucked out of the corpus so that it was never trained, then
message C is going to be bit-bucketed, because the filter never learned that
Rolex emails might not be spam.
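Here is the same scenario as a rough sketch (toy token counts of my own invention, not any real filter's scoring), comparing what happens to message C when message B is trained versus when it is plucked out:

    # Toy demonstration: classify message C after training with and without
    # message B. The token "rolex" flips from neutral to purely spammy when
    # message B is skipped.
    def train(db, text, is_spam):
        for tok in text.lower().split():
            spam, ham = db.get(tok, (0, 0))
            db[tok] = (spam + 1, ham) if is_spam else (spam, ham + 1)

    def score(db, text):
        # Average fraction of spam occurrences across the tokens we know about.
        vals = [s / (s + h) for s, h in
                (db.get(t, (0, 0)) for t in text.lower().split()) if s + h]
        return sum(vals) / len(vals) if vals else 0.5

    msg_a = "genuine rolex watches cheap"            # spam
    msg_b = "the rolex order for the shop arrived"   # legitimate, borderline
    msg_c = "thinking about buying a rolex"          # legitimate

    with_b, without_b = {}, {}
    train(with_b, msg_a, is_spam=True)
    train(without_b, msg_a, is_spam=True)
    train(with_b, msg_b, is_spam=False)              # message B kept in the corpus

    print(score(with_b, msg_c))      # 0.5: "rolex" is now neutral
    print(score(without_b, msg_c))   # 1.0: "rolex" still looks purely spammy

A real filter scores far more carefully than this, but the continuity problem is the same: pluck out the borderline legitimate messages and every later message that depends on them gets judged against the wrong history.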
In real life, there are many messages that make it in by the skin of their
teeth (i.e., with low confidence). A single token could, in some cases,
affect the final result for a message. If the tester fails to train all
of the messages from a user's corpus, they're going to be omitting data which
could play a crucial role in evaluating "borderline" messages correctly. It's
unclear whether Cormack maintained message continuity, as the corpus
isn't available for analysis. It is implied that he did, but not stated.
Archive Window
As we've learned through Terry Sullivan's research, presented at the MIT Spam
Conference in 2004, spam evolves on the order of months. Other tests confirm
this and show us that the seas change every four to six months on average.
It's important to have a concurrent archive of mail if we're going to use one
archive for training.
Many tests out there today fail to use an adequate archive window, and try to
test on two weeks of mail. What many fail to realize is that adaptive learning
isn't really a function of the quantity of mail, but rather of the time span
across which the filter is able to learn permutations. Repeating the same data
over and over again won't teach a statistical classifier how to identify spam
any better than putting beets on the table every night will teach a child to
like beets more. Only after the filter has seen the many different lexical
permutations in a corpus spanning several months will it be able to filter
effectively at high levels of accuracy.
Cormack's test used incremental learning without a training corpus, and so
this didn't appear to be a significant issue - although I do believe his
method of training in general was flawed by not using an archive window for
pretraining (discussed later).
Purge Simulation
An area frequently disregarded in a statistical learning simulation is the purging of stale data. When a training corpus is learned, every message is trained within the same short period of time (usually several minutes or hours), so whatever purging method a particular filter might employ never takes place, because all of the trained data is considered new. Purging is important
because older data can affect the polarity of newer data. For example, if
tokens that haven't been seen in four months carry a spammy polarity,
are then purged, and a couple of new messages come in using those tokens in
a legitimate way, purging allows the tokens to take on their
most recent polarity. Without purging, the tokens would become fairly
neutral and be eliminated from computation, ultimately affecting accuracy.
Another area purging affects (and one which directly affects accuracy) is the
amount of data required to migrate the polarity of tokens in the dataset
during training or retraining. If an old record exists for a particular token
with 100 data points, that record will take much longer to change polarity
than a fresh record or one with few data points. When the old data is not
purged, it not only lingers but causes the tokens to "stick" much more.
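As a rough sketch of the effect (a hypothetical token store of my own, not DSPAM's actual on-disk format), compare what happens to a stale, spam-heavy token record with and without a purge pass before new legitimate mail arrives:

    # Toy purge pass: each token record carries spam hits, ham hits, and the
    # day it was last seen. STALE_AFTER_DAYS is an arbitrary four-month window.
    STALE_AFTER_DAYS = 120

    def purge(db, today):
        # Drop records that have not been seen within the stale window.
        return {tok: rec for tok, rec in db.items()
                if today - rec["last_seen"] <= STALE_AFTER_DAYS}

    def polarity(rec):
        return rec["spam"] / (rec["spam"] + rec["ham"])

    db = {"refinance": {"spam": 100, "ham": 0, "last_seen": 10}}  # old, spammy record
    today = 150   # four-plus months later, two legitimate messages use the token

    purged = purge(dict(db), today)
    purged.setdefault("refinance", {"spam": 0, "ham": 0, "last_seen": today})
    purged["refinance"]["ham"] += 2
    print(polarity(purged["refinance"]))    # 0.0: token takes on its recent, legitimate polarity

    unpurged = {"refinance": {"spam": 100, "ham": 0, "last_seen": 10}}
    unpurged["refinance"]["ham"] += 2
    print(polarity(unpurged["refinance"]))  # ~0.98: the 100 stale data points keep it "stuck"

Without the purge, months of old data points outweigh the new evidence, which is exactly the "sticking" described above.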
This test did not include any purge simulation at all, leaving some 49,000
messages trained as a composite in the wordlist.
Interleave
The interleave at which messages from the corpus are trained, corrected, and classified can play a dramatic role in the results of the test. Many tests are erroneously performed by feeding in two separate corpora - one of legitimate mail and one of spam. Some tests use a 1:1 interleave, while others try their best to simulate a real-world scenario.
Going back to our earlier example of message continuity, think about the effect
of training only one nonspam about jewelry before focusing on Rolex watches,
as opposed to training ten. If you discussed Rolex watches at length before
receiving more Rolex spam, then chances are the messages will be classified
correctly. On the other hand, interrupting this flow will, at the very least,
cause the filter to respond in an artificial way rather than how it
would in real-world scenarios.
The original ordering of the messages in the corpus will generally yield the most realistic results. Cormack claims to have
preserved the message ordering, and Lynam confirmed recently that the spam
and ham were kept in the same file, so it looks like the interleave was
preserved in this test.
Corrective Training Delay
The delay in retraining classification errors is probably one of the most difficult characteristics to simulate. When a misclassification occurs, the user doesn't immediately report it - several other messages are likely to come in before the user checks their email and corrects the error. What's more, submitting
an error changes data, which could cause more errors in some cases.
Delay creates either a snowballing
effect or a delay in mistraining the database. The result can be good or bad,
but either way it's critical to an accurate simulation. The test simulation
retrained immediately when
an error occurred, not allowing the error to propagate or affect any other
decisions. This most likely produced inaccurate results - inaccuracies that
are especially difficult to spot where heuristic functions and statistical
functions were used together.
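A sketch of what simulating that delay might look like (a made-up harness of my own, not the method used in the paper): training feedback sits in a queue and is only applied a few messages late, so the stale, uncorrected state is allowed to influence the verdicts in between.

    from collections import deque

    # Toy harness: training feedback for each message is applied `delay`
    # messages late, so the filter keeps classifying against stale state.
    def run(stream, delay):
        counts = {}                                   # token -> (spam, ham)
        def classify(msg):
            seen = [counts[t] for t in msg.split() if t in counts]
            if not seen:
                return "ham"                          # default to delivering mail
            spam = sum(s for s, h in seen)
            ham = sum(h for s, h in seen)
            return "spam" if spam > ham else "ham"
        def learn(msg, label):
            for t in msg.split():
                s, h = counts.get(t, (0, 0))
                counts[t] = (s + 1, h) if label == "spam" else (s, h + 1)
        pending, verdicts = deque(), []
        for msg, truth in stream:
            verdicts.append(classify(msg))
            pending.append((msg, truth))              # the user hasn't reviewed it yet
            while len(pending) > delay:
                learn(*pending.popleft())             # the user finally gets to it
        return verdicts

    stream = [("win a free prize", "spam"),
              ("free lunch friday", "ham"),
              ("prize claim form inside", "spam")]
    print(run(stream, delay=0))   # immediate feedback, roughly what the paper simulated
    print(run(stream, delay=2))   # delayed feedback: the later verdicts change

In this toy run the delay happens to trade a false positive for a missed spam; the point is not which outcome is better, but that immediate correction and delayed correction produce different histories, and only the delayed one resembles what a real user does.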
The testing performed in Cormack and Lynam's paper wasn't a true
simulation. Only some
of the criteria I listed above were fulfilled. While the original message
ordering and possibly the interleave were preserved, no purging was performed
and no training delays were used. What's more, there was no archive window,
because no initial training was performed before taking measurements, which
grossly threw off the results of the testing.
A Closed Test
The scientific process demands peer review. Cormack has
refused to make his test code (even without the mail corpus), his
configuration log, or his other notes available. This makes the test very
hard to trust, as nobody is able to really look inside to validate his work
or find bugs in his code. His tests assume that he hasn't made any errors in
implementation, which is most likely incorrect. For any scientific test to be
valid, it must be reviewed by an independent party (or many parties). If
these tests had been made public, it wouldn't surprise me to see much more
public support, and possibly even some contributions from developers
(including myself) to make the tests better.
I am very suspicious of any closed test, regardless of the results. Since the
filter authors were not directly involved in these tests, the tests'
reliability is limited by whether the filters were implemented and run
correctly. I'm afraid I can't give any credibility to a test that cannot be
reviewed.
Old Versions of Software Were Used
If you're going to conduct tests, for the sake of science use recent versions
of the software. It was reported that version 2.8.3 of DSPAM was used - v2.8
is well over two major production releases behind and six months old! In
fact, version 2.10 had been out for a month when I was originally contacted
by Cormack in April, but he appeared to still be using a version from January
(I'm only finding this out now).
Version 3.0.0 had also been under development
for about three months prior to its release, with public betas available. As
I understand it, older versions of other software were also used, such as
Bogofilter (0.17).
At the very best, this test shows us the state of spam filtering from
early releases of these tools, which are more than six months old -
even had this test been
conducted without errors, it was already obsolete as of its publication.
At least, that is based on the fact that he was using software six months old
(from January through March '04). His article claims the corpus started in
August 2003,
but that doesn't make much sense, as it would put the date of his test at
around March or April '04 - later than the software he was
using. So either he used software much older than what was available
(even in production), or his numbers are wrong.
If this is the case, then the software pre-dated the spam that was being used
to measure the filter. This is a big no-no in testing. If you test 2004
safety features, you don't test with 2003 vehicles, unless you are specifically
testing the effectiveness of _older_ models in comparison to newer ones
(and they weren't). In this
test, the versions of the software used should have been just as recent as the
mail archived for testing. While statistical filters are excellent at learning
new types of spam, many unrelated tactics also affect the filter software,
such as new encoding tricks - things that require minor tweaking of the
software.
The Test Subject was Anything but Typical
The introduction to the research paper makes the following statement:
"While our study is limited to the extent that X's email is typical"
Yet later on in the paper we read that X's email consisted of over 49,000
emails over a period of eight months. The paper also makes this statement
about X:
X has had the same userid and domain name for 20 years. Variants of X's
email addresses have appeared on the Web, and in newsgroups. X has
accounts on several machines which are forwarded to a common spool file,
where they are stored permanently in the order received.
This seems very atypical, which greatly limits the usefulness of this
study. The test subject does not represent typical
email behavior, except among the most hardcore
geeks. Even then, typical hardcore geeks will adjust this behavior in an
attempt to curb spam. The typical technical user (someone who makes his
living online) will have the same email address for perhaps five or more years,
and the typical non-technical user (a majority of the users on the Internet,
lest we forget) will change email addresses every couple of years.
In either case, most users use one or two variants at the most. A good test
should have included independent tests with
corpora from 10-15 different
test subjects, from all walks of life - geek, doctor, etc. Since X's email
isn't available for examination, we can only make
some assumptions, which build a strong case that the test subject was
not typical and may have helped skew the results:
- Due to X's high volume of traffic and the fact that X's email addresses were available to harvest bots on the Web and in
newsgroups for 20 years, it is no surprise that
X has an abnormally high spam ratio: 81.6%. The
typical user has a spam ratio of perhaps 60%, with 80% being very high
(even among geeks who have
had the same email address for years). With an abnormally high
spam ratio, many spam filters are likely to perform at less-than-optimal
levels without basic tuning. This is for two reasons. First, an overabundance
of spam can seed
the filter's wordlist with tokens that would otherwise be considered
legitimate; because the user only receives a small percentage of legitimate
mail, these words become "flooded", even in a Train-on-Error situation.
This flooding can leave a wordlist with an
underabundance of legitimate tokens. Secondly, the algorithm used to calculate
token value in most statistical filtering approaches relies to some degree on
compensating for an unbalanced corpus (a worked example follows this list).
Having an unusual ratio of spam
(80%+) can cause some filters to overcompensate (unless tuned properly) and
result in less than optimal levels of accuracy. The same is true in the
other direction as well - an overabundance of legitimate mail with very
few spams will result in a significant number of spam misses due to these
algorithms overcompensating. In practice, users with a massively unbalanced
corpus of mail, such as the test subject, would need to perform some
additional tuning, and possibly corpusfeeding, in order to achieve optimal
results.
- The test subject used many different variants of email addresses, which provided many
different variants of header information to analyze, and possibly a very
unbalanced set of data. For example, if X had 20 email
addresses but only used 4 of them for day-to-day legitimate mail, then
a statistical spam filter would learn that any mail addressed to the other 16
would most likely be spam - providing an unbalanced set of header information
to analyze. For the occasional (low-traffic) legitimate messages filtering
into one of those 16 boxes, this is a death sentence. Typical users do not
have more than a few variants, and typically only ones that they are using to
receive legitimate mail. In fact, if a user is experiencing poor levels of
accuracy, one of the first things I ask is whether they have a bunch of
unused email addresses active - whenever they do, and we turn those addresses
off, accuracy improves.
- With over 49,000 messages in eight months, it is reasonable to say that
this user was very active on email. The fact that he/she was on newsgroups
and the web suggests overly diverse email behavior, which requires at least
a few different options to be selected for that user. TOE mode is terrible at
identifying new kinds of email behavior - the testers should have used
Train-Everything or Train-until-Mature for all filters. Cormack claims
that DSPAM and CRM114 don't support Train-Everything mode, but it is
in fact the default for DSPAM, and it is supported by CRM114. For DSPAM
v3.0.0, I would have recommended trying both Train-Everything and
Train-until-Mature.
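As a worked example of that compensation, here is the token-value formula from Graham's "A Plan for Spam" (the same family of algorithms DSPAM calls Graham-Bayesian); the counts below are invented purely for illustration, but they show how the very same token hits produce a different value once the corpus is 80%+ spam:

    # Graham's token probability: raw hit counts are normalized by the number
    # of spam and ham messages trained, which is the formula's built-in
    # compensation for an unbalanced corpus.
    def graham_p(b, g, nbad, ngood):
        # b/g: token hits in spam/ham; nbad/ngood: spam/ham messages trained.
        g = 2 * g                      # Graham doubles ham hits to bias against false positives
        spam_freq = min(1.0, b / nbad)
        ham_freq = min(1.0, g / ngood)
        return max(0.01, min(0.99, spam_freq / (ham_freq + spam_freq)))

    # The same token, hit 40 times in spam and 10 times in ham:
    print(graham_p(b=40, g=10, nbad=5000, ngood=5000))    # ~0.67 on a balanced corpus
    print(graham_p(b=40, g=10, nbad=41000, ngood=9000))   # ~0.31 at roughly an 82% spam ratio

With counts like these, an 80%+ spam ratio shifts the value of every token, which is why a filter facing a corpus like X's may need tuning (or corpusfeeding) that a typical user never touches.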
The Accuracy of the Test Subject's Corpus is Questionable
The research paper claims that the mail corpus was run against SpamAssassin for
classification and compared to existing results. Any discrepancies were
resolved by a human. This fails to account for two issues, which call into
question just how accurate the corpus was in the first place:
- It appears from the wording of the paper that messages which SpamAssassin
agreed were correctly classified in both runs were
accepted without human review. Because the same
program was used to determine the results for the corpus, it is
extremely likely
that both versions of the software would make the same mistakes repeatedly -
the classic problem with monocultural spam filtering - and those mistakes
would go unchecked. SpamAssassin, with learning turned off, is advertised as
being only about 95% accurate (making 1 error for every 20 messages). What
the testers should have done instead is run a different
spam filter, or perhaps two or three other filters, on the corpus to
determine the conflicts requiring human attention. You can't use a less
accurate tool to prepare a test for a more accurate tool!
NOTE: The paper does claim that discrepancies noted during the tests
were examined, but this appears to be only an afterthought. The gold standard
itself was described as being initially set between two copies of SpamAssassin
and the user. Lynam claims that tests were re-run if an error was found, but
it doesn't seem as though the testers would have been looking for errors
during this phase, as the gold standard had already been established.
Ideally, all of this should have been part of defining the gold standard
in the first place.
- All conflicts were resolved by a human (presumably the test subject) after
the fact; i.e., they were not resolved during the eight-month period in which
the mail was collected. The fact that there were any conflicts in the first
place proves that the test subject was not very accurate at manually
classifying their own mail - if they were, there would not be any conflicts.
The accuracy of the test subject would obviously diminish eight months after
the fact, so if they weren't very accurate to begin with, there were most
definitely many human errors made when conflicts were resolved. Bill Yerazunis
did a study of human accuracy and classified his mail several times by hand.
He came to the conclusion that humans were typically only 99.84% accurate.
If the corpus was, at most, 99.84% accurate (and this is generous, given
that it was classified eight months later), then any tools which
were more accurate would spot these errors and appear to be erring themselves
(some rough arithmetic follows this list).
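To put rough numbers on that (illustrative arithmetic only, using figures already mentioned in this article rather than anything from the paper): a noisy gold standard puts a floor under the error rate any filter can appear to achieve, and that floor is larger than the differences being ranked.

    # A gold standard that mislabels a fraction of the corpus charges even a
    # flawless filter with that many "errors".
    corpus_size = 49000              # "over 49,000 emails" per the paper
    gold_accuracy = 0.9984           # Yerazunis' estimate for careful human sorting
    phantom = corpus_size * (1 - gold_accuracy)
    print(int(phantom))              # ~78 errors charged to a filter that made none

    filter_accuracy = 0.9995         # the real-world figure CRM114 users report
    real = corpus_size * (1 - filter_accuracy)
    print(int(real))                 # ~24 errors: smaller than the noise in the yardstick

When the measuring stick is wrong about three times as often as a top-tier filter actually errs, the rankings that fall out of it mean very little.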
The Corpus was Classified by SpamAssassin, for SpamAssassin
SpamAssassin's results are immediately stripped of credibility
because the test corpus was classified by SpamAssassin (twice) and
the test was ultimately a product of SpamAssassin's decisions. Everyone
knows that computers do
exactly what they're told to do. Even if SpamAssassin made 1,000 errors,
it would most likely make some of those errors again, even with a learning
piece enabled. Regardless of the accuracy of the corpus, SpamAssassin was
tailored specifically to act as the referee for the mail corpus being used,
and will therefore obviously provide the desired results (or similar,
with learning enabled). If you use a tool that is only 95% accurate to
prepare a test for tools that are 99.5% accurate, then the lesser tool will
appear to outperform the better tools whenever the better tools are correct
and the gold standard is not. This could have been avoided had many filters
been used to classify the corpus, or had the test been limited to a manageable
number of emails.
Furthermore, SpamAssassin's heuristic functions were designed specifically to
detect the spams used. With the emails being eight months old, the heuristic
rules had clearly been updated during that time to detect spams from the past
eight months. The tests
perform no
analysis of how well SpamAssassin would do against emails received the next
day, or over the next eight months. Essentially, by the time the tests were
performed, SpamAssassin had already been told (by a programmer) to watch for
these spams. A fair test should have used a set of heuristic rules at least
eight months older than the corpus of mail (note, I didn't say software eight
months old). In all likelihood, many of the spams collected at the beginning
of the test already had rules specifically coded for them, so it's possible
going back 12 months might be necessary to remove the human programming.
What good is a test of spam filter accuracy
when the filter has clearly been programmed to detect its test set?
Pretraining Existed for Some Tests, Not Others
Of course, this raises the issue that SpamAssassin was the equivalent of a
pre-trained filter, while all the other filters were not trained. A significant
amount of pre-intelligence was embedded into SpamAssassin prior to testing
(rules written by a human specifically to detect these spams). If
we are to measure pre-trained filters, they should be pitted against other
pre-trained filters. The testers argue that the entire process was graphed.
Unfortunately, this is not sufficient. All of the filters were measured from
their starting point - at which SpamAssassin had the obvious advantage
of being pre-trained. Results should not have been measured until each filter
was sufficiently trained as well. Also, pre-training is very different
from learning. When filters learn, they learn differently than they do when
they pre-train. For example, SpamProbe and DSPAM perform test-conditional
(or iterative) training, which re-trains certain tokens until the erroneous
condition is no longer met. When these filters are pre-trained, however, the
tokens are trained only once for each message. This leaves the dataset in
a very different state.
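A sketch of that difference (toy counts of my own, not SpamProbe's or DSPAM's actual implementation): test-conditional training keeps reinforcing a misclassified message until the verdict flips, while a single corpus-style pass touches each token exactly once, leaving the database in a noticeably different state.

    # Toy comparison: single-pass training vs. test-conditional (iterative)
    # training of the same misclassified spam message.
    def verdict(counts, msg):
        spam = sum(counts.get(t, (0, 0))[0] for t in msg.split())
        ham = sum(counts.get(t, (0, 0))[1] for t in msg.split())
        return "spam" if spam > ham else "ham"

    def train_once(counts, msg, label):
        for t in msg.split():
            s, h = counts.get(t, (0, 0))
            counts[t] = (s + 1, h) if label == "spam" else (s, h + 1)

    def train_until_correct(counts, msg, label, max_rounds=10):
        rounds = 0
        while verdict(counts, msg) != label and rounds < max_rounds:
            train_once(counts, msg, label)       # keep reinforcing until the error clears
            rounds += 1
        return rounds

    start = {"meeting": (0, 3), "tomorrow": (0, 3)}   # tokens with a ham history
    single, iterative = dict(start), dict(start)

    train_once(single, "meeting tomorrow", "spam")    # corpus-style: one pass per message
    rounds = train_until_correct(iterative, "meeting tomorrow", "spam")

    print(single, verdict(single, "meeting tomorrow"))   # (1, 3) counts; still reads as ham
    print(iterative, rounds)                             # (4, 3) counts after 4 rounds; now reads as spam

Judge a filter that was pre-trained in one pass as though it had learned iteratively, and you are measuring a dataset it would never actually have in production.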
This seems to be part of the failure of this test. Many filters
have an initial training cycle in which many features are disabled. DSPAM
specifically disables advanced features such as Bayesian Noise
Reduction until 2,500 innocent messages have been learned - it likes to play
it very safe unless told otherwise by the user (who will usually wait, turn
the knob, or train a corpus of mail). More important to the DSPAM results is
an algorithm called statistical sedation, a tunable feature that
waters down filtering until the training cycle is complete, in order to
prevent false positives. Users who would like better accuracy on
day one can turn this knob in one direction. It doesn't appear this feature
was disabled in the tests (which would obviously explain the weird
regression curve for DSPAM), nor does it appear that any
acceptable level of training was performed before taking measurements. This
is probably what produced the mediocre results of many filters.
Bill Yerazunis had the excellent idea of performing two tests: the first
test measures the ability to detect spam, and the second test flips the
corpus around (what was spam is now ham, what was ham is now spam)
and instructs the filter to detect spam again. The _worse_ of
the two results is the one to be used. This would remove any pre-intelligence
programmed into any filter and measure each of them on its ability to detect
what the user tells it is spam. Unfortunately, it doesn't look like this
will be incorporated into the testing.
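In outline, the procedure might look something like this (a sketch of the idea only; run_filter stands in for whatever harness actually drives the filter under test):

    # Yerazunis' "flip" test in outline: run the corpus once as labeled, once
    # with the labels inverted, and keep only the worse of the two scores, so
    # no filter gets credit for pre-built knowledge of what spam looks like.
    def flip_test(corpus, run_filter):
        # corpus: list of (message, label); run_filter returns an accuracy in [0, 1].
        normal = run_filter(corpus)
        flipped = run_filter([(msg, "ham" if label == "spam" else "spam")
                              for msg, label in corpus])
        return min(normal, flipped)    # a filter is only as good as its worse run

A pre-trained filter (or a rule set written with the test spam in hand) aces the first run and falls apart on the second, while a genuinely adaptive filter scores about the same both ways.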
Closed Test + No Filter Author Involvement = Many Potential Misconfigurations
Mr. Cormack approached me some months ago wanting to perform some of
these tests for a paper he was writing.
Understandably, Mr. Cormack was very frustrated about not being able to achieve
any reasonable levels of filter accuracy from his filters, including
DSPAM. It turned out that he was using the wrong flags, didn't understand how
to train correctly, and seemed very reluctant to fully read the documentation.
I don't mean to ride on Cormack, but proper testing requires a significant
amount of research, and research seems to be the one thing lacking from this
research paper.
Another thing that concerns me is Mr. Cormack's level of experience
with statistical filters. Cormack argues that he's used SpamAssassin and
Mozilla, with a little bit of experience with some others...but this doesn't
seem like sufficient experience with command-line, server-side
_pure statistical_ filters, at least not ones like
those he measured, which require a much different level of experience than
those tools do.
It does concern me that it may have been out
of frustration that Cormack decided to test with a six-month-old version of
DSPAM
(instead of the version available when he contacted me, which was only one
month old) and used flaming words such as "inferior" in his paper to describe
the software.
I'm not trying to slander Mr. Cormack, but I think it's important to note
that a closed test run without enough experience isn't going to yield
meaningful results. This is why an open test, and review, are so important.
I do not believe Cormack has made the effort to become as knowledgeable in
this area as he needs to be to run these tests.
As with many rushed tests, the testers don't always find the time to become
intimate with the tools they're testing. In this case, it was very obvious to
me when originally speaking with Cormack that he was using the software
incorrectly, and from the research paper it appears that the documentation
still had not been adequately consulted by the time of the test.
Page 30 of the research paper states that DSPAM doesn't support
Train-Everything mode, and that training was performed using Train-on-Error.
Train-Everything was the first mode available in DSPAM, and Train-on-Error
was only coded into the software as of version 2.10, so the test had to have
been using Train-Everything without knowing it, while treating it like TOE.
(NOTE: I've been informed that Cormack will be correcting his paper to
reflect this.) Many other incorrect statements about
the different filters suggest to me that the
testers still didn't understand the filters they'd been testing.
One of my last exchanges with Cormack before his testing involved his approach
to training. It appeared as though errors were not being retrained correctly,
which I am confident contributed significantly to his poor results
with DSPAM.
Instead of presenting a message as an error, it was submitted
as a corpusfed spam. This would have learned the tokens as spam, but not
un-learned the erroneous innocent hits on each token - so the message became
learned as both ham and spam. The tests also failed
to present the outputted message for retraining, presenting the original
message instead. Unless specifically configured otherwise (his copy was not),
DSPAM looks for an embedded "watermark" it has added to each email it
processes. This watermark provides a serial number referencing the original
training data. When it cannot be found, only the message body is retrained
(i.e., it does its best on the assumption that the user forwarded in the
spam, so the headers would be the user's instead).
Because the original message, and not the DSPAM-processed message (with the
watermark), was provided for retraining, DSPAM was (at the time, at least)
training on only half of each message - leaving the headers unretrained.
I provided him with this information, but I'm not entirely certain that the
corrections made were sufficient prior to testing.
This is understandably confusing, and it was an ambiguous enough "feature"
that it was removed in v3.0.0; but because the documentation was not followed,
it most likely caused unpredictable results in both testing and practice.
In fact, we really don't know how these tools were configured or what
back-ends they
used. DSPAM supports six different back-ends (some of which are
beta, and some of which are unsupported), as well as three different training
modes (TOE, TEFT, and TUM). DSPAM also supports the Graham-Bayesian,
Burton-Bayesian, Geometric Mean, and Robinson's Chi-Square algorithms. On top
of this, there
are two different algorithms for computing token value, and plenty of other
knobs. We have no idea how each tool was specifically
configured, nor did anyone involved in the testing appear to post
configurations or specific details about their testing approach.
The Tests Lack Real-World Validation
The tests don't come close to reflecting the real-world levels of accuracy
experienced by many filters. CRM114 users typically report
accuracy surpassing 99.95%, yet the tests show otherwise. The same is true
for many other filters. Even the ones that were rated fairly did not reflect
the real-world accuracy I hear about. When the results of a test don't even come close to
human experience, the tests are possibly erroneous and need to be analyzed,
not published. If the tests are reviewed by one or two independent parties,
retried, and nothing can be found wrong with them, THEN publish them - even if
they go against the accepted performance...but that's not the case here. No
retrials were performed, no independent party confirmed the validity of these
tests, and as a result we ended up with some very oddball results.
Please note again, CRM114 is not my tool - I'm not affiliated with it in any
way except that I hold it in very high regard having seen how it functions
mathematically and am quite certain of its mathematical superiority in both
theory and practice.
I suspect these tests could have been compared at some point to Cormack's own
experience with statistical filtering, which, as he informed me, was very
poor - and, having personally analyzed his configuration, I believe that was
due primarily to poor implementation on Cormack's part.
Conclusions
Many technical errors have been made in this test, so many in fact that the
test is beyond recovery in my opinion. I believe a new test is in order -
one that corrects the deficiencies outlined in this article, and more
importantly an open test that the filter authors can be involved in.
As a result of the many technical deficiencies and the general mystery behind
how these tests were really performed, I do not believe these tests to be credible - especially not
credible enough to appear in any journal.
Mr. Lynam seems much more interested in the scientific process and
far less argumentative. I would be interested in seeing him pair up with a
different party to conduct a newer, better test.
I sincerely hope these errors are considered and corrected in their testing. I
am confident that they will find tools such as CRM114 and DSPAM to be every
bit as accurate as their loyal users find them. I'm also confident
that every last one of the statistical filters measured will prove superior.