Response to Gordon Cormack's Study of Spam Detection
Last Update: Saturday, June 26 2004, 22:00
Jonathan A. Zdziarski
jonathan@nuclearelephant.com
Introduction
Many misled CS students, Ph.Ds, and professionals have jumped on the
spam filtering bandwagon with the uncontrollable urge to
perform misguided tests in order to grab a piece of the interest surrounding
this area of technology. As quickly as Bayesian
filtering (and other statistical approaches) have popped up, equal levels of
interest arose among many major groups (academic, press, and sysadmins to name a few). With regret, much of the testing published on the Internet
thus far has been not much more than zeal without wisdom. This is due to
the fact that
the fact that
the technology is still being held fairly close to the vests of those
implementing it (a result of its freshness and artistic complexity). This is
not necessarily the fault of the testers; statistical filtering has grown to
become much more than "A Plan for Spam", but unfortunately there is little
useful documentation on the actual implementation and theory behind the
latest improvements (one of the reasons I have written a book on the
subject, scheduled for December 2004).
This article is a response to a research paper by Gordon Cormack and
Thomas Lynam entitled
"A Study of Supervised Spam Detection applied to Eight Months of
Personal E-Mail". This paper was recently featured on Slashdot,
which has unfortunately resulted in a large swarm of geeks with wrong
information drawing wrong conclusions about statistical filtering.
It is not my desire to flame the test or the testers, but there are many
errors I believe need to be brought to light.
The testing fails in many ways, and has unfortunately marred many superior
statistical filters, such as CRM114 (which has proved statistical
superiority time and time again). CRM114 isn't my puppy, but I
do believe it to be one of, if not the, most accurate filters in the
world. I
haven't tried to get into very deep philosophical problems with the testing,
although there are some, but I've tried instead to provide a list of reasons
the testing was performed incorrectly. Perhaps after reading this,
the researchers might make another attempt to run a successful test. Until then, I'm afraid
these results are less than credible.
Don't get me wrong, I'm glad to see that all the spam filters tested did
very well. When we're measuring hundredths of a percent of accuracy, though,
good enough doesn't cut it. The intricacies of testing can easily throw
the results off a point or two which makes a considerable impact on the
results. All of these filters do an excellent job at filtering spam, as proved
by this test and others,
but the test failed to conduct an adequate comparison due to many flaws.
Paul Graham has spoken of conducting a bake-off at the next MIT Spam
Conference. Hopefully, these tests provided some "gotchas" to watch for.
The Challenge of Testing
Statistical filtering is considered by most as the next generation technology
of spam filtering. Statistical filtering is dynamic, in that it learns from
its mistakes and performs better with each correction. It has the unique
ability to learn to detect new types of spams without any human intervention.
This has all but obsoleted the many heuristic filters once considered
mainstream, and even tools such as SpamAssassin have incorporated statistical
components into their filters.
Statistical language analysis is unlike any other spam filtering approach we've seen, and because of this it tests differently than any other beast. The testing approaches used to measure heuristic spam filters are frequently and erroneously applied to statistical filters, resulting in poor testing results. The intricacies of machine learning require a more scientific approach than simply throwing mail at the filter, and even the most detailed approaches to testing such a tool only barely succeed in accomplishing a real-world simulation.
Modern day language classifiers face a very unique situation - they learn based on the environment around them. The problem is therefore one of extremely controlled environment. When heuristic filtering was popular, there were many different ways to test it. Since the filter didn't base its decisions on the previous results, an accurate set of results would accommodate just about any type of testing approach used.
The state of a statistical language classifier is similar to that of a sequential circuit in that the output is a combination of both the inputs and the previous state of the filter. The previous state of the filter is based on a previous set of inputs, which are based on a previous set of results, and so on.
Think of controlled environment in terms of going to the supermarket every week; what you buy from visit to visit is based on what you have in your refrigerator. A single change in the environment (milk going sour) can easily snowball to affect the results of a filter by many messages, and change your milk
buying patterns for many weeks. With this in mind, the challenge of testing is to create an environment that simulates real-world behavior as closely as possible - after all, the accuracy we are trying to measure is how the filter will work in the real world. It suffices to say that testing a statistical filter is no longer a matter of testing, but one of simulation. Simulating real-world behavior takes many factors into consideration that obsolete heuristic testing doesn't.
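To make this concrete, here is a minimal toy sketch (not any particular filter's implementation) of a token-count classifier whose every verdict depends on the state left behind by earlier training decisions:

    # A toy classifier: the "state" is the accumulated token counts, so each
    # verdict is a function of the current message plus every prior decision.
    from collections import defaultdict

    class TinyFilter:
        def __init__(self):
            self.spam_counts = defaultdict(int)   # previous "state"
            self.ham_counts = defaultdict(int)

        def train(self, tokens, is_spam):
            counts = self.spam_counts if is_spam else self.ham_counts
            for t in tokens:
                counts[t] += 1

        def score(self, tokens):
            # Crude spamminess: fraction of tokens seen more often in spam.
            spammy = sum(1 for t in tokens
                         if self.spam_counts[t] > self.ham_counts[t])
            return spammy / max(len(tokens), 1)

    f = TinyFilter()
    f.train("cheap meds online".split(), is_spam=True)
    f.train("meeting notes attached".split(), is_spam=False)
    print(f.score("cheap meds attached".split()))  # depends on all prior training

Shuffle the training order or skip a correction and the same message can score differently, which is exactly why the test environment has to be controlled.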
Of course, when you're not testing to measure accuracy, this type of simulation isn't always necessary. Chaos in message ordering and content may be appropriate when testing to compare features for a particular filter, or any other kind of blind test where accuracy isn't as important as deviation, but that's not
what this paper was testing. Some of these requirements were relayed to Cormack,
and he did a fair job of implementing some of them. Others, he wasn't
so lucky with.
Message Continuity
Message continuity refers to the uninterrupted threads and their message content, dealing specifically with the set of test messages used. Threading is
important to a statistical filter as is message ordering. Statistical filters
learn new data incrementally as it is mixed with already known data. As
email evolves (spam and legitimate mail alike), characteristics slowly change.
If the messages are presented out of order, incremental learning breaks. This
does not appear to have been a problem with the tests, as Cormack claimed to
track the original ordering.
Archive Window
As we've learned through Terry Sullivan's research at the MIT Spam Conference
in 2004, spam evolves on the order of months. Other tests confirm this and show us that the seas change every 4 to 6 months on average. It's important to have
a concurrent archive of mail if we're going to use one archive for training.
This test used incremental learning without a training corpus, and so this too
was not a significant issue - although I do believe his method of training
in general was flawed by not using an archive window for pretraining
(discussed later).
Purge Simulation
An area frequently disregarded in a statistical learning simulation is the purging of stale data. When the training corpus is learned, each message is trained within the same short period of time (usually a period of several minutes or hours). The usual method of purging that a particular filter might employ doesn't take place because all of the data trained is considered new. Purging is important
because older data can affect the polarity of newer data. For example, if
tokens that haven't been seen in four months reflect a spammy polarity,
then are purged, and a couple new messages come in using those tokens in
a legitimate way, then purging will allow the tokens to take on their
most recent polarity. Without purging, the tokens would become fairly
neutral and be eliminated from computation, ultimately affecting accuracy.
Another area purging affects (which directly affects accuracy) is the amount
of data required to migrate the polarity of tokens in the dataset for
training or retraining. If an old record exists for a particular token with
100 data points, that record will take much more time to change polarity than
a fresh record or one with few datapoints. By not purging the old data,
it not only lingers but it causes the tokens to "stick" much more.
This test did not include any purge simulation at all, leaving some 49,000
messages trained as a composite in the wordlist.
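As an illustration only, here is a rough sketch of the kind of staleness purge a realistic simulation would have to model; the record layout and the 120-day window are hypothetical, not any particular filter's policy:

    import time

    SECONDS_PER_DAY = 86400

    def purge_stale_tokens(token_db, now, max_age_days=120):
        # Drop records not hit within the window, so a token that reappears
        # later takes on its most recent polarity instead of being diluted
        # by months-old counts.
        cutoff = now - max_age_days * SECONDS_PER_DAY
        return {tok: rec for tok, rec in token_db.items()
                if rec["last_hit"] >= cutoff}

    now = time.time()
    db = {
        "mortgage": {"spam_hits": 40, "ham_hits": 1, "last_hit": now - 150 * SECONDS_PER_DAY},
        "invoice":  {"spam_hits": 3, "ham_hits": 25, "last_hit": now - 2 * SECONDS_PER_DAY},
    }
    print(sorted(purge_stale_tokens(db, now)))  # ['invoice'] - the stale record is gone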
Interleave
The interleave at which messages from the corpus are trained, corrected, and classified can play a dramatic role in the results of the test. Many tests are erroneously performed by feeding in two separate corpora - one of legitimate mail and one of spam. Some tests use a 1:1 interleave, while others try their best to simulate a real-world scenario. The original ordering of the messages in the corpus will generally yield the most realistic results. Cormack claims to have
preserved the message ordering, and Lynam confirmed recently that the spam
and ham was kept in the same file, so it looks like interleave was
preserved in this test.
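For illustration, a small sketch of the difference between a naive 1:1 interleave and replaying the corpus in its original received order (the message tuples are made up):

    from itertools import chain, zip_longest

    ham  = [("2003-08-01 09:00", "ham1"), ("2003-08-01 09:05", "ham2")]
    spam = [("2003-08-01 03:00", "spam1"), ("2003-08-01 21:00", "spam2")]

    # Erroneous: alternate ham/spam regardless of when they actually arrived.
    one_to_one = [m for pair in zip_longest(ham, spam) for m in pair if m]

    # Realistic: replay every message in the order the user received it.
    original_order = sorted(chain(ham, spam), key=lambda m: m[0])

    print([m[1] for m in one_to_one])      # ham1, spam1, ham2, spam2
    print([m[1] for m in original_order])  # spam1, ham1, ham2, spam2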
Corrective Training Delay
The delay in retraining classification errors is probably one of the most difficult characteristics to simulate. When a misclassification occurs, the user doesn't immediately report it - several other messages are likely to come in before the user checks their email and corrects the error. What's more, submitting
an error changes data - which could cause more errors in some cases.
Delay creates either a snowballing effect or a delay in mistraining the database. The result can be good or bad,
but nevertheless, it's critical to an accurate simulation. The test simulation
retrained immediately when
an error occurred, not allowing the error to propagate or affect any other
decisions. This most likely resulted in inaccurate results - especially
difficult to spot where heuristic functions and statistical functions were used
together.
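A rough sketch of how a simulation might model that delay, assuming a filter object with the train()/score() interface of the toy sketch earlier (the five-message delay is an arbitrary assumption):

    def run_simulation(filter_, messages, check_every=5):
        # messages: list of (tokens, is_spam) in original received order.
        pending_errors = []
        mistakes = 0
        for i, (tokens, is_spam) in enumerate(messages, 1):
            verdict = filter_.score(tokens) > 0.5
            if verdict != is_spam:
                mistakes += 1
                pending_errors.append((tokens, is_spam))  # user hasn't seen it yet
            if i % check_every == 0:                      # user finally checks mail
                for tok, label in pending_errors:
                    filter_.train(tok, label)             # corrections arrive late
                pending_errors.clear()
        return mistakes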
General Grievances
The tests performed in Cormack and Lynam's paper weren't a true
simulation. Only some
of the criteria I listed above were fulfilled. While the original message
ordering and possibly the interleave were preserved, no purging was performed
and no training delays were used. What's more, there was no archive window,
because no initial training was performed before taking
measurement.
Statistical filters know nothing until you train them. Therefore, if you're
going to measure their accuracy, you need to train them first. If you start
measuring before you've taught the filter anything, then you're going to
end up with some pretty mediocre results. Many other issues discredit the
findings of this test as well; I've outlined them below.
A Closed Test
The scientific process demands peer review. Cormack has refused to make his
test code (even without the mail corpus), his configuration log, or other
notes available. This makes the test very hard to trust, as nobody is
able to really look inside and validate his work, or find bugs in his
code. His tests assume that he hasn't made any errors in implementation, which
is most likely very incorrect. In order for any scientific test to be valid, it
must be reviewed by an independent party (or many parties). If these tests had
been made public, it wouldn't surprise me to see much more public support
and possibly even some contributions by developers (including myself)
to make the tests better.
I am very suspicious of any closed test, regardless of the results. Since the
filter authors were not directly involved in these tests, their reliability is
limited to the extent that they were implemented correctly. I'm afraid I can't
give any credibility to a test that cannot be reviewed.
Old Versions of Software Were Used
If you're going to conduct tests, for the sake of science use recent versions of the
software. It was reported that version 2.8.3 of DSPAM was used - v2.8 is
well over two major production releases and six months old! In fact, version
2.10 had been released for a month when I was originally contacted by Cormack in
April, but he appeared to still be using a version from January (I'm only
finding this out now).
In fact,
3.0.0 had also been under development
for about three months prior to its release, with public betas available. As I understand it,
older versions of other software were also used, such as Bogofilter (0.17).
At the very best, this test shows us the state of spam filtering from
early releases of these tools, which are more than six months old, meaning
we're dealing with spam as old as 14 months -
even had this test been
conducted without errors, it is already obsolete as of its publication.
At least this is based on the fact that he was using software six months old
(from January '04 - Mar '04). His article claims the corpus started in August
2003,
but that doesn't make much sense as that would put the date of his test at
around March or April '04, which of course outdates the software he was
using. So either he used software much older than what was available
(even in production) or his numbers are wrong.
If this is the case, then the software pre-dated the spam that was being used
to measure the filter. This is a big no-no in testing. If you test 2004
safety features, you don't test with 2003 vehicles, unless you are specifically
testing for the effectiveness of _older_ models in comparison to newer ones
(and they weren't). In this
test, the versions of software used should be just as recent as the mail
archived for testing. While statistical filters are excellent at learning new
types of spam, many unrelated tactics also affect the filter software, such as
new encoding tricks and such - things that require minor tweaking of the
software.
The Test Subject was Anything but Typical
The introduction to the research paper makes the following statement:
"While our study is limited to the extent that X's email is typical"
Yet later on in the paper we read that X's email consisted of over 49,000
emails over a period of eight months. The paper also makes this statement
about X:
X has had the same userid and domain name for 20 years. Variants of X's
email addresses have appeared on the Web, and in newsgroups. X has
accounts on several machines which are forwarded to a common spool file,
where they are stored permanently in the order received.
This seems very atypical, which greatly limits the usefulness of this
study. The test subject does not represent typical
email behavior, except among the most hardcore
geeks. Even still, typical hardcore geeks will adjust this behavior in an
attempt to curb spam. The typical technical user (someone who makes his
living online) will have the same email address for perhaps five or more years,
and the typical non-technical user (a majority of the users on the Internet,
lest we forget) will change email addresses every couple of years.
In either case, most sane users use one or two variants at the most. 49,000
emails in eight months is
also absurd. A good test should have included independent tests with
corpora from 10-15 different
test subjects, from all walks of life - geek, doctor, etc. Since X's email isn't available for examination, we can only draw
some assumptions, which make for a strong case that the test subject was
not typical and may have helped provide skewed results:
- Due to X's
extremely high volume of traffic and the fact that X's email addresses were available to harvest bots on the Web and in
newsgroups for 20 years, it is no surprise that
X has an abnormally high spam ratio, 81.6%. The
typical user has a spam ratio of perhaps 60% with 80% being very high
(including geeks who have
had the same email address for years). Having an abnormally high
spam ratio, many spam filters are likely to perform at less-than-optimal
levels without basic tuning. This is for two reasons. First, an overabundance of spam can seed
the filter's wordlist with tokens that would otherwise be considered legitimate,
but because the user only receives a small percentage of legitimate mail,
these words become "flooded", even in a Train-on-Error situation.
This can leave a wordlist with an
underabundance of legitimate tokens due to this flooding. Secondly, the algorithm used to calculate
token value in most statistical filtering approaches relies to some degree on
compensating for an unbalanced corpus (see the sketch after this list). Having an unusual ratio of spam
(+80%) can cause some filters to overcompensate (unless tuned properly) and
result in less than optimal levels of accuracy. The same is true in the
other direction as well - an overabundance of legitimate mail with very
few spams will result in a significant number of spam misses due to these
algorithms overcompensating. In practice, users with a massively unbalanced
corpus of mail, such as the test subject, would need to perform some
additional tuning and possibly corpusfeeding in order to achieve optimal
results.
- The test subject used many different variants of email addresses, which provided many
different variants of header information to analyze, and possibly a very
unbalanced set of data. For example, if X had 20 email
addresses, but only used 4 of them for day-to-day legitimate mail, then that means
a statistical spam filter would learn that any mail addressed to the other 16
would most likely be spam - providing an unbalanced set of header information
to analyze. For the occasional (low-traffic) messages filtering into one
of these 16 boxes, this is a death sentence. Typical users do not have more
than a few variants, and typically only ones that they are using to receive
legitimate mail. In fact, if a user is experiencing poor levels of accuracy
one of the first things I ask them is if they have a bunch of unused email
addresses active - whenever one does, and we turn it off, accuracy
improves.
- With over 49,000 messages in eight months, it is reasonable to say that
this user was very active on email. The fact that he/she was on newsgroups
and the web suggests an overly diverse email behavior, which requires at least
a few different options to be selected for that user. TOE mode is terrible at
identifying new kinds of email behavior - the testers should have used
Train-Everything or Train-until-Mature for all filters. Cormack makes the
claim that DSPAM and CRM114 don't support Train-Everything mode, but it is
in fact the default for DSPAM, and is supported by CRM114. For DSPAM v3.0.0,
I would have recommended trying both Train-Everything and Train-until-Mature.
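As promised above, here is a sketch along the lines of Paul Graham's "A Plan for Spam" token probability, included only to show where the ham/spam message ratio enters the computation; the counts below are hypothetical:

    def token_probability(spam_hits, ham_hits, nspam, nham):
        good = 2 * ham_hits          # Graham doubles ham hits to bias against false positives
        bad = spam_hits
        if good + bad < 5:           # too little data to trust
            return 0.4
        p_bad = min(1.0, bad / nspam)
        p_good = min(1.0, good / nham)
        return max(0.01, min(0.99, p_bad / (p_bad + p_good)))

    # The same raw counts look spammy on a balanced corpus but hammy at an
    # 81.6% spam ratio - the overcompensation risk described above.
    print(token_probability(30, 10, nspam=1000, nham=1000))  # ~0.60
    print(token_probability(30, 10, nspam=8160, nham=1840))  # ~0.25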
The Accuracy of the Test Subject's Corpus is Questionable
The research paper claims that the mail corpus was run against SpamAssassin for
classification and compared to existing results. Any discrepancies were
resolved by a human. This fails to account for two issues, which question
just how accurate the corpus was in the first place:
- It appears from the wording of the paper that messages on which SpamAssassin
agreed with the existing classification in both runs were
accepted without human review. The fact that the same
program was used to determine the results of the corpus suggests it is
extremely likely
that both versions of the software could make the same mistakes repeatedly,
as is the problem with monocultural spam filtering. This would
go unchecked. SpamAssassin, with learning turned off, is advertised to be
only as accurate as 95% (making 1 error for every 20 messages). Instead, what
the testers should have done is run a different
spam filter, or perhaps two or three other filters, on the corpus to
determine conflicts requiring human attention (see the sketch after this
list). You can't use a less accurate tool to prepare a test for a more accurate tool!
NOTE: The paper does make claims that discrepancies noted during the tests
were examined, but this appears only to be an afterthought. The gold standard
itself was described as being initially set between two copies of SpamAssassin
and the user. Lynam claims that tests were re-run if an error was found, but
it doesn't seem as though the testers would have been looking for errors
during this phase, as the gold standard had already been established.
Ideally, all of this should've been a part of defining the gold standard
in the first place.
- All conflicts were resolved by a human (presumably the test subject) after
the fact; e.g. they were not resolved during the eight month period in which
mail was collected. The fact that there were any conflicts in the first
place proves that the test subject was not very accurate at manually
classifying their own mail - if they were, there would not be any conflicts.
The accuracy of the test subject would obviously diminish eight months after
the fact, so if they weren't very accurate to begin with, there were most
definitely many human errors made when conflicts were resolved. Bill Yerazunis
did a study of human accuracy and classified his mail several times by hand.
He came to the conclusion that humans were typically only 99.84% accurate.
If the corpus was, at the most, 99.84% accurate (and this is generous being
that it was classified eight months later) then that means that any tools which
were more accurate would spot these errors, and appear as erroring themselves.
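As promised above, a sketch of the kind of adjudication I'm suggesting: several independent classifiers vote, and anything they disagree on goes to a human rather than being accepted on one filter's say-so. The stand-in classifiers here are toys:

    def build_gold_standard(messages, provisional_labels, classifiers):
        gold, needs_human = [], []
        for msg, label in zip(messages, provisional_labels):
            votes = [classify(msg) for classify in classifiers]
            if all(v == label for v in votes):
                gold.append((msg, label))                # unanimous: accept
            else:
                needs_human.append((msg, label, votes))  # any conflict: review
        return gold, needs_human

    classifiers = [lambda m: "mortgage" in m, lambda m: m.count("!") > 3]
    gold, review = build_gold_standard(
        ["refinance your mortgage now!!!!", "lunch tomorrow?"], [True, True], classifiers)
    print(len(gold), len(review))  # 1 1 - the second label gets human review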
The Corpus was Classified by SpamAssassin, for SpamAssassin
SpamAssassin is immediately eliminated from the credibility of these results
because the test corpus was classified by SpamAssassin (twice) and
the test was ultimately a product of SpamAssassin's decisions. Everyone
knows that computers do
exactly what they're told to do. Even if SpamAssassin made 1,000 errors,
it would most likely make some of those errors again even with a learning
piece enabled. Regardless of the accuracy of the corpus, SpamAssassin was
tailored specifically to act as the referee for the mail corpus being used,
and therefore will obviously provide the desired results (or similar,
with learning enabled). If you use a tool that is only 95% accurate to
prepare a test for tools that are 99.5% accurate, the more accurate tools
will be scored as wrong whenever they correctly catch the lesser tool's
mistakes - making the lesser tool appear to outperform them. This could have been avoided had many filters
been used to classify the corpus, or had the test been limited to a manageable
number of emails.
Furthermore, heuristic functions were designed to specifically detect the
spams used. The emails being 8 months old, heuristic rules were clearly
updated during this time to detect spams from the past eight months. The tests
perform no
analysis of how well SpamAssassin would do up against emails received the next
day, or the next eight months. Essentially, by the time the tests were
performed, SpamAssassin had already been told (by a programmer) to watch for
these spams. A fair test should have used a set of heuristic rules eight
months older than the corpus of mail or more (note, I didn't say software 8 months old). In all likelihood, many of the spams collected at the beginning
of the test had already had rules specifically coded for them, so it's possible
going back 12 months might be necessary to remove the human programming.
What good is a test to
detect spam filter accuracy
when the filter has clearly been programmed to detect its test set?
Pretraining Existed for Some Tests, Not Others
Of course, this raises the issue that SpamAssassin was the equivalent of a
pre-trained filter, while all the other filters were not trained. A significant
amount of pre-intelligence was embedded into SpamAssassin prior to testing
(rules written by a human specifically to detect these spams). If
we are to measure pre-trained filters, they should be pitted against other
pre-trained filters. The testers argue that the entire process was graphed.
Unfortunately, this is not sufficient. All of the filters were measured from
their starting point - of which SpamAssassin was given the obvious advantage
by being pre-trained. Results should not have been measured until each filter
was sufficiently trained as well. Also, pre-training is very different
from learning. When filters learn, they learn differently than they do when
they pre-train. For example, SpamProbe and DSPAM perform test-conditional
(or iterative) training, which re-trains certain tokens until the erroneous
condition is no longer met. When these filters are pretrained, however, the
tokens are trained only once for each message. This leaves the dataset in
a very different state.
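A rough sketch of that difference, reusing the toy train()/score() interface from the earlier sketches (the retry cap is an arbitrary assumption):

    def train_test_conditional(filter_, tokens, is_spam, max_passes=5):
        # Iterative training: reinforce until the filter classifies correctly.
        for _ in range(max_passes):
            if (filter_.score(tokens) > 0.5) == is_spam:
                break                        # condition met, stop reinforcing
            filter_.train(tokens, is_spam)

    def pretrain_corpus(filter_, corpus):
        # Corpus pretraining: each message touches the tokens exactly once.
        for tokens, is_spam in corpus:
            filter_.train(tokens, is_spam)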
This seems to be part of the failure of this test. Many filters
have an initial training cycle in which many features are disabled. DSPAM
specifically disables many advanced features such as Bayesian Noise
Reduction until 2500 innocent messages are learned - it likes to
play it very safe unless told otherwise by the user (who will usually wait,
turn the knob, or train a corpus of mail). More important to the DSPAM results
is an algorithm called statistical sedation, which is a tunable feature that
waters down filtering until the training cycle is complete - in order to
prevent false positives. Users who would like better accuracy on
day 1 can turn this knob in one direction. It doesn't appear this feature
was disabled in the tests (which would obviously explain the weird
regression curve for DSPAM), nor does it appear that any
acceptable levels of training were performed before taking measurement.
This is probably what resulted in the mediocre results of many filters.
Bill Yerazunis had the excellent idea of performing two tests: the first
test measures the ability to detect spam, but then the second test would
flip around the corpus (what was spam is now ham, what was ham is now spam),
and the filter would then be instructed to detect spam again. The _worst_ of
the two tests is the result to be used. This would remove any pre-intelligence
programmed into any filter and measure each one based on its ability to detect
what the user tells it is spam. Unfortunately, it doesn't look like this
will be incorporated into the testing.
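For illustration, a sketch of that flip-corpus evaluation; make_filter and run_test are hypothetical stand-ins for a full test harness:

    def worst_of_two(make_filter, run_test, corpus):
        # Run once normally, then with ham/spam labels swapped on a fresh
        # filter, and report the worse accuracy of the two.
        flipped = [(tokens, not is_spam) for tokens, is_spam in corpus]
        return min(run_test(make_filter(), corpus),
                   run_test(make_filter(), flipped))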
Closed Test + No Filter Author Involvement = Many Potential Misconfigurations
Mr. Cormack approached me some months ago wanting to perform some of
these tests for a paper he was writing.
Understandably, Mr. Cormack was very frustrated about not being able to achieve
any reasonable levels of accuracy from his filters, including
DSPAM. It turned out that he was using the wrong flags, didn't understand how
to train correctly, and seemed very reluctant to fully read the documentation.
I don't mean to ride on Cormack, but proper testing requires a significant
amount of research, and research seems to be the one thing lacking from this
research paper.
Another thing that concerns me is the level of experience Mr. Cormack has
with statistical filters. Cormack argues that he's used SpamAssassin and
Mozilla, with a little bit of experience on some others...but this doesn't
seem like sufficient experience with a command-line, server-side
_pure statistical_ filter, at least ones like those he measured, which
require a much different level of experience than these tools.
I'm not trying to slander Mr. Cormack, but a closed test without enough
experience isn't going to yield the desired results. This is why an open
test is so important, as well as review. In my opinion, I do not believe
Cormack has made the effort to become as knowledgeable in
this area as he needs to be to run these tests.
(Some or all of) The Spam Filters Tested Were Misconfigured
As with many rushed tests, the testers don't always find the time to become
intimate with the tools they're testing. In this case, it was very obvious to
me when originally speaking with Cormack that he was using the software
incorrectly, but in his research paper it appears that the documentation was
not adequately consulted even up to the test.
It states on page 30 of the research paper that DSPAM doesn't support
Train-Everything mode, and that training was performed using Train-on-Error.
Train-Everything mode was the first mode available in DSPAM, and Train-on-Error
was only coded into the software as of version 2.10, so the test had to be
using Train-Everything without knowing it, and treating it like TOE.
(NOTE: I've been informed that Cormack would be correcting his paper to
reflect this). Many other incorrect statements about
the different filters suggest to me that the
testers still didn't understand the filters they'd been testing.
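For reference, a hedged sketch of the three training policies being discussed - Train-Everything (TEFT), Train-on-Error (TOE), and Train-until-Mature (TUM); the maturity threshold is a made-up number, not DSPAM's actual behavior:

    def should_train(mode, verdict_was_correct, messages_learned, maturity=2500):
        if mode == "TEFT":
            return True                          # learn every message
        if mode == "TOE":
            return not verdict_was_correct       # learn only the mistakes
        if mode == "TUM":
            # learn everything until mature, then only the mistakes
            return messages_learned < maturity or not verdict_was_correct
        raise ValueError("unknown training mode: " + mode)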
One of my last exchanges with Cormack before his testing involved his approach
to training. It appeared as though errors were not being retrained correctly
which I am confident made a significant contribution to his poor results
with DSPAM.
Instead of presenting a message as an error, it was submitted
as a corpusfed spam. This would have learned the tokens as spam, but not
un-learned the erroneous innocent hits on each token - so the message became
learned as both ham and spam. The tests also failed
to present the outputted message, but presented the original message for
retraining. Unless specifically configured to do so (his copy was not),
DSPAM looks for an embedded "watermark" it has added to each email it
processes. This watermark provides a serial number referencing the original
training data. When it cannot be found, only the message body is retrained
(e.g. it tries to do its best assuming the user forwarded in the spam, and
so you'd have their headers instead).
By providing the original message for retraining, and not the DSPAM
processed message (with watermark), DSPAM was (at the time at least)
training only half of each message - leaving the headers without retraining.
I provided him with this information, but I'm not entirely certain that the
corrections made were sufficient prior to testing.
This is understandably confusing, and was ambiguous enough of a "feature" that
it was removed in v3.0.0, but because the documentation was not followed, it
most likely caused unpredictable results in testing and practice.
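To make the corpusfeed-versus-error distinction described above concrete, here is a sketch using the toy token-count filter from the earlier sketches; it illustrates the concept only, not DSPAM's code:

    def corpusfeed_spam(filter_, tokens):
        filter_.train(tokens, is_spam=True)      # adds spam hits only

    def retrain_as_error(filter_, tokens, was_classified_spam):
        # Back out the hits recorded under the wrong class before re-learning
        # the message under the correct one.
        wrong = filter_.spam_counts if was_classified_spam else filter_.ham_counts
        for t in tokens:
            if wrong[t] > 0:
                wrong[t] -= 1                    # un-learn the erroneous hits
        filter_.train(tokens, is_spam=not was_classified_spam)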
It's very odd that the paper would reference v3.0.0's peak 99.991%
(which is just the peak - the highest it can go under ideal conditions), but
would be using 2.8 to measure this. One problem I think is evident is that
Mr. Cormack spoke with me in April (at which point v2.10 had been out for
over a month), but was apparently using v2.8 (a version released in January).
This may have been part of the problem with the results, as well as my
efforts to help him out.
In fact, we really don't know how these tools were configured or what backends they
used. DSPAM supports six different possible back-ends (some of which are
beta, and some of which are unsupported), as well as three different training
modes (TOE, TEFT, and TUM). DSPAM also supports Graham-Bayesian, Burton-Bayesian,
Geometric Mean, and Robinson's Chi-Square algorithms. On top of this, there
are two different algorithms for computing token value, and plenty of other
knobs. We have no idea how each tool was specifically
configured nor did anyone involved in testing appear to post configurations or specific
details about their testing approach.
The Tests Invalidate Themselves By Lack of Real-World Validation
The tests don't come close to reflecting real-world levels of accuracy
experienced by many filters. CRM-114 users experience typical levels of
accuracy surpassing 99.95%, yet the tests show otherwise. The same is true
for many other filters. Even the ones that were rated fairly did not reflect
the real-world accuracy I hear about. When the results of a test don't even come close to
human experience, the tests are possibly erroneous and need to be analyzed,
not published. If the tests are reviewed by one or two independent parties,
retried, and nothing can be found wrong with them, THEN publish them - even if
they go against the accepted performance...but that's not the case here. No
retrials were performed, no independent party confirmed the validity of these
tests, and as a result we ended up with some very oddball results.
Please note again, CRM114 is not my tool - I'm not affiliated with it in any
way except that I hold it in very high regard having seen how it functions
mathematically and am quite certain of its mathematical superiority in both
theory and practice.
I suspect these tests could have been compared at one point to Cormack's own
experience, which, as he informed me, was very poor with statistical filtering.
Having analyzed his configuration personally, I believe this was due primarily
to poor implementation on Cormack's part.
Conclusions
Many technical errors have been made in this test, so many in fact that the
test is beyond recovery in my opinion. I believe a new test is in order -
one that corrects the deficiencies outlined in this article, and more
importantly an open test that filter authors can be involved in.
As a result of the many technical deficiencies and the general mystery behind
how these tests were really performed, I do not believe these tests to be
credible - especially not credible enough to appear in any journal.
Mr. Lynam seems much more interested in the scientific process and
far less argumentative. I would be interested in seeing him pair up with a
different party to conduct a newer, better test.
I sincerely hope these errors are considered and improved upon in future
testing. I am confident that the testers will find tools such as CRM114 and
DSPAM to prove as extremely accurate as their loyal users are finding them.
I'm also confident that every last one of the statistical filters measured
will prove superior.
NOTE: I have spoken to Gordon about this article and although we have many
disagreements about his test and my comments, we are working on analyzing
what exactly went wrong (or at least that's my perspective) - not to further
discredit his test, but in what I hope will improve any future testing. It's
difficult though, as Cormack won't release the code used to perform his
testing or configuration logs for the software as of yet.