Re Zdziarski's Factual Errors
We shall not respond to Mr. Zdziarski's attacks, except to identify
the most outstanding factual errors and to note that ad hominem
arguments are irrelevant in assessing the validity of our work.
We encourage interested parties to read our paper.
1. Gold Standard. The bottom line is that for every case of disagreement
between any filter and the Gold Standard, X re-adjudicated the message.
That means, for example, that DSPAM's 116 ham misclassifications (false
positives) and 791 spam misclassifications (false negatives) were all
examined and verified by X to be misclassifications.
Further, X did examine every message at least once in constructing the
original Gold Standard. Any remaining errors in the Gold Standard would
be cases in which no filter contradicted the erroneous label; such errors
would therefore work to the advantage of the subject systems, which would
be credited with correct classifications.
Our paper states:
All subsequent disagreements between the gold standard and later
runs were also manually adjudicated, and all runs were repeated with
the updated gold standard. The results presented here are based on
this revised standard, in which all cases of disagreement have been
vetted manually.
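For concreteness, the adjudication cycle quoted above may be sketched in
shell. The file names and the two-column format (message file, then
ham/spam label) are illustrative assumptions, not part of the paper:

    # gold.idx and verdicts: one "<message-file> <ham|spam>" line per
    # message; the second file is produced by a filter run.
    sort gold.idx > g.sorted
    sort verdicts > v.sorted
    join g.sorted v.sorted | awk '$2 != $3 { print $1 }' > disagreements

Every message listed in "disagreements" is re-examined by hand, the gold
standard is corrected where the original judgment was wrong, and all runs
are repeated against the updated standard until no unvetted disagreement
remains.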
2. Learning configuration. DSPAM was configured exactly as specified in
the documentation supplied with v 2.8.3, and as discussed with Zdziarski
in numerous emails. One change was made as a result of this
correspondence: we used the output from DSPAM, rather than the original
message, as input to our "filtertrain" procedure.
v 2.8.3 has no flags for "train on error", and our report of DSPAM's
internal training behaviour was based on our understanding of our
April 23 correspondence with Zdziarski. Based on our very recent
correspondence with Zdziarski, we now understand that DSPAM internally
trains on every message, and we will note this in the paper. In
the meantime we have placed errata on our web page.
This descriptive characterization of DSPAM's internal behaviour has
no bearing on the test setup. We used DSPAM "out of the box" as
documented. More precisely, we implemented Algorithm 1 in the
following way:
    for each email (in arrival order):
        dspam --stdout --deliver-spam -d < email > dspamout
        if (email is ham and dspam reports spam)
            dspam --falsepositive < dspamout
        else if (email is spam and dspam reports ham)
            dspam --addspam < dspamout
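For readers who wish to reproduce this loop, a minimal shell rendering
follows. Only the dspam invocations are taken from our setup; the
gold-standard index (gold.idx, one "<message-file> <ham|spam>" line per
message, in arrival order) and the use of the X-DSPAM-Result header to
read DSPAM's verdict are assumptions made for illustration, and the
header test may need adjustment for a given version:

    #!/bin/sh
    # Sketch only: gold.idx and the X-DSPAM-Result test are assumptions.
    while read email label; do
        dspam --stdout --deliver-spam -d < "$email" > dspamout
        if grep -qi '^X-DSPAM-Result: *Spam' dspamout; then
            verdict=spam
        else
            verdict=ham
        fi
        if [ "$label" = ham ] && [ "$verdict" = spam ]; then
            dspam --falsepositive < dspamout   # report a false positive
        elif [ "$label" = spam ] && [ "$verdict" = ham ]; then
            dspam --addspam < dspamout         # report a missed spam
        fi
    done < gold.idx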
As Zdziarski points out, a misunderstanding as to version number
probably accounts for our miscommunication in this matter.
3. Our paper says "Zdziarski reports 99.95% to 99.991% accuracy for DSPAM
based on an unspecified methodology." Part of what is unspecified
is the version and configuration of DSPAM used in achieving these
results. Similar claims were reported at the time of the Slashdot
article "Two Spam Filters 10 Times As Accurate As Humans." Although
no configuration was stated at that time either, it is our understanding
that v 2.8.3 was the current stable release.
We believe it is scientifically appropriate to select a set of filters
and freeze them prior to the collection of results; our pilot evaluations
began in February.
4. The conclusion of this paper is that supervised statistical filters
greatly improve on the filtering capabilities of SpamAssassin's static
rule base.
At no point does the paper suggest that SpamAssassin's static rules
are better than statistical filters.
Only the statistical filtering component of SpamAssassin (with no
static rules) was compared against the other statistical filters,
including DSPAM.
5. We measured initial and final error rates, and plotted piecewise
and regression-based estimates of the error rates as functions of the
number of messages processed. So, for example, a reader who would
like to see the performance after 10,000 messages of training may
determine it from the figures.
As Zdziarski pointed out in an earlier draft, DSPAM's discontinuous
learning process can be observed in Figure 16.
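As an illustration of the piecewise estimate, the following awk script
computes the error rate over successive 1,000-message blocks; the input
format (a "results" file with one "<gold-label> <verdict>" line per
message, in arrival order) and the block size are illustrative
assumptions, and the regression-based fits in the figures are not
reproduced here:

    awk 'BEGIN { block = 1000 }
         { n++; if ($1 != $2) err++ }
         n % block == 0 { printf "%d %.4f\n", n, err / block; err = 0 }' results

Each output line gives the number of messages processed so far and the
error rate observed over the most recent block.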
Thomas Lynam
Gordon Cormack
June 24, 2004
Postscript
Zdziarski's comments have been revised many times since June 24.
We thank Zdziarski for removing or qualifying some of the most egregious
ad hominem statements.
In response to point 1 above, Zdziarski qualifies his assertions
about the manner in which we constructed our gold standard. We stand by our
original description, which is amplified above.
With regard to the configuration used for DSPAM, the description given
in point 2 accurately reflects our setup. This setup has been invariant
since April 23, and is the setup we used to evaluate DSPAM's
performance. The DSPAM distribution that we used is reproduced here.
This version does not have the training parameters TOE, TEFT, and TUM.
Its only training-related parameter, --enable-test-conditional,
is not recommended and was not used.
Zdziarski appears to have removed reference to his unsubstantiated
accuracy claims for DSPAM, which are noted in point 3.
We re-assert points 4 and 5.
Gordon Cormack
Thomas Lynam
June 27, 2004
DSPAM Version and Configuration Information
As stated in our paper, we used DSPAM 2.8.3.
The configuration log is reproduced here.