Re Zdziarski's Factual Errors
We shall not respond to Mr. Zdziarski's attacks, except to identify
the most outstanding factual errors and to note that ad hominem
arguments are irrelevant in assessing the validity of our work.
We encourage interested parties to read our paper.
1. Gold Standard. The bottom line is that for every case of disagreement
between any filter and the Gold Standard, X re-adjudicated the message.
That means, for example, that DSPAM's 116 ham misclassifications (false
positives) and 791 spam misclassifications (false negatives) were all
examined and verified by X to be misclassifications.
Further, X did examine every message at least once in constructing the
original Gold Standard. Any remaining errors in the Gold Standard would
be cases in which no filter contradicted the erroneous label; such errors
would therefore work to the advantage of the subject systems, which would
be credited with correct classifications.
Our paper states:
All subsequent disagreements between the gold standard and later
runs were also manually adjudicated, and all runs were repeated with
the updated gold standard. The results presented here are based on
this revised standard, in which all cases of disagreement have been
vetted manually.
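For concreteness, the adjudication cycle quoted above may be sketched in
shell. The file names and the two-column format (message file, then
ham/spam label) are illustrative assumptions, not part of the paper:

    # gold.idx and verdicts: one "<message-file> <ham|spam>" line per
    # message; the second file is produced by a filter run.
    sort gold.idx > g.sorted
    sort verdicts > v.sorted
    join g.sorted v.sorted | awk '$2 != $3 { print $1 }' > disagreements

Every message listed in "disagreements" is re-examined by hand, the gold
standard is corrected where the original judgment was wrong, and all runs
are repeated against the updated standard until no unvetted disagreement
remains.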
2. Learning configuration. DSPAM was configured exactly as specified in
the documentation supplied with v 2.8.3, and as discussed with Zdziarski
in numerous emails. One change was made as a result of this
correspondence: we used the output from DSPAM, rather than the original
message, as input to our "filtertrain" procedure.
v 2.8.3 has no flags for "train on error", and our report of DSPAM's
internal training behaviour was based on our understanding of our
April 23 correspondence with Zdziarski. Based on our very recent
correspondence with Zdziarski, we now understand that DSPAM internally
trains on every message, and we will note this in the paper. In
the meantime we have placed errata on our web page.
This descriptive characterization of DSPAM's internal behaviour has
no bearing on the test setup. We used DSPAM "out of the box" as
documented. More precisely, we implemented Algorithm 1 in the
following way:
    for each email (in arrival order):
        dspam --stdout --deliver-spam -d < email > dspamout
        if (email is ham and dspam reports spam)
            dspam --falsepositive < dspamout
        else if (email is spam and dspam reports ham)
            dspam --addspam < dspamout
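For readers who wish to reproduce this loop, a minimal shell rendering
follows. Only the dspam invocations are taken from our setup; the
gold-standard index (gold.idx, one "<message-file> <ham|spam>" line per
message, in arrival order) and the use of the X-DSPAM-Result header to
read DSPAM's verdict are assumptions made for illustration, and the
header test may need adjustment for a given version:

    #!/bin/sh
    # Sketch only: gold.idx and the X-DSPAM-Result test are assumptions.
    while read email label; do
        dspam --stdout --deliver-spam -d < "$email" > dspamout
        if grep -qi '^X-DSPAM-Result: *Spam' dspamout; then
            verdict=spam
        else
            verdict=ham
        fi
        if [ "$label" = ham ] && [ "$verdict" = spam ]; then
            dspam --falsepositive < dspamout   # report a false positive
        elif [ "$label" = spam ] && [ "$verdict" = ham ]; then
            dspam --addspam < dspamout         # report a missed spam
        fi
    done < gold.idx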
As Zdziarski points out, a misunderstanding as to version number
probably accounts for our miscommunication in this matter.
3. Our paper says "Zdziarski reports 99.95% to 99.991% accuracy for DSPAM
based on an unspecified methodology." Part of what is unspecified
is the version and configuration of DSPAM used in achieving these
results. Similar claims were reported at the time of the Slashdot
article "Two Spam Filters 10 Times As Accurate As Humans." Although
no configuration was stated at that time either, it is our understanding
that v 2.8.3 was the current stable release.
We believe it is scientifically appropriate to select a set of filters
and freeze them prior to the collection of results; our pilot evaluations
began in February.
4. The conclusion of this paper is that supervised statistical filters
greatly improve on the filtering capabilities of SpamAssassin's static
rule base.
At no point does the paper suggest that SpamAssassin's static rules
are better than statistical filters.
Only the statistical filtering component of SpamAssassin (with no
static rules) was compared against the other statistical filters,
including DSPAM.
5. We measured initial and final error rates, and plotted piecewise
and regression-based estimates of the error rates as functions of the
number of messages processed. So, for example, a reader who would
like to see the performance after 10,000 messages of training may
determine it from the figures.
As Zdziarski pointed out in an earlier draft, DSPAM's discontinuous
learning process can be observed in Figure 16.
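As an illustration of the piecewise estimate, the following awk script
computes the error rate over successive 1,000-message blocks; the input
format (a "results" file with one "<gold-label> <verdict>" line per
message, in arrival order) and the block size are illustrative
assumptions, and the regression-based fits in the figures are not
reproduced here:

    awk 'BEGIN { block = 1000 }
         { n++; if ($1 != $2) err++ }
         n % block == 0 { printf "%d %.4f\n", n, err / block; err = 0 }' results

Each output line gives the number of messages processed so far and the
error rate observed over the most recent block.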
Thomas Lynam
Gordon Cormack
June 24, 2004
Postscript
Zdziarski's comments have been revised many times since June 24.
We thank Zdziarski for removing or qualifying some of the most egregious
ad hominem statements.
In response to point 1 above, Zdziarski qualifies his assertions
about the manner in which we constructed our gold standard. We stand by our
original description, which is amplified above.
With regard to the configuration used for DSPAM, the description given
in point 2 accurately reflects our setup. This setup has been invariant
since April 23, and is the setup we used to evaluate DSPAM's
performance. The DSPAM distribution that we used is reproduced here.
This version does not have the training parameters TOE, TEFT, and TUM.
Its only training-related parameter, --enable-test-conditional,
is not recommended and was not used.
Zdziarski appears to have removed reference to his unsubstantiated
accuracy claims for DSPAM, which are noted in point 3.
We re-assert points 4 and 5.
Gordon Cormack
Thomas Lynam
June 27, 2004
DSPAM Version and Configuration Information
As stated in our paper, we used DSPAM 2.8.3.
The configuration log is reproduced here.