A Multi-Corpus Evaluation of Dynamic Markov Coding for Spam Filtering

Gordon V. Cormack, professor

David R. Cheriton School of Computer Science
University of Waterloo

presented to Microsoft, April 3, 2006

Video presentation Presentation slides (pdf)

In the TREC 2005 Spam Evaluation Track, a number of popular spam filters - all owing their heritage to Graham's 'A Plan for Spam' - did quite well. Machine learning techniques reported elsewhere to perform well were hardly represented in the participating filters, and not represented at all in the better results. A non-traditional technique - Prediction by Partial Matching (PPM) - performed exceptionally well, at or near the top of every test. Are the TREC results an anomaly? Is PPM really the best method for spam filtering? How are these results to be reconciled with others showing that methods like Support Vector Machines (SVM) are superior? I address these issues in three different ways. First, I show that my method of Dynamic Markov Coding (DMC), a PPM competitor, achieves results at least as good as PPM on the TREC tests. Second, I show that PPM and DMC outperform SVM - and all other reported results of which I am aware - on the public Ling Spam, PU1, and PU3 corpora using 10-fold cross validation. Third, I adapt implementations of SVM, Perceptron, and kNN filters to the TREC test methods, where they demonstrate inferior performance, even with pre-training that should be to their advantage.