From firstname.lastname@example.org Mon May 29 21:40:42 2006
Subject: Timetable & New Spam Evaluation Kit
Here are the deadlines for the Spam Track:
July 13 - Filter submission.
August 23 - Public corpus result submission.
The filters you submit on July 13 must work with
the spam filter evaluation toolkit (see below).
After filter submission I will release the
public corpora. You should then run the
filters you submitted (without modification)
on the public corpora and submit the result
files by August 23.
There will be two public corpora: one with
about 40,000 mostly-English emails, and one with
Chinese emails. Here's a small sample Chinese
I have 175,000 Chinese emails. Do you want
the whole thing, or should I cut it to 100,000
or even fewer? Comments welcome.
If you want the public corpora early, I'll
release them provided you assure me that you
have already submitted your filter.
We have created a new version of the evaluation
toolkit. It should be "backwards compatible" in
that previous filters and run-files should work.
It has been extended to handle delayed feedback.
The toolkit includes the SpamAssassin corpus -
both ideal feedback (as for TREC 2005) and
delayed feedback (which TREC 2006 will do
in addition to immediate feedback).
As well as the toolkit there's an update for the
TREC 2005 public corpus -- for delayed feedback.
Download this (and the 2005 corpus if you don't
already have it), too.
Note that the delayed feedback test is considerably
"harder" than the ideal feedback, yielding inferior
results. Feel free to investigate methods to
overcome this difference in your filter.
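
To see why delayed feedback tends to give worse results, here is a
minimal sketch in Python. It is not the toolkit's code. ToyFilter, run,
and the fixed delay measured in messages are all assumptions made for
illustration; the toolkit's actual delay schedule is not described in
this message. The only difference between the two modes is when the
filter gets to train on the true label.

```python
from collections import Counter, deque


class ToyFilter:
    """Hypothetical word-count filter, for illustration only."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        words = text.split()
        spam_votes = sum(self.counts["spam"][w] for w in words)
        ham_votes = sum(self.counts["ham"][w] for w in words)
        # Ties (including "never seen these words") default to ham.
        return "spam" if spam_votes > ham_votes else "ham"

    def train(self, text, label):
        self.counts[label].update(text.split())


def run(messages, delay=0):
    """Classify messages in order and count misclassifications.

    Each true label reaches the filter only after `delay` further
    messages have been classified (assumed scheme). delay=0 models
    immediate feedback.
    """
    filt = ToyFilter()
    pending = deque()  # messages whose labels are not yet released
    errors = 0
    for text, gold in messages:
        if filt.classify(text) != gold:
            errors += 1
        pending.append((text, gold))
        # Release feedback for anything older than `delay` messages.
        while len(pending) > delay:
            filt.train(*pending.popleft())
    return errors


if __name__ == "__main__":
    stream = [("buy pills", "spam"), ("meeting notes", "ham")] * 10
    print("immediate feedback errors:", run(stream, delay=0))
    print("delayed feedback errors:  ", run(stream, delay=4))
```

On this toy stream the filter with immediate feedback misses only the
first spam, while the delayed run keeps missing spam until the first
labels arrive. A real filter might narrow that gap by, for example,
provisionally learning from its own predictions, which is the kind of
method the note above invites you to investigate.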