From trecspam@nist.gov Mon May 29 21:40:42 2006
Subject: Timetable & New Spam Evaluation Kit

Here are the deadlines for the Spam Track:

  July 13 - Filter submission.
  August 23 - Public corpus result submission.

The filters you submit on July 13 must work with the spam filter evaluation toolkit (see below). After filter submission I will release the public corpora. You should then run the filters you submitted (without modification) on the public corpora and submit the result files by August 23.

There will be two public corpora: one with about 40,000 mostly-English emails, and one with Chinese emails. Here's a small sample Chinese corpus:

  http://plg.uwaterloo.ca/~gvcormac/corpus/

I have 175,000 Chinese emails. Do you want the whole thing, or should I cut it to 100,000 or even fewer? Comments welcome.

If you want the public corpora early, I'll release them provided you assure me that you have already submitted your filter.

--

We have created a new version of the evaluation toolkit. It should be "backwards compatible" in that previous filters and run-files should work. It has been extended to handle delayed feedback.

  http://plg.uwaterloo.ca/~gvcormac/jig

The toolkit includes the SpamAssassin corpus, with both ideal feedback (as for TREC 2005) and delayed feedback (which TREC 2006 will do in addition to immediate feedback).

As well as the toolkit, there's an update for the TREC 2005 public corpus that adds delayed feedback. Download this too (and the 2005 corpus, if you don't already have it).

--

Note that the delayed feedback test is considerably "harder" than the ideal feedback test and yields inferior results. Feel free to investigate methods to overcome this difference in your filter.
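
To make the difference between immediate and delayed feedback concrete, here is a rough Python sketch of an online evaluation loop. It is only an illustration and does not use the toolkit's actual interface: ToyFilter, run and the delay parameter are hypothetical names, and the token-count classifier is a stand-in for a real filter. The assumption is that with delayed feedback the true label of a message reaches the filter only after some number of later messages have already been classified.

```python
from collections import Counter, deque


class ToyFilter:
    """Hypothetical token-count filter; a stand-in for a real spam filter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        # Score a message by how often its tokens were seen in each class.
        tokens = text.lower().split()
        spam = sum(self.counts["spam"][t] for t in tokens)
        ham = sum(self.counts["ham"][t] for t in tokens)
        return "spam" if spam > ham else "ham"

    def train(self, text, label):
        self.counts[label].update(text.lower().split())


def run(messages, delay=0):
    """Classify messages in order, revealing each true label `delay` messages late.

    delay=0 models immediate feedback. Returns the number of misclassifications.
    """
    filt = ToyFilter()
    pending = deque()  # messages classified but not yet trained on
    errors = 0
    for text, gold in messages:
        if filt.classify(text) != gold:
            errors += 1
        pending.append((text, gold))
        # Release feedback only for messages that are at least `delay` old.
        while len(pending) > delay:
            filt.train(*pending.popleft())
    return errors
```

Calling run(messages, delay=0) and run(messages, delay=2) on the same ordered list of (text, label) pairs shows the effect: with a delay, the filter classifies more messages before it has learned anything from the earlier ones, so it tends to make more errors early in the stream.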