Date: Thu, 8 Jun 2006 07:21:25 -0400 (EDT)
From: "Gordon V. Cormack"
To: Multiple recipients of list
Subject: Pilot/Deadline/Restarting Spam Filters

TREC 2006 Spam Track
Notes on Submissions and Robustness

1. This year there is no explicit pilot filter submission. If you did
   not do the task last year, or are unsure whether your implementation
   will work in the test environment, please contact me directly.
   Within reason, I can run examples of your code on a test system and
   tell you the result. I cannot promise to do any such run in the last
   week before the submission deadline.

2. This year deadlines will be strict and enforced by NIST. All
   submissions must be made through the NIST submission system and must
   strictly adhere to the submission requirements. This means on time,
   with one file per filter and one file per result run. There is also
   a filter and run naming convention that must be used. The dates have
   been posted with the guidelines; details of the submission system
   will be announced in due course.

3. If a "classify" or "train" command crashes, the result for the
   message will be recorded as "class=ham score=0" and the filter will
   be resumed with the next message. If you are using a client/server
   system, you should consider the possibility that your client or
   server may crash, and detect and recover from this eventuality.
   Also, your "initialize" should work properly even if there is a
   server still running from a previous run (whether that run failed or
   completed successfully).

4. Filters that use more than 2 seconds (wall clock time) per message
   (cumulative, with a minute's grace for startup) will be killed, and
   the result will be recorded as "class=ham score=0" for any
   unprocessed messages.

5. Each participant may submit up to four filters for each task
   (filtering and active learning). At least one filtering submission
   per group will be evaluated by us on several corpora -- some with
   ideal feedback and some with delayed feedback.

6. Each participant must run the same filters on four public corpora
   and submit the results:

   - TREC 2006 English (no delay)   -- 72K messages
   - TREC 2006 English (with delay) -- 72K messages
   - TREC 2006 Chinese (no delay)   -- 65K messages
   - TREC 2006 Chinese (with delay) -- 65K messages
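
For participants who want to check their own filters against items 3 and 4 before submitting, the sketch below may help. It is not the track's evaluation harness. It assumes one reading of item 4, namely that the cumulative allowance after n messages is 60 + 2n seconds of wall clock time, and it assumes that a crash shows up as a non-zero exit status or a failure to start. The names time_budget, over_budget and classify_or_fallback are hypothetical, as is the demo command.

```python
import subprocess
import sys
from typing import Sequence

# Result recorded for a crashed or killed filter, as stated in items 3 and 4.
FALLBACK = "class=ham score=0"

# Assumed reading of item 4: 2 s per message, cumulative, plus 60 s startup grace.
PER_MESSAGE_SECONDS = 2.0
STARTUP_GRACE_SECONDS = 60.0


def time_budget(messages_processed: int) -> float:
    """Cumulative wall-clock allowance (seconds) after this many messages."""
    return STARTUP_GRACE_SECONDS + PER_MESSAGE_SECONDS * messages_processed


def over_budget(elapsed_seconds: float, messages_processed: int) -> bool:
    """True if the filter has used more time than the assumed allowance."""
    return elapsed_seconds > time_budget(messages_processed)


def classify_or_fallback(command: Sequence[str]) -> str:
    """Run one classify command; return its output, or FALLBACK on a crash.

    A crash is assumed to mean a non-zero exit status or a command that
    cannot be started at all. Empty output is treated the same way.
    """
    try:
        done = subprocess.run(
            list(command), capture_output=True, text=True, check=True
        )
    except (subprocess.CalledProcessError, OSError):
        return FALLBACK
    line = done.stdout.strip()
    return line if line else FALLBACK


if __name__ == "__main__":
    # Hypothetical stand-in for a real filter's classify command.
    demo = [sys.executable, "-c", "print('class=spam score=0.9')"]
    print(classify_or_fallback(demo))
    print(time_budget(72000))  # allowance for a 72K-message corpus
```

Under this reading, a filter run over one of the 72K-message corpora would have at most 60 + 2 * 72000 = 144060 seconds in total, so a slow start can be made up later as long as the running total stays under the line.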