Scalability of Continuous Active Learning for Reliable High-Recall Text Classification
Gordon V. Cormack & Maura R. Grossman
To be presented at CIKM 2016.
Abstract
For finite document collections, continuous active learning
("CAL") has been observed to achieve high recall with high probability,
at a labeling cost asymptotically proportional to the number of relevant
documents. As the size of the collection increases, the number of relevant
documents typically increases as well, thereby limiting the applicability
of CAL to low-prevalence, high-stakes classes, such as evidence in legal
proceedings, or security threats, where human effort proportional to the
number of relevant documents is justified. We present a scalable version
of CAL ("S-CAL") that requires O(log N) labeling effort and O(N log N)
computational effort, where N is the number of unlabeled training
examples, to construct a classifier whose effectiveness for a given
labeling cost compares favorably with previously reported methods. At
the same time, S-CAL offers a calibrated estimate of class prevalence,
recall, and precision, facilitating both threshold setting and
determination of the adequacy of the classifier.
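The continuous active learning protocol the abstract builds on can be sketched in a few lines: repeatedly train on all labels gathered so far, rank the unlabeled documents, and send the top-ranked batch for human review. The sketch below is a toy illustration, not the authors' implementation: a naive word-frequency score stands in for the actual supervised learner, the synthetic corpus and topical terms are invented, and the stop-after-several-empty-batches rule is a simple heuristic rather than the paper's stopping criterion.

```python
import random

def make_collection(n_docs=2000, prevalence=0.02, seed=0):
    # Toy corpus: each document is a bag of words; relevant documents
    # additionally contain three topical terms. (Synthetic data, not
    # the collections evaluated in the paper.)
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(500)]
    topic = ["privilege", "retention", "custodian"]  # hypothetical terms
    docs, labels = [], []
    for _ in range(n_docs):
        rel = rng.random() < prevalence
        docs.append(rng.sample(vocab, 20) + (topic if rel else []))
        labels.append(rel)
    return docs, labels

def cal(docs, oracle, seed_idx, batch_size=10, patience=3):
    # CAL loop: retrain on all labels so far, rank the unlabeled pool,
    # and label the top batch.  A word-count scorer stands in for a
    # real classifier; stopping after `patience` consecutive batches
    # with no relevant hits is a simplification.
    rel_counts, nonrel_counts = {}, {}
    labeled = {}

    def record(i, rel):
        labeled[i] = rel
        counts = rel_counts if rel else nonrel_counts
        for w in docs[i]:
            counts[w] = counts.get(w, 0) + 1

    record(seed_idx, True)  # one known-relevant seed document
    misses = 0
    while misses < patience and len(labeled) < len(docs):
        unlabeled = [i for i in range(len(docs)) if i not in labeled]
        unlabeled.sort(key=lambda i: sum(
            rel_counts.get(w, 0) - nonrel_counts.get(w, 0)
            for w in docs[i]), reverse=True)
        hits = 0
        for i in unlabeled[:batch_size]:
            rel = oracle(i)  # simulated human reviewer
            record(i, rel)
            hits += rel
        misses = 0 if hits else misses + 1
    return labeled

docs, labels = make_collection()
labeled = cal(docs, labels.__getitem__, seed_idx=labels.index(True))
found = sum(labels[i] for i in labeled)
recall = found / sum(labels)
```

Note that labeling effort here still scales with the number of relevant documents, which is the limitation the abstract identifies; S-CAL's O(log N) labeling cost is instead consistent with growing batch sizes geometrically while labeling only a capped subsample of each batch.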