Scalability of Continuous Active Learning for Reliable High-Recall Text Classification
Gordon V. Cormack & Maura R. Grossman
To be presented at CIKM 2016.
Abstract
For finite document collections, continuous active learning
("CAL") has been observed to achieve high recall with high probability,
at a labeling cost asymptotically proportional to the number of relevant
documents. As the size of the collection increases, the number of relevant
documents typically increases as well, thereby limiting the applicability
of CAL to low-prevalence, high-stakes classes, such as evidence in legal
proceedings, or security threats, where human effort proportional to the
number of relevant documents is justified. We present a scalable version
of CAL ("S-CAL") that requires O(log N) labeling effort and O(N log N)
computational effort, where N is the number of unlabeled training
examples, to construct a classifier whose effectiveness for a given
labeling cost compares favorably with previously reported methods. At
the same time, S-CAL offers a calibrated estimate of class prevalence,
recall, and precision, facilitating both threshold setting and
determination of the adequacy of the classifier.
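The continuous active learning protocol the abstract builds on can be sketched in a few lines: repeatedly train on all labels gathered so far, rank the unlabeled documents, and send the top-ranked batch for human review. The sketch below is a toy illustration, not the authors' implementation: a naive word-frequency score stands in for the actual supervised learner, the synthetic corpus and topical terms are invented, and the stop-after-several-empty-batches rule is a simple heuristic rather than the paper's stopping criterion.

```python
import random

def make_collection(n_docs=2000, prevalence=0.02, seed=0):
    # Toy corpus: each document is a bag of words; relevant documents
    # additionally contain three topical terms. (Synthetic data, not
    # the collections evaluated in the paper.)
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(500)]
    topic = ["privilege", "retention", "custodian"]  # hypothetical terms
    docs, labels = [], []
    for _ in range(n_docs):
        rel = rng.random() < prevalence
        docs.append(rng.sample(vocab, 20) + (topic if rel else []))
        labels.append(rel)
    return docs, labels

def cal(docs, oracle, seed_idx, batch_size=10, patience=3):
    # CAL loop: retrain on all labels so far, rank the unlabeled pool,
    # and label the top batch.  A word-count scorer stands in for a
    # real classifier; stopping after `patience` consecutive batches
    # with no relevant hits is a simplification.
    rel_counts, nonrel_counts = {}, {}
    labeled = {}

    def record(i, rel):
        labeled[i] = rel
        counts = rel_counts if rel else nonrel_counts
        for w in docs[i]:
            counts[w] = counts.get(w, 0) + 1

    record(seed_idx, True)  # one known-relevant seed document
    misses = 0
    while misses < patience and len(labeled) < len(docs):
        unlabeled = [i for i in range(len(docs)) if i not in labeled]
        unlabeled.sort(key=lambda i: sum(
            rel_counts.get(w, 0) - nonrel_counts.get(w, 0)
            for w in docs[i]), reverse=True)
        hits = 0
        for i in unlabeled[:batch_size]:
            rel = oracle(i)  # simulated human reviewer
            record(i, rel)
            hits += rel
        misses = 0 if hits else misses + 1
    return labeled

docs, labels = make_collection()
labeled = cal(docs, labels.__getitem__, seed_idx=labels.index(True))
found = sum(labels[i] for i in labeled)
recall = found / sum(labels)
```

Note that labeling effort here still scales with the number of relevant documents, which is the limitation the abstract identifies; S-CAL's O(log N) labeling cost is instead consistent with growing batch sizes geometrically while labeling only a capped subsample of each batch.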