Advances in machine learning and natural language processing have made text classification increasingly attractive for information retrieval. However, developing text classifiers is challenging when no labeled data are available for a rare category of interest. Finding instances of the rare class by uniform random sampling can be inefficient and costly because of the category's low base rate. This work presents an approach that combines the strengths of text classification and Boolean retrieval to help learn rare concepts of interest. As a motivating example, we use the task of finding conversations that reference firearm injury or violence in the Crisis Text Line database. Identifying rare categories, such as firearm injury or violence, can improve crisis lines' ability to support people with firearm-related crises or to provide appropriate resources. Our approach outperforms a set of iteratively refined Boolean queries and achieves a recall of 0.91 on a test set generated by a process independent of our study. Our results suggest that text classification initialized with Boolean retrieval can be effective for finding rare categories of interest and can improve on the precision of Boolean retrieval alone.
© 2023 RTI International.