Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It
Published online by Cambridge University Press: 19 March 2018
Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examine the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by characterizing the variability that changes in preprocessing choices may induce when analyzing a particular dataset. By making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
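To make the idea of examining sensitivity under alternate preprocessing regimes concrete, the following is a minimal Python sketch. It is not the preText software or its algorithm: the three binary preprocessing steps, the toy stopword list, the example documents, and all function names are assumptions chosen for illustration. It enumerates every combination of the steps, recomputes pairwise cosine distances between documents under each combination, and reports which document pair is closest in each case.

```python
from collections import Counter
from itertools import product
import math
import string

# Hypothetical toy stopword list and preprocessing steps (illustration only).
STOPWORDS = {"the", "a", "of", "and", "is", "to"}
STEPS = ("lowercase", "remove_punctuation", "remove_stopwords")


def preprocess(text, lowercase, remove_punctuation, remove_stopwords):
    """Return a bag-of-words Counter for one document under the given choices."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return Counter(tokens)


def cosine_distance(a, b):
    """Cosine distance between two term-count Counters (1.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(docs, **choices):
    """Distances between all document pairs under one preprocessing specification."""
    bags = [preprocess(d, **choices) for d in docs]
    return {
        (i, j): cosine_distance(bags[i], bags[j])
        for i in range(len(bags))
        for j in range(i + 1, len(bags))
    }


def sensitivity(docs):
    """Map each on/off combination of STEPS to its pairwise document distances."""
    results = {}
    for flags in product([False, True], repeat=len(STEPS)):
        choices = dict(zip(STEPS, flags))
        results[flags] = pairwise_distances(docs, **choices)
    return results


def closest_pair(distances):
    """Index pair of the two most similar documents."""
    return min(distances, key=distances.get)


# Made-up example documents.
DOCS = [
    "The Tax bill is a burden to the economy.",
    "the tax bill and the economy",
    "Health care is a right of the people!",
]

if __name__ == "__main__":
    for flags, distances in sensitivity(DOCS).items():
        spec = dict(zip(STEPS, flags))
        print(spec, "closest pair:", closest_pair(distances))
```

If the closest pair, or the ordering of distances more generally, changes from one specification to another, the substantive conclusions drawn from the corpus may depend on the preprocessing choice. That is the kind of variability the procedure described above is meant to characterize.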
Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology.
Authors’ note: We thank Will Lowe, Scott de Marchi and Brandon Stewart for comments on an earlier draft, and Pablo Barbera for providing the Twitter data used in this paper. Audiences at New York University, University of California San Diego, the Political Methodology meeting (2017), Duke University, University of Michigan, and the International Methods Colloquium provided helpful comments. Suggestions from the editor of Political Analysis and two anonymous referees allowed us to improve our article considerably. This research was supported by the National Science Foundation under IGERT Grant DGE-1144860. Replication data for this paper are available via Denny and Spirling (2017). The preText software is available at github.com/matthewjdenny/preText
Contributing Editor: R. Michael Alvarez