
Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Published online by Cambridge University Press:  19 March 2018

Matthew J. Denny*
Affiliation:
203 Pond Lab, Pennsylvania State University, University Park, PA 16802, USA. Email: mdenny@psu.edu
Arthur Spirling
Affiliation:
Office 405, 19 West 4th St., New York University, New York, NY 10012, USA. Email: arthur.spirling@nyu.edu

Abstract

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examine the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by providing a characterization of the variability that changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.

Type
Articles
Copyright
Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology. 


Footnotes

Authors’ note: We thank Will Lowe, Scott de Marchi and Brandon Stewart for comments on an earlier draft, and Pablo Barbera for providing the Twitter data used in this paper. Audiences at New York University, University of California San Diego, the Political Methodology meeting (2017), Duke University, University of Michigan, and the International Methods Colloquium provided helpful comments. Suggestions from the editor of Political Analysis, and two anonymous referees, allowed us to improve our article considerably. This research was supported by the National Science Foundation under IGERT Grant DGE-1144860. Replication data for this paper are available via Denny and Spirling (2017). The preText software is available at github.com/matthewjdenny/preText.

Contributing Editor: R. Michael Alvarez

References

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.
Buckland, S. T., Burnham, K. P., and Augustin, N. H. 1997. Model selection: An integral part of inference. Biometrics 53(2):603–618.
Catalinac, Amy. 2016. Pork to policy: The rise of programmatic campaigning in Japanese elections. Journal of Politics 78(1):1–18.
Chang, Jonathan, Gerrish, Sean, Wang, Chong, Boyd-Graber, Jordan L., and Blei, David M. 2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems 22, ed. Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. Curran Associates, Inc., pp. 288–296.
Denny, Matthew, and Spirling, Arthur. 2017. “Dataverse replication data for: text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it.” http://dx.doi.org/10.7910/DVN/XRR0HM.
Diermeier, Daniel, Godbout, Jean-François, Yu, Bei, and Kaufmann, Stefan. 2011. Language and ideology in Congress. British Journal of Political Science 42(01):31–55.
D’Orazio, Vito, Landis, Steven, Palmer, Glenn, and Schrodt, Philip. 2014. Separating the wheat from the chaff: Applications of automated document classification using support vector machines. Political Analysis 22(2):224–242.
Gelman, Andrew. 2013. Preregistration of studies and mock reports. Political Analysis 21(1):40–41.
Gelman, Andrew, and Loken, Eric. 2014. The statistical crisis in science. American Scientist 102(6):460–465.
Grimmer, J. 2010. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis 18(1):1–35.
Grimmer, Justin, and Stewart, Brandon M. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3):267–297.
Grimmer, Justin, and King, Gary. 2011. General purpose computer-assisted clustering and conceptualization. Proceedings of the National Academy of Sciences of the United States of America 108(7):2643–2650.
Handler, Abram, Denny, Matthew J., Wallach, Hanna, and O’Connor, Brendan. 2016. Bag of what? Simple noun phrase extraction for text analysis. Proceedings of the workshop on natural language processing and computational social science at the 2016 conference on empirical methods in natural language processing, https://brenocon.com/handler2016phrases.pdf.
Hopkins, Daniel, and King, Gary. 2010. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1):229–247.
James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An introduction to statistical learning. New York: Springer.
Jensen, David D., and Cohen, Paul R. 2000. Multiple comparisons in induction algorithms. Machine Learning 38(3):309–338.
Jones, Tudor. 1996. Remaking the Labour Party: From Gaitskell to Blair. New York: Routledge.
Jurafsky, Daniel, and Martin, James H. 2008. Speech and language processing: An introduction to natural language processing, computational linguistics and speech recognition. Prentice Hall.
Justeson, John S., and Katz, Slava M. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(01):9–27.
Kavanagh, Dennis. 1997. The reordering of British politics: Politics after Thatcher. Oxford University Press.
King, Gary, Lam, Patrick, and Roberts, Margaret E. 2017. Computer-assisted keyword and document set discovery from unstructured text. American Journal of Political Science. Preprint, https://gking.harvard.edu/files/gking/files/ajps12291_final.pdf.
Lauderdale, Benjamin, and Herzog, Alexander. 2016. Measuring political positions from legislative speech. Political Analysis 24(2):1–21.
Laver, Michael, Benoit, Kenneth, and Garry, John. 2003. Extracting policy positions from political texts using words as data. The American Political Science Review 97(2):311–331.
Lowe, Will. 2008. Understanding wordscores. Political Analysis 16(4 SPEC. ISS.):356–371.
Lowe, Will, and Benoit, Kenneth. 2013. Validating estimates of latent traits from textual data using human judgment as a benchmark. Political Analysis 21(3):298–313.
Manning, Christopher D., and Schütze, Hinrich. 1999. Foundations of statistical natural language processing. MIT Press.
Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. 2008. An introduction to information retrieval. Cambridge: Cambridge University Press.
Monroe, Burt L., Colaresi, Michael P., and Quinn, Kevin M. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16:372–403.
Moore, Ryan, Powell, Elinor, and Reeves, Andrew. 2013. Driving support: workers, PACs, and congressional support of the auto industry. Business and Politics 15(2):137–162.
Pang, Bo, Lee, Lillian, and Vaithyanathan, Shivakumar. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 79–86.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14(3):130–137.
Proksch, Sven-Oliver, and Slapin, Jonathan B. 2010. Position taking in European Parliament speeches. British Journal of Political Science 40(03):587–611.
Pugh, Martin. 2011. Speak for Britain!: A new history of the Labour Party. New York: Random House.
Quinn, Kevin M., Monroe, Burt L., Colaresi, Michael, Crespin, Michael H., and Radev, Dragomir R. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1):209–228.
Roberts, Margaret E., Stewart, Brandon M., Tingley, Dustin, Lucas, Christopher, Leder-Luis, Jetson, Gadarian, Shana Kushner, Albertson, Bethany, and Rand, David G. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science 58(4):1064–1082.
Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1):1–47.
Slapin, Jonathan B., and Proksch, Sven-Oliver. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52:705–722.
Spirling, Arthur. 2012. U.S. treaty making with American Indians: Institutional change and relative power, 1784–1911. American Journal of Political Science 56(1):84–97.
Steegen, Sara, Tuerlinckx, Francis, Gelman, Andrew, and Vanpaemel, Wolf. 2016. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science 11(5):702–712.
Wallach, Hanna M., Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. 2009. Evaluation methods for topic models. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09 (4):1–8. http://portal.acm.org/citation.cfm?doid=1553374.1553515.
Yano, Tae, Smith, Noah A., and Wilkerson, John D. 2012. Textual predictors of bill survival in congressional committees. Conference of the North American chapter of the association for computational linguistics, pp. 793–802.
Supplementary material: File

Denny and Spirling supplementary material 1

Online Appendix
