Lost in Space: Geolocation in Event Data

Published online by Cambridge University Press:  06 July 2018

Abstract

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that evaluates each location mention to be either correct or incorrect. We extract contextual information from texts, i.e., N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.
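The contextual features named in the abstract (N-gram patterns around a location word, mention frequency, and sentence context) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mention_features`, the `event_words` set, the window size, and the specific feature names are all hypothetical choices made for illustration, and the sketch handles only single-word location names. The resulting feature dictionary is the kind of input that could be passed to any supervised classifier trained on hand-labeled correct/incorrect mentions.

```python
import re
from collections import Counter


def sentences(text):
    """Split text into sentences on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[A-Za-z']+", sentence.lower())


def mention_features(text, location, event_words, window=2):
    """Build a feature dict for one candidate location word in one article.

    All feature names here are illustrative assumptions, not the paper's.
    """
    loc = location.lower()
    sents = [tokenize(s) for s in sentences(text)]
    all_tokens = [t for s in sents for t in s]
    counts = Counter(all_tokens)

    features = {
        # Mention frequency: raw count and share of all tokens.
        "mention_freq": counts[loc],
        "mention_share": counts[loc] / max(len(all_tokens), 1),
        # Sentence context: position and co-occurrence with event words.
        "in_first_sentence": int(bool(sents) and loc in sents[0]),
        "shares_sentence_with_event_word": 0,
    }

    for toks in sents:
        for i, tok in enumerate(toks):
            if tok != loc:
                continue
            # N-gram patterns: the words immediately left and right.
            left = toks[max(0, i - window):i]
            right = toks[i + 1:i + 1 + window]
            features["left=" + " ".join(left)] = 1
            features["right=" + " ".join(right)] = 1
            if event_words & set(toks):
                features["shares_sentence_with_event_word"] = 1
    return features


if __name__ == "__main__":
    article = (
        "Protesters clashed with police in Aleppo on Monday. "
        "Officials in Damascus condemned the violence. "
        "Aleppo has seen weeks of unrest."
    )
    for candidate in ("Aleppo", "Damascus"):
        print(candidate, mention_features(article, candidate, {"clashed", "attack"}))
```

In this toy article, both place names appear, but only one shares a sentence with an event word; features of this kind are what would let a trained model separate the location where the event happened from locations that are merely mentioned.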

Type
Original Articles
Copyright
© The European Political Science Association 2018 

Footnotes

*

Sophie J. Lee, Ph.D. Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA (sophie.jiseon.lee@gmail.com). Howard Liu, Ph.D. Candidate, Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA (haoliu.howard@gmail.com). Michael D. Ward, Professor of Political Science, Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA (michael.don.ward@gmail.com). The authors would like to thank John Beieler, Patrick Brandt, Andrew B. Hall, Andrew Halterman, Jan H. Pierskalla, and Philip A. Schrodt, as well as members of Wardlab for their insights and comments on this project. The editors and reviewers of this journal provided helpful suggestions. M.W. acknowledges support from National Science Foundation (NSF) Award 1259266. To view supplementary material for this article, please visit https://doi.org/10.1017/psrm.2018.23
