
Lost in Space: Geolocation in Event Data

Sophie J. Lee, Howard Liu and Michael D. Ward


Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that classifies each location mention as either correct or incorrect. We extract contextual information from texts, namely N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use the fitted model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders, even in a case added post hoc to test the generality of the algorithm.
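The feature-extraction and prediction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sentences and labels are invented toy data, the names `mention_features` and `NaiveBayes` are hypothetical, and a small naive Bayes classifier stands in for whatever supervised model the article actually fits. The sketch covers only a single classification step (contextual features in, correct/incorrect label out) and does not attempt to reproduce the two-stage design.

```python
from collections import Counter
import math


def mention_features(tokens, index, window=2):
    """Bag of contextual features for the location word at tokens[index]."""
    lowered = [t.lower() for t in tokens]
    word = lowered[index]
    feats = Counter()
    # Mention frequency: how often the location word occurs in the text (capped at 3).
    feats[f"freq={min(lowered.count(word), 3)}"] += 1
    # N-gram context: the words just before and just after the mention.
    for offset in range(1, window + 1):
        if index - offset >= 0:
            feats[f"prev{offset}={lowered[index - offset]}"] += 1
        if index + offset < len(lowered):
            feats[f"next{offset}={lowered[index + offset]}"] += 1
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (a stand-in classifier)."""

    def fit(self, bags, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for bag, label in zip(bags, labels):
            self.counts[label].update(bag)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, bag):
        def log_score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / sum(self.prior.values()))
            for feat, n in bag.items():
                score += n * math.log((self.counts[c][feat] + 1) / total)
            return score
        return max(self.classes, key=log_score)


# Invented toy training set: (tokens, index of location word, is_event_location).
train = [
    ("A bomb exploded in Aleppo on Monday".split(), 4, True),
    ("Protesters clashed with police in Cairo".split(), 5, True),
    ("officials in Washington condemned the attack".split(), 2, False),
    ("a spokesman in London said talks failed".split(), 3, False),
]

model = NaiveBayes().fit(
    [mention_features(tokens, i) for tokens, i, _ in train],
    [label for _, _, label in train],
)

# Predict for an unseen mention ("Kano" at index 5).
test_tokens = "Gunmen attacked a market in Kano on Friday".split()
prediction = model.predict(mention_features(test_tokens, 5))
print("Kano is the event location:", prediction)
```

In this toy setup the classifier can only pick up on cues such as a location word being followed by a date word rather than a reporting verb. A realistic application would need a labeled corpus and a richer feature set along the lines the abstract lists.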




Sophie J. Lee, Ph.D., Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA.
Howard Liu, Ph.D. Candidate, Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA.
Michael D. Ward, Professor of Political Science, Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA.

The authors would like to thank John Beieler, Patrick Brandt, Andrew B. Hall, Andrew Halterman, Jan H. Pierskalla, and Philip A. Schrodt, as well as members of Wardlab for their insights and comments on this project. The editors and reviewers of this journal provided helpful suggestions. M.W. acknowledges support from National Science Foundation (NSF) Award 1259266. To view supplementary material for this article, please visit


