Automated dictionary generation for political event coding

Benjamin J. Radford

doi:10.1017/psrm.2019.1

Published online by Cambridge University Press: 27 March 2019

Benjamin J. Radford*
Affiliation: LevelUp Research, LLC, Arlington, Virginia, US
*Corresponding author. Email: benjamin.radford@gmail.com

Abstract

Event data provide high-resolution and high-volume information about political events and have supported a variety of research efforts across fields within and beyond political science. While these datasets are machine coded from vast amounts of raw text input, the necessary dictionaries require substantial prior knowledge and human effort to produce and update, effectively limiting the application of automated event-coding solutions to those domains for which dictionaries already exist. I introduce a novel method for generating dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and machine learning to reduce the prior knowledge and researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describe that domain. I evaluate the method with the production of a novel event dataset on cybersecurity incidents.
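The abstract does not say which technique is used to grow the dictionary, so the following is only a minimal sketch of one common approach, assuming seed terms are expanded by cosine similarity in a word-embedding space. The function name `expand_dictionary`, the similarity threshold, and the hand-made three-dimensional vectors are all hypothetical and stand in for embeddings that would normally be learned from a large text corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_dictionary(seeds, embeddings, threshold=0.9):
    """Return non-seed terms whose vector lies close to the mean seed vector.

    Hypothetical helper: the threshold and the centroid strategy are
    illustrative assumptions, not the article's documented procedure.
    """
    seed_vecs = [embeddings[t] for t in seeds if t in embeddings]
    if not seed_vecs:
        return []
    dim = len(seed_vecs[0])
    centroid = [sum(v[i] for v in seed_vecs) / len(seed_vecs) for i in range(dim)]
    return sorted(
        term
        for term, vec in embeddings.items()
        if term not in seeds and cosine(vec, centroid) >= threshold
    )


# Toy, hand-made vectors (assumption): real embeddings would be trained on text.
TOY_EMBEDDINGS = {
    "hacked": [0.9, 0.1, 0.0],
    "breached": [0.85, 0.15, 0.05],
    "compromised": [0.8, 0.2, 0.1],
    "negotiated": [0.1, 0.9, 0.1],
    "praised": [0.0, 0.8, 0.3],
}

# A small sample dictionary of two verbs yields one new candidate term.
candidates = expand_dictionary({"hacked", "breached"}, TOY_EMBEDDINGS)
print(candidates)
```

In a workflow of this kind, a researcher would still review the candidate terms before adding them to the dictionary, since embedding similarity can surface related words that do not describe the target event type.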

Keywords

Mathematical modeling; measurement; text and content analysis

Type: Original Article
Information: Political Science Research and Methods, Volume 9, Issue 1, January 2021, pp. 157-171

DOI: https://doi.org/10.1017/psrm.2019.1
Copyright: Copyright © The European Political Science Association 2019

Dataset: https://doi.org/10.7910/DVN/N3Y3MF

Supplementary material: Online Appendix (PDF, 225.1 KB)
