Automated dictionary generation for political event coding

  • Benjamin J. Radford
Abstract

Event data provide high-resolution and high-volume information about political events and have supported a variety of research efforts across fields within and beyond political science. While these datasets are machine-coded from vast amounts of raw text input, the necessary dictionaries require substantial prior knowledge and human effort to produce and update, effectively limiting the application of automated event-coding solutions to those domains for which dictionaries already exist. I introduce a novel method for generating dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and machine learning to reduce the prior knowledge and researcher-hours required to go from defining a new domain of interest to producing structured event data that describe that domain. I evaluate the method with the production of a novel event dataset on cybersecurity incidents.

*Corresponding author. Email: benjamin.radford@gmail.com
Political Science Research and Methods
  • ISSN: 2049-8470
  • EISSN: 2049-8489