Datasets for generic relation extraction*

B. HACHEY; C. GROVER; R. TOBIN

doi:10.1017/S1351324911000106


Published online by Cambridge University Press: 09 March 2011


B. HACHEY: Affiliation:
Language Technology Group, Macquarie University, NSW 2109, Australia email: bhachey@cmcrc.com
C. GROVER: Affiliation:
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland email: C.Grover@ed.ac.uk; R.Tobin@ed.ac.uk
R. TOBIN: Affiliation:
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland email: C.Grover@ed.ac.uk; R.Tobin@ed.ac.uk


Abstract

A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g. PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format, such as a relational database or RDF triplestore, that can be used more effectively for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic systems for relation extraction in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and different notions of what constitutes a relation. We describe the preparation of corpora for comparative evaluation of relation extraction across domains, based on the publicly available ACE 2004, ACE 2005 and BioInfer data sets. We present a common document type that uses token standoff and includes detailed linguistic markup, while maintaining all information in the original annotation. The subsequent reannotation process normalises the ACE and BioInfer data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. For the ACE data, we describe an automatic process that converts many relations involving nested, nominal entity mentions to relations involving non-nested, named or pronominal entity mentions. For example, the first entity is mapped from ‘one’ to ‘Amidu Berry’ in the membership relation described in ‘Amidu Berry, one half of PBS’. Moreover, we describe a comparably reannotated version of the BioInfer corpus that flattens nested relations, maps part-whole to part-part relations and maps n-ary to binary relations. Finally, we summarise experiments that compare approaches to generic relation extraction, a knowledge discovery task that uses minimally supervised techniques to achieve maximally portable extractors. These experiments illustrate the utility of the corpora.
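The two normalisation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mention` class, the hand-supplied `named_equivalent` lookup and the all-pairs expansion in `nary_to_binary` are hypothetical simplifications introduced here, and the abstract does not specify how the real process finds the named mention or pairs up n-ary arguments.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """Token-standoff mention: refers to tokens by index, not by copied text."""
    start: int  # index of first token (inclusive)
    end: int    # index of last token (inclusive)
    kind: str   # "named", "nominal" or "pronominal"

    def text(self, tokens):
        return " ".join(tokens[self.start:self.end + 1])


def renormalise(relation, named_equivalent):
    """Swap each argument for its named equivalent when one is known.

    `named_equivalent` is an assumed, hand-built lookup; the real corpus
    conversion derives such links automatically.
    """
    label, arg1, arg2 = relation
    return (label,
            named_equivalent.get(arg1, arg1),
            named_equivalent.get(arg2, arg2))


def nary_to_binary(label, args):
    """Naive assumption: expand an n-ary relation into all unordered pairs."""
    return [(label, a, b) for a, b in combinations(args, 2)]


# The example sentence fragment from the abstract, tokenised by hand.
tokens = ["Amidu", "Berry", ",", "one", "half", "of", "PBS"]
berry = Mention(0, 1, "named")
one = Mention(3, 3, "nominal")
pbs = Mention(6, 6, "named")

original = ("membership", one, pbs)
renormalised = renormalise(original, {one: berry})
print(renormalised[1].text(tokens))  # prints "Amidu Berry"

# A three-argument relation becomes three binary relations.
print(len(nary_to_binary("interaction", ["A", "B", "C"])))  # prints 3
```

Storing token indices rather than strings is what lets several annotation layers point at the same text without altering it, which is the sense in which standoff markup can preserve the original annotation.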

Type: Articles
Information: Natural Language Engineering, Volume 18, Issue 1, January 2012, pp. 21–59

DOI: https://doi.org/10.1017/S1351324911000106
Copyright: Copyright © Cambridge University Press 2011


