DEXTER: A workbench for automatic term extraction with specialized corpora†

CARLOS PERIÑAN-PASCUAL

doi:10.1017/S1351324917000365

Published online by Cambridge University Press: 05 October 2017

CARLOS PERIÑAN-PASCUAL*: Affiliation:
Applied Linguistics Department, Universitat Politècnica de València, Paranimf, 1; 46730 Gandia, Valencia, Spain

Abstract

Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, some outstanding issues still need to be addressed when building term extractors, particularly those intended to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, in contrast to the elaborate morphosyntactic patterns proposed in most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded in the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.

Type: Articles
Information: Natural Language Engineering, Volume 24, Issue 2, March 2018, pp. 163–198

DOI: https://doi.org/10.1017/S1351324917000365
Copyright: Copyright © Cambridge University Press 2017

Footnotes

† Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.
