Learning keyphrases from corpora and knowledge models

R. Silveira; V. Furtado; V. Pinheiro

doi:10.1017/S1351324919000342

Published online by Cambridge University Press: 10 September 2019

Affiliations:
R. Silveira*: Eixo de Informação e Comunicação, Instituto Federal de Educação do Ceará (IFCE), Brazil
V. Furtado: Programa de Pós-graduação em Informática Aplicada, Universidade de Fortaleza (UNIFOR), Brazil
V. Pinheiro: Programa de Pós-graduação em Informática Aplicada, Universidade de Fortaleza (UNIFOR), Brazil
*Corresponding author. Email: raquel.vsilveira@hotmail.com

Abstract

Keyphrase extraction systems traditionally use classification algorithms and do not consider that some keyphrases may not appear in the text, which limits the accuracy of such algorithms a priori. In this work, we propose to improve the accuracy of these systems with inferential mechanisms that use a knowledge representation model, including symbolic models of knowledge bases and distributional semantics, to expand the set of keyphrase candidates submitted to the classification algorithm with terms that are not in the text (not-in-text terms). Our basic assumption is that not-in-text terms have a semantic relationship with terms that are in the text. To represent this relationship, we define two new features that serve as input to the classification algorithms. The first feature captures the discriminative power of the inferred not-in-text terms. The intuition is that good keyphrase candidates are those that are deduced from various textual terms in a specific document and that are not often deduced in other documents. The second feature represents the descriptive strength of a not-in-text candidate. We argue that not-in-text keyphrases must have a strong semantic relationship with the text and that the strength of this relationship can be measured in a way similar to popular metrics such as TFxIDF. The proposed method was compared with state-of-the-art systems on five corpora, and the results show that it significantly improves automatic keyphrase extraction by addressing the limitation of extracting keyphrases absent from the text.
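
The intuition behind the first feature can be illustrated with a small sketch. This is not the authors' formulation, which the abstract does not give. It assumes a hypothetical knowledge model represented as a dictionary `related` that maps each in-text term to the terms it allows us to infer, and it scores a not-in-text candidate in a TFxIDF-like way: the number of in-text terms of a document from which the candidate is inferred, multiplied by the log of the inverse fraction of documents in which it is inferred at all. The function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def inferred_candidates(doc_terms, related):
    """Count, for each not-in-text term, how many distinct in-text terms infer it.

    `related` is a hypothetical stand-in for a knowledge model:
    a dict mapping an in-text term to an iterable of inferred terms.
    """
    in_text = set(doc_terms)
    counts = Counter()
    for term in in_text:
        for candidate in related.get(term, ()):
            if candidate not in in_text:  # keep only not-in-text terms
                counts[candidate] += 1
    return counts


def discrimination_scores(docs, related):
    """TFxIDF-like score per document (an assumed formula, for illustration).

    score = (in-text terms of this document inferring the candidate)
            * log(number of documents / documents where it is inferred)
    """
    per_doc = [inferred_candidates(doc, related) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())  # one count per document
    return [
        {cand: k * math.log(n_docs / doc_freq[cand]) for cand, k in counts.items()}
        for counts in per_doc
    ]


# Toy data, invented for this example.
related = {
    "neuron": {"neural network"},
    "backpropagation": {"neural network"},
    "layer": {"neural network", "cake"},
    "oven": {"cake"},
    "flour": {"cake"},
}
docs = [
    ["neuron", "backpropagation", "layer"],
    ["oven", "flour"],
    ["tax", "law"],
]
scores = discrimination_scores(docs, related)
```

In the first toy document, "neural network" is inferred from three in-text terms and from no other document, so it outranks "cake", which is inferred from a single term there and also appears as a candidate in the second document.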

Keywords

Keyphrases extraction; Keyphrases absent from the text; Inference of keyphrases; Knowledge models

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 3, May 2020, pp. 293–318

DOI: https://doi.org/10.1017/S1351324919000342
Copyright: © Cambridge University Press 2019
