
Learning keyphrases from corpora and knowledge models

Published online by Cambridge University Press: 10 September 2019

R. Silveira*
Affiliation:
Eixo de Informação e Comunicação, Instituto Federal de Educação do Ceará (IFCE), Brazil
V. Furtado
Affiliation:
Programa de Pós-graduação em Informática Aplicada, Universidade de Fortaleza (UNIFOR), Brazil
V. Pinheiro
Affiliation:
Programa de Pós-graduação em Informática Aplicada, Universidade de Fortaleza (UNIFOR), Brazil
*Corresponding author. Email: raquel.vsilveira@hotmail.com

Abstract

Keyphrase extraction systems traditionally rely on classification algorithms and ignore the fact that some keyphrases may not appear in the text, which limits the accuracy of such algorithms a priori. In this work, we propose to improve the accuracy of these systems with inferential mechanisms that use a knowledge representation model, including symbolic knowledge bases and distributional semantics, to expand the set of keyphrase candidates submitted to the classification algorithm with terms that do not occur in the text (not-in-text terms). Our basic assumption is that not-in-text terms have a semantic relationship with terms that are in the text. To represent this relationship, we define two new features that serve as input to the classification algorithms. The first feature captures the discriminative power of the inferred not-in-text terms. The intuition is that good keyphrase candidates are those that are deduced from many textual terms in a specific document and are rarely deduced in other documents. The second feature represents the descriptive strength of a not-in-text candidate. We argue that not-in-text keyphrases must have a strong semantic relationship with the text and that the strength of this relationship can be measured in a way similar to popular metrics such as TFxIDF. The proposed method was compared with state-of-the-art systems on five corpora, and the results show that it significantly improves automatic keyphrase extraction by addressing the limitation of extracting keyphrases that are absent from the text.
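The intuition behind the discrimination feature can be illustrated with a small sketch. This is not the formulation used in the article, which the abstract does not give. It assumes a TF-IDF-style score in which a not-in-text candidate is rewarded for being inferred from many terms of one document and penalized for being inferred in many documents. The `RELATED` dictionary is a hypothetical toy stand-in for a knowledge model, and the function names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy knowledge model: in-text term -> semantically related terms.
RELATED = {
    "neuron": {"neural network", "brain"},
    "backpropagation": {"neural network", "machine learning"},
    "gradient": {"machine learning", "optimization"},
    "goal": {"football", "sport"},
    "referee": {"football", "sport"},
    "stadium": {"sport"},
}


def inferred_counts(doc_terms):
    """Count how many in-text terms of one document infer each not-in-text term."""
    in_text = set(doc_terms)
    counts = Counter()
    for term in in_text:
        for candidate in RELATED.get(term, ()):
            if candidate not in in_text:  # keep only terms absent from the text
                counts[candidate] += 1
    return counts


def score_candidates(docs):
    """Assumed TF-IDF analogue: (inferring terms / doc terms) * log(N / docs inferring it)."""
    per_doc = [inferred_counts(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())  # one count per document that infers the term
    scores = []
    for doc, counts in zip(docs, per_doc):
        n_terms = len(set(doc)) or 1
        scores.append({
            cand: (c / n_terms) * math.log(n_docs / doc_freq[cand])
            for cand, c in counts.items()
        })
    return scores


docs = [
    ["neuron", "backpropagation", "gradient"],
    ["goal", "referee", "stadium"],
    ["gradient", "stadium"],
]
scores = score_candidates(docs)
for doc_scores in scores:
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

In the second toy document, "sport" is inferred from all three terms but is also inferred in another document, so under this assumed scoring it ranks below "football", which is inferred from two terms and in no other document.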

Type
Article
Copyright
© Cambridge University Press 2019

