
Parameter-efficient feature-based transfer for paraphrase identification

Published online by Cambridge University Press: 19 December 2022

Xiaodong Liu*
Affiliation:
Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido, Japan
Rafal Rzepka
Affiliation:
Faculty of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido, Japan
Kenji Araki
Affiliation:
Faculty of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido, Japan
*Corresponding author. E-mail: xiaodongliu@ist.hokudai.ac.jp

Abstract

Paraphrase Identification (PI), the NLP task of determining whether a sentence pair is semantically equivalent, has been approached in many ways. Traditional approaches rely mainly on unsupervised learning and feature engineering, which are computationally inexpensive but yield only moderate performance by today's standards. Seeking a method that preserves the low computational cost of traditional approaches while achieving better task performance, we investigate neural network-based transfer learning and find that this goal can be met by using parameters more efficiently in feature-based transfer. To this end, we propose a pre-trained task-specific architecture whose fixed parameters can be shared by multiple classifiers, each adding only a small number of parameters of its own. Consequently, the only remaining cost of parameter updates comes from classifier tuning: the features output by the architecture, combined with lexical overlap features, are fed into a single classifier. The pre-trained task-specific architecture can also be applied to natural language inference and semantic textual similarity tasks. This design consumes little computational and memory overhead per task and is conducive to power-efficient continual learning. Experimental results show that our method is competitive with adapter-BERT (a parameter-efficient fine-tuning approach) on some tasks while using only 16% of the trainable parameters and saving 69–96% of the parameter-update time.
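To make the shared-encoder workflow concrete, the sketch below illustrates feature-based transfer under two assumptions: a frozen pre-trained encoder produces a fixed feature vector per sentence pair, and a Jaccard word overlap stands in for the lexical overlap features. The function names (encode_pair, lexical_overlap, tune_classifier) are hypothetical and this is a minimal illustration, not the authors' implementation.

    # Minimal sketch of feature-based transfer with a frozen shared encoder.
    # encode_pair is an assumed stand-in for the pre-trained task-specific
    # architecture; its parameters are never updated here.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def lexical_overlap(s1: str, s2: str) -> float:
        """Jaccard word overlap, one plausible lexical overlap feature."""
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

    def build_features(encode_pair, pairs):
        """Concatenate frozen-encoder features with the overlap score.
        Because the encoder stays fixed, many task classifiers can reuse
        the same shared parameters."""
        return np.array([
            np.append(encode_pair(s1, s2), lexical_overlap(s1, s2))
            for s1, s2 in pairs
        ])

    def tune_classifier(encode_pair, train_pairs, labels):
        """Only this lightweight classifier is trained; freezing the
        encoder is where the parameter and time savings come from."""
        clf = LogisticRegression(max_iter=1000)
        clf.fit(build_features(encode_pair, train_pairs), labels)
        return clf

Because classifier tuning touches only the small appended parameter set, adding a new task (e.g., natural language inference) under this scheme means training another such classifier against the same frozen features, which is also why the approach lends itself to continual learning without catastrophic forgetting of earlier tasks.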

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

