
SEN: A subword-based ensemble network for Chinese historical entity extraction

Published online by Cambridge University Press:  22 December 2022

Chengxi Yan
Affiliation:
School of Information Resource Management, Renmin University of China, Beijing, China; Research Center for Digital Humanities of RUC, Beijing, China
Ruojia Wang*
Affiliation:
School of Management, Beijing University of Chinese Medicine, Beijing, China
Xiaoke Fang
Affiliation:
College of Applied Arts and Science, Beijing Union University, Beijing, China
*Corresponding author. E-mail: ruojia_wang@qq.com

Abstract

Understanding various historical entity information (e.g., persons, locations, and time) plays a very important role in reasoning about the development of historical events. With increasing interest in digital humanities and natural language processing, named entity recognition (NER) provides a feasible solution for automatically extracting these entities from historical texts, especially in Chinese historical research. However, previous approaches are domain-specific, suffer from relatively low accuracy, and lack interpretability, which hinders the development of NER in Chinese history. In this paper, we propose a new hybrid deep learning model called the “subword-based ensemble network” (SEN), which incorporates subword information and a novel attention fusion mechanism. Experiments on a massive self-built Chinese historical corpus, CMAG, show that SEN achieves the best performance among several advanced models, with 93.87% F1-micro and 89.70% F1-macro. Further investigation reveals that SEN generalizes well for NER on Chinese historical texts: it is not only relatively insensitive to categories with fewer annotated labels (e.g., OFI) but also accurately captures diverse local and global semantic relations. Our research demonstrates the effectiveness of integrating subword information with attention fusion, which provides an inspiring solution for the practical use of entity extraction in the Chinese historical domain.
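The abstract reports both F1-micro and F1-macro; the distinction matters because macro averaging weights every entity category equally, so a model that is relatively insensitive to sparsely annotated categories such as OFI keeps its macro score high, whereas micro averaging is dominated by frequent categories. The following minimal sketch (not taken from the paper; the category names and counts are invented for illustration) shows the difference:

```python
# Minimal sketch: micro- vs. macro-averaged F1 over entity categories.
# The per-category counts below are hypothetical, not the paper's results.

def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-category counts: (true positives, false positives, false negatives).
counts = {
    "PER": (900, 40, 50),
    "LOC": (700, 60, 70),
    "TIM": (500, 30, 40),
    "OFI": (60, 15, 25),   # a rare category drags macro-F1 down if handled poorly
}

# Macro-F1: average of per-category F1 scores (each category weighted equally).
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro-F1: F1 computed from counts pooled over all categories.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"micro-F1 = {micro_f1:.4f}, macro-F1 = {macro_f1:.4f}")
```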

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

