Out-domain Chinese new word detection with statistics-based character embedding

Yuzhi Liang; Min Yang; Jia Zhu; S. M. Yiu

doi:10.1017/S1351324918000463

Published online by Cambridge University Press: 11 February 2019

Affiliations:
Yuzhi Liang: Department of Information Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
Min Yang: Frontier Science and Technology Research Centre, Shenzhen Institutes of Advanced Technology, Shenzhen, China
Jia Zhu*: Department of Computer Science, South China Normal University, Guangzhou, China
S. M. Yiu: Department of Computer Science, The University of Hong Kong, Hong Kong, China
*Corresponding author. Email: jzhu@m.scnu.edu.cn

Abstract

Unlike English and other Western languages, many Asian languages, such as Chinese and Japanese, do not delimit words with spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be treated as a character-level tagging problem, analogous to part-of-speech (POS) tagging: a corpus is segmented by assigning each character a label that indicates its position in a word (e.g., “B” for the beginning of a word and “E” for the end). Chinese word segmentation appears to be well studied; machine learning models such as conditional random fields (CRFs) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which high-quality in-domain training data are not available. One example of an out-domain application is new word detection in Chinese microblogs, for which high-quality corpora are scarce. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND), in which each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
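
The character-labelling view of segmentation described in the abstract can be illustrated with a minimal sketch. The abstract names only the “B” and “E” labels; the “M” (middle) and “S” (single-character word) labels used here are an assumption based on the common four-tag scheme, and the function names `words_to_tags` and `tags_to_words` are hypothetical. This is not the EL method or the DICND model, only the label encoding that such taggers predict.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one position label per character.

    Assumed label set: B = word beginning, M = word middle,
    E = word end, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            raise ValueError("words must be non-empty strings")
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(word) - 2))  # zero or more middle characters
            tags.append("E")
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the segmentation from characters and their position labels."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # these labels close the current word
            words.append(current)
            current = ""
    if current:  # tolerate a label sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["我", "喜欢", "微博客"]
    labels = words_to_tags(segmented)
    print(labels)
    print(tags_to_words("".join(segmented), labels))
```

Under this encoding, a tagger such as a CRF or bi-directional LSTM only has to predict one label per character, and the word boundaries follow from the positions of the “E” and “S” labels.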

Keywords

Chinese character embedding; Chinese new word detection; Chinese word boundary detection

Type: Article
Information: Natural Language Engineering, Volume 25, Issue 2, March 2019, pp. 239–255

DOI: https://doi.org/10.1017/S1351324918000463
Copyright: © Cambridge University Press 2019
