Out-domain Chinese new word detection with statistics-based character embedding

  • Yuzhi Liang, Min Yang, Jia Zhu and S. M. Yiu

Abstract

Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words with spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be treated as a tagging problem similar to part-of-speech (POS) tagging: a corpus is segmented by assigning each character a label that indicates its position in a word (e.g., “B” for the beginning of a word and “E” for the end). Chinese word segmentation appears to be well studied, and machine learning models such as conditional random fields (CRF) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which high-quality in-domain training data are not available. One example of an out-domain application is new word detection in Chinese microblogs, for which high-quality corpora are scarce. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND), in which each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
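The character-labeling view of segmentation described in the abstract can be illustrated with a minimal sketch. The abstract names only the “B” and “E” labels; the “M” (middle) and “S” (single-character word) labels, the function names, and the sample sentence below are assumptions added for illustration and are not taken from the paper.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Assign one position label per character of an already segmented sentence.

    Assumed label set: B = beginning, M = middle, E = end, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Recover the segmentation from a character sequence and its labels."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A word closes on "E" (end of a multi-character word) or "S".
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Keep any trailing characters left open by an incomplete label sequence.
        words.append(current)
    return words


# Hypothetical example sentence, segmented by hand.
segmented = ["我", "喜欢", "自然语言"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
print(tags)
print(restored)
```

Under this scheme a tagger such as a CRF or a bi-directional LSTM only has to predict one label per character; the segmentation is then read off the label sequence as in `tags_to_words`.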

Corresponding author

*Corresponding author. Email: jzhu@m.scnu.edu.cn

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering