
Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora

Published online by Cambridge University Press: 09 March 2020

Renkui Hou
Affiliation:
College of Humanities, Guangzhou University, Guangzhou, China; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Chu-Ren Huang*
Affiliation:
Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
*Corresponding author. E-mail: churen.huang@polyu.edu.hk

Abstract

This paper proposes a robust approach to the identification of similar languages that combines text classification with correspondence analysis. In particular, we propose to use readily available information on clauses and word length distribution to model similar languages. The modeling and classification are based on the hypothesis that languages are self-adaptive complex systems and hence can be classified by dynamic features describing the system, especially the distributional relations among its constituents. For similar languages whose grammatical differences are often subtle, classification based on dynamic system features should be more effective. To test this hypothesis, we considered both regional and genre varieties of Mandarin Chinese for classification. The data are extracted from two comparable balanced corpora to minimize possible confounding factors. The two corpora are the Sinica Corpus from Taiwan and the Lancaster Corpus of Mandarin Chinese from Mainland China, and the two genres are reportage and review. Our text classification and correspondence analysis results show that the linguistically felicitous two-level constituency model, which combines power functions between words and clauses, effectively classifies the two varieties of Chinese for both genres. In addition, we found that genre does have a compounding effect on the classification of regional varieties. In particular, reportage in the two varieties is more likely to be classified than review, corroborating the complex system view of language variation. That is, language variation and change typically do not take place evenly across the complete language system. This further supports our hypothesis that dynamic complex system features, such as the power functions captured by the Menzerath–Altmann law, provide effective models for the classification of similar languages.
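
The power functions mentioned in the abstract can be illustrated with the commonly used truncated form of the Menzerath–Altmann law, y = a·x^b, where x is the length of a construct (for example, a clause measured in words) and y is the mean length of its constituents (for example, words measured in characters). The abstract does not state how the parameters are estimated, so the following is only a minimal sketch that assumes ordinary least squares on log-transformed values. The function name `fit_power_law` and the data are hypothetical, and the data are synthetic rather than drawn from the Sinica Corpus or the Lancaster Corpus of Mandarin Chinese.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y).

    Assumption: this log-log regression stands in for whatever fitting
    procedure the paper actually uses, which the abstract does not state.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic data generated from an exact power law (a = 2.0, b = -0.1),
# used only to show that the fit recovers the parameters.
clause_lengths = [1, 2, 4, 8]
mean_word_lengths = [2.0 * n ** -0.1 for n in clause_lengths]
a, b = fit_power_law(clause_lengths, mean_word_lengths)
```

Under this reading, each text would contribute a fitted pair (a, b) per constituency level, and such pairs could then serve as features for classification. That use of the parameters is an interpretation of the abstract, not a description of the authors' implementation.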

Type
Article
Copyright
© Cambridge University Press 2020
