Comparing Corpus-Driven and Corpus-Based Approaches to Diachronic Variation

doi:10.1017/9781108589314.011

10 - Comparing Corpus-Driven and Corpus-Based Approaches to Diachronic Variation

Grammatical Changes in Late Modern and Present-Day English

from Part IV - Applications of Classification-Based Approaches

Published online by Cambridge University Press: 06 May 2022

Gerold Schneider

Edited by Ole Schützler and Julia Schlüter


Ole Schützler: Universität Leipzig
Julia Schlüter: Universität Bamberg


Summary

Focusing on grammatical changes in Late Modern and Present-Day English, the author applies a corpus-driven method to texts from two diachronic corpora, A Representative Corpus of Historical English Registers (ARCHER) and the Corpus of Historical American English (COHA). He compares his findings to those returned by more conventional corpus-based methods, which can be characterized as hypothesis-driven. To this end, the study employs automated profiling of large feature sets, such as word- and POS-based mono-, bi- and trigrams, chunks, syntactic dependency labels and measures of constituent order and length. The derived feature profiles are combined in a supervised classification task with a given division of texts into earlier and later corpus subperiods to reveal patterns of over- and underuse. Structures that are profiled as over- or under-represented in the diachronic subsections are then browsed for grammatical changes that may have been missed by previous research. According to the author, an advantage of such approaches is that they are theory-neutral and may generate novel hypotheses for investigation. These may then serve as input to further corpus-based approaches.
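
The profiling-plus-classification idea described in the summary can be illustrated with a minimal sketch. It is not the chapter's actual pipeline: it assumes word n-grams only (no POS tags, chunks or dependency labels), uses a plain logistic regression written from scratch as a stand-in for whatever classifier the author employs, and runs on four invented toy sentences rather than ARCHER or COHA texts. The function names `ngram_profile` and `train_logreg` are hypothetical. Features with strongly negative weights are those the toy model associates with the earlier subperiod, and strongly positive weights point to the later one.

```python
import math
from collections import Counter

def ngram_profile(text, n_max=3):
    """Relative frequencies of word n-grams (n = 1..n_max) in one text."""
    tokens = text.lower().split()
    counts = Counter(
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    )
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.items()}

def train_logreg(profiles, labels, epochs=300, lr=1.0):
    """Logistic regression by batch gradient descent.

    labels: 0 = earlier subperiod, 1 = later subperiod.
    Returns a dict mapping each feature to its learned weight.
    """
    weights, bias = {}, 0.0
    n = len(profiles)
    for _ in range(epochs):
        grad, grad_bias = Counter(), 0.0
        for feats, y in zip(profiles, labels):
            z = bias + sum(weights.get(g, 0.0) * v for g, v in feats.items())
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_bias += err
            for g, v in feats.items():
                grad[g] += err * v
        bias -= lr * grad_bias / n
        for g, g_val in grad.items():
            weights[g] = weights.get(g, 0.0) - lr * g_val / n
    return weights

# Invented toy sentences, NOT taken from ARCHER or COHA.
earlier = ["he is come to the house", "she hath seen the man upon the road"]
later = ["he has come to the house", "she has seen the man on the road"]

profiles = [ngram_profile(t) for t in earlier + later]
labels = [0] * len(earlier) + [1] * len(later)
weights = train_logreg(profiles, labels)

# Rank features from "most typical of earlier texts" to "most typical of later texts".
ranked = sorted(weights.items(), key=lambda kv: kv[1])
print("associated with earlier texts:", [g for g, _ in ranked[:3]])
print("associated with later texts:  ", [g for g, _ in ranked[-3:]])
```

In this toy setting, n-grams that occur only in the earlier sentences (such as "hath" or "is come") receive negative weights, while those restricted to the later sentences (such as "has") receive positive ones. A realistic study would need far more data, regularization and held-out evaluation before any weight could be read as evidence of change.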

Keywords

data-driven; machine-learning; supervised classification; feature weight; POS tagging; parsing

Type: Chapter
Information: Data and Methods in Corpus Linguistics: Comparative Approaches, pp. 291–322

DOI: https://doi.org/10.1017/9781108589314.011

Publisher: Cambridge University Press

Print publication year: 2022


