
1 - Comparing Standard Reference Corpora and Google Books Ngrams

Strengths, Limitations and Synergies in the Contrastive Study of Variable h- in British and American English

from Part I - Corpus Dimensions and the Viability of Methodological Approaches

Published online by Cambridge University Press:  06 May 2022

Ole Schützler
Affiliation:
Universität Leipzig
Julia Schlüter
Affiliation:
Universität Bamberg

Summary

This chapter compares two standard reference corpora, the British National Corpus and the Corpus of Contemporary American English, with the multi-billion-word database of Google Books Ngrams, which, despite its allure, has so far been used in few systematic linguistic studies. The case study concerns indefinite article allomorphy (a vs an) as an orthographic cue to the phonological strength of ‹h›-onsets in British and American English. As expected, the size advantage of the Ngrams database plays out in larger type and token counts, more stable estimates and fewer distortions due to data sparsity. However, because its metadata are extremely limited (to year and variety), a fully accountable analysis is not feasible. The case study illustrates how richly annotated corpora can shed light on potential disturbances arising from two sources: genre differences and between-author variability. A sensitivity analysis offers some degree of reassurance when extending the analysis to the Ngrams database. In this way, the authors demonstrate that the strengths and limitations of corpora and big data resources can, with due caution, be counterbalanced to answer questions of linguistic interest.
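The point about sample size and stable estimates can be illustrated with a minimal sketch. It is not the authors' procedure: the counts are invented, the function names are hypothetical, and the Wilson score interval is just one common way to put an uncertainty interval on a proportion. The sketch computes the share of an (as opposed to a) before a single h-initial word and shows that the same observed share is far less certain when it rests on few tokens.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def an_share(count_an: int, count_a: int) -> Tuple[float, float, float]:
    """Share of 'an' among all 'a'/'an' tokens before one word, with its interval."""
    total = count_an + count_a
    low, high = wilson_interval(count_an, total)
    return count_an / total, low, high


# Invented counts for one h-initial word (illustration only, not corpus data):
# a small corpus sample versus a much larger, Ngrams-sized one.
small = an_share(count_an=3, count_a=7)
large = an_share(count_an=3000, count_a=7000)

for label, (share, low, high) in (("small", small), ("large", large)):
    print(f"{label}: share of 'an' = {share:.3f}, interval = [{low:.3f}, {high:.3f}]")
```

Both samples have the same observed share of an, but the interval for the small sample is much wider. This is the sense in which sparse data yield less stable estimates. The sketch says nothing about genre or author effects, which need the metadata that only the annotated corpora provide.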

Type: Chapter
Information: Data and Methods in Corpus Linguistics: Comparative Approaches, pp. 17–45
Publisher: Cambridge University Press
Print publication year: 2022


References

Further Reading

Cochran, William G. 1983. Planning and Analysis of Observational Studies. New York: John Wiley & Sons. Chapters 1, 2, 3, 4, 6.
Greenland, Sander, Mansournia, Mohammad Ali and Altman, Douglas G. 2016. Sparse Data Bias: A Problem Hiding in Plain Sight. British Medical Journal 352(i1981). https://doi.org/10.1136/bmj.i1981.
Johnson, Daniel E. 2014. Progress in Regression: Why Natural Language Data Calls for Mixed-Effects Models. Unpublished manuscript. www.danielezrajohnson.com/johnson_2014b.pdf.
Winter, Bodo. 2020. Statistics for Linguistics. New York: Routledge. Chapters 14 and 15.

