Book contents
- Frontmatter
- Contents
- List of Figures
- List of Tables
- List of Contributors
- Preface
- Acknowledgements
- Introduction: Investigating language variation and change
- Part 1 Collecting empirical data
- Part 1.1 Fieldwork and linguistic mapping
- Part 1.2 Eliciting linguistic data
- Part 1.3 Alternatives to standard reference corpora
- 6 Using historical literature databases as corpora
- 7 Using the OED quotations database as a diachronic corpus
- 8 Using web-based data for the study of global English
- Part 2 Analysing empirical data
- Part 3 Evaluating empirical data
- Bibliography
- Index
- References
8 - Using web-based data for the study of global English
Published online by Cambridge University Press: 05 June 2014
Summary
Introduction
According to Biber et al. (1998: 4), a corpus is ‘a large and principled collection of natural texts’ (my emphasis). This definition obviously does not apply to the huge collection of texts that the World Wide Web constitutes, and in narrower corpus-linguistic terms the web therefore cannot be considered a corpus. However, the data available on the web have increasingly been used in corpus-linguistic investigations. This chapter focuses on why this is the case, how it can be done, and the gains and limitations of using web-based data for linguistic research.
There are several reasons why linguists have turned to the World Wide Web as a source of data. For the study of some phenomena, even large corpora comprising 100 million words or more are still not large enough. This holds for most kinds of lexicographic research, but investigating some of the more ephemeral points in English grammar may also necessitate larger sources of data. In addition, the internet has given rise to new text types such as e-mail, chat-room discussions, text messaging, blogs, or interactive internet magazines – text types that are interesting objects of study in themselves (e.g. Herring and Paolillo 2006; Tagliamonte 2008). Another reason for the allure of the World Wide Web is that compiling standard reference corpora takes a long time and considerable financial resources. Moreover, these representative corpora quickly go out of date when it comes to recent or ongoing change; Baker (2009) describes how the internet can be used to supplement existing standard corpora in this respect. Furthermore, apart from the International Corpus of English (ICE), corpus linguistics has largely focused on so-called inner-circle varieties of English, i.e. varieties of English as a first language, and within the inner circle the focus has been mostly on British (BrE) and American English (AmE). For even slightly more exotic varieties of English – like Bangladeshi or Pakistani English – we do not even have ICE components and are very unlikely to see them in the (near) future. The discussion in this chapter also applies in large part to the recently released Corpus of Global Web-Based English (GloWbE) (see corpus2.byu.edu/glowbe), a web-derived corpus of world Englishes.
- Type: Chapter
- Information: Research Methods in Language Variation and Change, pp. 158-178. Publisher: Cambridge University Press. Print publication year: 2013