Strictly speaking, all historical English linguistic studies are corpus-based in the sense that historical linguists, who cannot draw on native-speaker informants or their own intuition, have always had to rely on historical texts as their source of information. The main difference between modern-day, computer-based historical linguistics and traditional historical linguistics is that, before the advent of computer-readable text collections, researchers had to read through large amounts of text for their data and record individual relevant examples on slips of paper.1 Digitized texts offer a much more efficient way of retrieving data. But the question is whether every kind of digital text collection constitutes a corpus. We might also want to ask whether all historical linguistic research that draws on computerized text collections as a source of information is automatically corpus-based or whether we want to define “corpus-based historical linguistics” more narrowly.
According to Biber et al. (1998: 4), the defining characteristics of a corpus-based approach are that
• it is empirical, analysing the actual patterns of use in natural texts;
• it utilizes a large and principled collection of natural texts, known as a “corpus,” as the basis for analysis;
• it makes extensive use of computers for analysis, using both automatic and interactive techniques; and
• it depends on both quantitative and qualitative analytical techniques.
As regards the first characteristic, we need to be aware of the relation between method and theory: unlike research done within the generative tradition (see Chapter 3), corpus-based approaches typically make use of functional and cognitive models of grammar (see Chapter 4).
The idea of a principled text collection means that corpus compilation typically relies on a sampling frame, i.e., a clear idea as to which text extracts from which (sub)period and which range of text types will be included in the corpus. In order for this sampling frame to make the text collection representative of a language or a specific subsection of it, it needs to include “the full range of variability in a population” (Biber 1993: 243) and be based on a clear understanding of what that target population is.
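The notion of a sampling frame can be illustrated with a minimal sketch. It assumes a hypothetical catalogue of available texts, each labelled with a subperiod, a text type and a word count, and a single word target for every combination of subperiod and text type; the period boundaries, text types, identifiers and function names are invented for illustration and do not correspond to any existing corpus. The sketch fills each cell of the frame at random and reports the cells that fall short of the target, i.e., the places where the collection would not cover the intended range of variability.

```python
import random
from collections import defaultdict

# Hypothetical catalogue of available texts:
# (text id, subperiod, text type, word count). All values are invented.
CATALOGUE = [
    ("t01", "1500-1570", "letters", 3000),
    ("t02", "1500-1570", "letters", 3000),
    ("t03", "1500-1570", "sermons", 2000),
    ("t04", "1570-1640", "letters", 5000),
    ("t05", "1570-1640", "letters", 5000),
    ("t06", "1570-1640", "sermons", 6000),
]


def sample_by_frame(catalogue, words_per_cell, seed=0):
    """Fill every (subperiod, text type) cell of the frame.

    Texts are drawn in random order until the cell reaches the word
    target or no candidates are left. Returns a dict mapping each cell
    to (list of chosen text ids, total words).
    """
    rng = random.Random(seed)

    # Group the catalogue into the cells of the sampling frame.
    cells = defaultdict(list)
    for text_id, period, text_type, words in catalogue:
        cells[(period, text_type)].append((text_id, words))

    selection = {}
    for cell in sorted(cells):
        candidates = list(cells[cell])
        rng.shuffle(candidates)
        chosen, total = [], 0
        for text_id, words in candidates:
            if total >= words_per_cell:
                break
            chosen.append(text_id)
            total += words
        selection[cell] = (chosen, total)
    return selection


def underfilled_cells(selection, words_per_cell):
    """Return the cells whose sampled word total is below the target."""
    return [cell for cell, (_, total) in selection.items()
            if total < words_per_cell]
```

Calling `sample_by_frame(CATALOGUE, 5000)` and then `underfilled_cells(...)` on the result flags the early sermons cell, for which the invented catalogue simply holds too little material. Real corpus compilation involves many further decisions (extract length, authorship, regional origin, and so on) that this sketch leaves out.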