Book contents
- Frontmatter
- Contents
- Preface
- 1 Introduction: goals and methods of the corpus-based approach
- Part I Investigating the use of language features
- Part II Investigating the characteristics of varieties
- Part III Summing up and looking ahead
- Part IV Methodology boxes
- 1 Issues in corpus design
- 2 Issues in diachronic corpus design
- 3 Concordancing packages versus programming for corpus analysis
- 4 Characteristics of tagged corpora
- 5 The process of tagging
- 6 Norming frequency counts
- 7 Statistical measures of lexical associations
- 8 The unit of analysis in corpus-based studies
- 9 Significance tests and the reporting of statistics
- 10 Factor loadings and dimension scores
- Appendix: commercially available corpora and analytical tools
- References
- Index
5 - The process of tagging
Published online by Cambridge University Press: 05 June 2012
Summary
The analysis process
Most taggers (the programs that tag an uncoded corpus) make use of several kinds of information. First, they have dictionaries which list the category or categories that a particular word can belong to. Some words, such as “the” and “a,” are not ambiguous; they can be automatically identified as the definite and indefinite article, respectively. Other words are ambiguous, such as “deal,” which can be a noun or a verb. Dictionaries can also identify fixed expressions (e.g., identifying the sequence “and so forth” as an adverb or “such that” as a subordinator). Finally, dictionaries can have lists of words that take certain grammatical patterns (e.g., the verbs or nouns that can control complement clauses).
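The dictionary lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon entries, the tag names, and the function `candidate_tags` are hypothetical and are not taken from any particular tagger.

```python
# Hypothetical toy lexicon: each word maps to the set of categories it can take.
LEXICON = {
    "the": {"ARTICLE"},          # unambiguous
    "a": {"ARTICLE"},            # unambiguous
    "deal": {"NOUN", "VERB"},    # ambiguous
}

# Hypothetical fixed expressions, stored as token tuples that receive one tag.
FIXED_EXPRESSIONS = {
    ("and", "so", "forth"): "ADVERB",
    ("such", "that"): "SUBORDINATOR",
}


def candidate_tags(tokens):
    """Return (text, candidate tag set) pairs for a list of tokens.

    Fixed expressions are tried first, longest match first; any other
    token is looked up in the lexicon, with UNKNOWN as the fallback.
    """
    max_len = max(len(key) for key in FIXED_EXPRESSIONS)
    result = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(max_len, 1, -1):
            if i + n > len(tokens):
                continue  # not enough tokens left for an n-word expression
            chunk = tuple(t.lower() for t in tokens[i:i + n])
            if chunk in FIXED_EXPRESSIONS:
                result.append((" ".join(tokens[i:i + n]), {FIXED_EXPRESSIONS[chunk]}))
                i += n
                matched = True
                break
        if not matched:
            result.append((tokens[i], LEXICON.get(tokens[i].lower(), {"UNKNOWN"})))
            i += 1
    return result
```

Calling `candidate_tags(["the", "deal", "and", "so", "forth"])` leaves “deal” with two candidate categories, which is the kind of ambiguity the later steps have to resolve.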
For words that are ambiguous, many taggers make use of probabilistic information. This information is derived from existing, accurately tagged corpora (such as the LOB corpus, for which all the grammatical tags were checked). The probabilistic information tells the tagger how likely it is that a given word belongs to one class or another. “Book,” for instance, can be a verb or a noun, but it has a much higher probability of occurring as a noun.
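One simple way to obtain such word-level probabilities is to count how often each word carries each tag in a checked corpus. The sketch below assumes this relative-frequency approach; the tiny `TAGGED_CORPUS` list is an invented stand-in rather than real corpus data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Invented stand-in for a hand-checked tagged corpus: (word, tag) pairs.
TAGGED_CORPUS = [
    ("book", "NOUN"), ("book", "NOUN"), ("book", "NOUN"), ("book", "VERB"),
    ("deal", "NOUN"), ("deal", "VERB"),
]


def lexical_probabilities(tagged):
    """Estimate P(tag | word) as the relative frequency of each tag per word."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word.lower()][tag] += 1
    probs = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        probs[word] = {tag: n / total for tag, n in tag_counts.items()}
    return probs


def most_likely_tag(word, probs):
    """Pick the tag with the highest estimated probability for a known word."""
    tag_probs = probs[word.lower()]
    return max(tag_probs, key=tag_probs.get)
```

With these invented counts, “book” is tagged as a noun in three of its four occurrences, so `most_likely_tag("book", lexical_probabilities(TAGGED_CORPUS))` selects the noun reading.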
Probabilities can also be applied to a sequence of tags. For example, to disambiguate “respect” in the phrase “in respect of the,” the tagger would consider the probability of a preposition-verb-preposition sequence versus a preposition-noun-preposition sequence.
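The comparison of tag sequences can be illustrated with a small sketch that multiplies tag-to-tag transition probabilities along each candidate sequence. The bigram model is an assumption made for the example, the probability values in `TRANSITIONS` are invented for illustration, and the function names are hypothetical.

```python
# Invented transition probabilities P(next tag | previous tag), for illustration only.
TRANSITIONS = {
    ("PREP", "NOUN"): 0.30,
    ("PREP", "VERB"): 0.02,
    ("NOUN", "PREP"): 0.25,
    ("VERB", "PREP"): 0.15,
}


def sequence_probability(tags, transitions):
    """Multiply the transition probabilities between adjacent tags."""
    prob = 1.0
    for prev_tag, next_tag in zip(tags, tags[1:]):
        # Unseen transitions get probability 0.0 in this simplified sketch.
        prob *= transitions.get((prev_tag, next_tag), 0.0)
    return prob


def best_sequence(candidates, transitions):
    """Return the candidate tag sequence with the highest probability."""
    return max(candidates, key=lambda tags: sequence_probability(tags, transitions))


# Candidate analyses of "in respect of": "respect" as a verb or as a noun.
CANDIDATES = [
    ("PREP", "VERB", "PREP"),
    ("PREP", "NOUN", "PREP"),
]
```

Under these invented values, `best_sequence(CANDIDATES, TRANSITIONS)` prefers the preposition-noun-preposition sequence, so “respect” would be tagged as a noun.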
- Type: Chapter
- Information: Corpus Linguistics: Investigating Language Structure and Use, pp. 261-262. Publisher: Cambridge University Press. Print publication year: 1998.