Book contents
- Data and Methods in Corpus Linguistics
- Copyright page
- Contents
- Figures
- Tables
- Contributors
- Acknowledgements
- Introduction: Comparative Approaches to Data and Methods in Corpus Linguistics
- Part I Corpus Dimensions and the Viability of Methodological Approaches
- Part II Selection, Calibration and Preparation of Corpus Data
- Part III Perspectives on Multifactorial Methods
- Part IV Applications of Classification-Based Approaches
- 10 Comparing Corpus-Driven and Corpus-Based Approaches to Diachronic Variation
- 11 Comparing Annotation Types and n-Gram Sizes
- Index
- References
11 - Comparing Annotation Types and n-Gram Sizes
German Discourse Particles and Their English Reflexes in a Translation Corpus
from Part IV - Applications of Classification-Based Approaches
Published online by Cambridge University Press: 06 May 2022
Summary
The chapter addresses a problem of contrastive pragmatics: how can we study correspondences between pragmatic markers in two languages if one language has a class of elements that the other lacks? Specifically, it deals with the German modal particles ja and doch and their reflexes in English translations. As there is no predetermined set of potential English correspondences, traditional distributional analyses are not feasible, and methods from Natural Language Processing are explored instead. Using 32 types of n-grams, differing in length and type of annotation, three classification tasks are carried out to identify cues in the English translations that reflect the presence (or absence) of a particle in the German original. The results show that lemma unigrams and lemma bigrams are often the most informative (i.e. most accurate), while trigrams and 1-skip-2-grams provide important information about concomitants of modal particles that unigrams and bigrams miss. More generally, the linguistic observables (here, n-grams) that form the basis of quantitative analyses need to be carefully selected and explored in terms of their contribution to linguistic analysis.
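To make the n-gram types named in the summary concrete, the following minimal sketch extracts contiguous n-grams and skip-bigrams from a lemmatised English sentence. It is an illustration under stated assumptions, not the chapter's implementation: the function names `ngrams` and `skip_bigrams` and the example sentence are invented here, and "1-skip-2-gram" is read narrowly as a pair of tokens with exactly one intervening token skipped (broader definitions also count contiguous bigrams).

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens: List[str], skip: int = 1) -> List[Tuple[str, str]]:
    """Return token pairs separated by exactly `skip` intervening tokens.

    Assumption: this narrow reading of a k-skip-2-gram excludes
    contiguous bigrams; the chapter's own definition may differ.
    """
    gap = skip + 1
    return [(tokens[i], tokens[i + gap]) for i in range(len(tokens) - gap)]


# Hypothetical lemmatised sentence, invented for illustration only.
lemmas = ["as", "you", "know", "he", "be", "ill"]

unigrams = ngrams(lemmas, 1)
bigrams = ngrams(lemmas, 2)
trigrams = ngrams(lemmas, 3)
skipgrams = skip_bigrams(lemmas, skip=1)
```

In a classification setting of the kind the summary describes, counts of such units per translated sentence would serve as features; the same functions can be applied to word forms or part-of-speech tags instead of lemmas to vary the annotation type.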
- Type: Chapter
- Book: Data and Methods in Corpus Linguistics: Comparative Approaches, pp. 323-352
- Publisher: Cambridge University Press
- Print publication year: 2022