2 - Words, Sentences, Corpora
from I - Foundations
Published online by Cambridge University Press: 05 June 2012
Summary
This chapter is intended for readers who have little or no background in natural language processing. We introduce basic linguistic concepts and explain their relevance to statistical machine translation. Starting with words and their linguistic properties, we move up to sentences and to issues of syntax and semantics. We also discuss the role text corpora play in building a statistical machine translation system and the basic methods used to obtain and prepare these data sources.
Words
Intuitively, the basic atomic unit of meaning is a word. For instance, the word house evokes the mental image of a rectangular building with a roof and smoking chimney, which may be surrounded by grass and trees and inhabited by a happy family. When house is used in a text, the reader adapts this meaning based on the context (her grandmother's house is different from the White House), but we are able to claim that almost all uses of house are connected to that basic unit of meaning.
There are smaller units, such as syllables or sounds, but neither hou nor s evokes the mental image or meaning of house. The notion of a word seems straightforward for English readers. The text you are reading right now has its words clearly separated by spaces, so they stand out as units.
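The observation that English words are separated by spaces can be illustrated with a minimal sketch. The function name `whitespace_tokenize` and the example sentence (built from the phrases used above) are illustrative assumptions, not taken from the chapter, and splitting on whitespace is only the simplest possible way to find word boundaries.

```python
def whitespace_tokenize(text):
    """Split text into tokens wherever there is a run of whitespace.

    Hypothetical helper for illustration only: it treats every
    space-separated string as one word.
    """
    return text.split()


# Example sentence assembled from the phrases used in the text above.
sentence = "her grandmother's house is different from the White House"
tokens = whitespace_tokenize(sentence)

for token in tokens:
    print(token)
```

Note that this sketch keeps punctuation attached to the neighboring word and treats house and House as different strings, so it is a starting point under simplifying assumptions, not a complete tokenizer.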
- Type: Chapter
- Information: Statistical Machine Translation, pp. 33-62
- Publisher: Cambridge University Press
- Print publication year: 2009