Computational linguistics is a research field that investigates the computational mechanisms of language comprehension and production and realizes them as computer programs. Researchers in this field have been working on formal descriptions of lexical, syntactic, semantic, and pragmatic knowledge, and on algorithms for parsing and generation that use these descriptions as rules governing the computational process and as heuristics for assigning preferences among those rules. In traditional approaches, these rules and heuristics were designed by expert linguists and computational linguists; in more recent approaches, they are acquired automatically from large language resources such as text and speech corpora, electronic dictionaries, and thesauri. In the following sections, we describe language resources for Japanese computational linguistics, tools for retrieving and processing the information in them, how they are used in computational linguistic studies, and how they can be used in psycholinguistic studies.
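As a rough illustration of what acquiring preference heuristics from an annotated corpus can mean, the following minimal sketch estimates, by relative frequency, which part-of-speech tag to prefer for an ambiguous word form. It is not the method of any particular system discussed here. The corpus `TOY_CORPUS`, its romanized word forms and tag names, and the functions `acquire_preferences` and `preferred_tag` are all hypothetical and were invented for this example.

```python
from collections import Counter, defaultdict

# Toy, hand-made annotated corpus (hypothetical data, not taken from any
# real resource): each item is (surface form, part-of-speech tag).
TOY_CORPUS = [
    ("de", "particle"),
    ("de", "particle"),
    ("de", "auxiliary"),
    ("no", "particle"),
    ("no", "particle"),
    ("no", "noun"),
    ("hon", "noun"),
]


def acquire_preferences(corpus):
    """Estimate the relative frequency of each tag for each surface form."""
    counts = defaultdict(Counter)
    for surface, tag in corpus:
        counts[surface][tag] += 1

    prefs = {}
    for surface, tag_counts in counts.items():
        total = sum(tag_counts.values())
        prefs[surface] = {tag: n / total for tag, n in tag_counts.items()}
    return prefs


def preferred_tag(prefs, surface):
    """Return the most frequent tag for a surface form, or None if unseen."""
    candidates = prefs.get(surface)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


PREFS = acquire_preferences(TOY_CORPUS)

if __name__ == "__main__":
    for word in ("de", "no", "hon", "unknown"):
        print(word, preferred_tag(PREFS, word))
```

In this sketch the preference for a tag is simply its share of the observed occurrences of a word form, so the quality of the acquired heuristic depends directly on the size and annotation quality of the corpus.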
Text and speech corpora are the most important resources in computational linguistics. Progress in research on a particular language depends heavily on how many good corpora are available for that language. Japan has fallen behind the US and European countries in corpus development and maintenance. Before the 1990s, researchers at individual institutes developed small-scale corpora, and no leading organization comparable to the LDC (Linguistic Data Consortium) in the US was formed. Large-scale Japanese corpora did not become available until the mid-1990s.
Table summarizes the major Japanese text and speech corpora that are available now or will become available in the near future. They are used extensively in computational linguistic studies of Japanese, such as the development and tuning of automatic morphological analyzers and syntactic parsers, as well as information retrieval, text summarization, and dialog systems.