
Morphosyntactic annotation of CHILDES transcripts*



Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes.
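As an illustration of what a labeled dependency structure looks like, the sketch below represents each word of an utterance as a (position, head position, relation label) triple and pairs every word with its head. The pipe-separated encoding, the example utterance, the label names SUBJ, ROOT and OBJ, and the helper names are assumptions made for this sketch; it is not the project's parser or a specification of its annotation scheme.

```python
from typing import List, NamedTuple, Tuple


class Relation(NamedTuple):
    index: int  # 1-based position of the dependent word
    head: int   # position of the head word; 0 marks the root of the utterance
    label: str  # grammatical relation label (hypothetical names such as SUBJ)


def parse_relations(tier: str) -> List[Relation]:
    """Parse whitespace-separated 'index|head|label' items into Relation records."""
    relations = []
    for item in tier.split():
        index, head, label = item.split("|")
        relations.append(Relation(int(index), int(head), label))
    return relations


def attach_words(words: List[str], relations: List[Relation]) -> List[Tuple[str, str, str]]:
    """Pair each dependent word with its relation label and its head word."""
    triples = []
    for rel in relations:
        head_word = "ROOT" if rel.head == 0 else words[rel.head - 1]
        triples.append((words[rel.index - 1], rel.label, head_word))
    return triples


# Hypothetical three-word utterance and its relations.
words = ["you", "eat", "cookies"]
relations = parse_relations("1|2|SUBJ 2|0|ROOT 3|2|OBJ")
triples = attach_words(words, relations)

for dependent, label, head in triples:
    print(f"{dependent} --{label}--> {head}")
```

Because every word points to exactly one head, a structure like this can be searched directly for constructions of interest, for example all utterances in which a child produces a verb with an overt subject.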


Corresponding author

Address for correspondence: Kenji Sagae, USC Institute for Creative Technologies, 13274 Fiji Way, Marina del Rey, CA 90292. e-mail:



We thank Marina Fedner for help with annotation of the English data, and Bracha Nir for help with annotation of the Hebrew data. This research was supported in part by Grant No. 2007241 from the United States–Israel Binational Science Foundation (BSF) and by the National Science Foundation (NSF) under grant IIS-0414630.



Journal of Child Language
  • ISSN: 0305-0009
  • EISSN: 1469-7602
  • URL: /core/journals/journal-of-child-language