Corpora

My work over the last several years includes the creation of the phonetic transcription component of the Weist-Jarosz Corpus of Child Polish, which is freely available as part of the CHILDES project on child language. The corpus includes audio recordings of spontaneous productions of four children acquiring Polish and their interactions with their primary caregivers.

The audio-linked phonetic and orthographic transcripts of the child speech can be viewed online at:

http://childes.psy.cmu.edu/browser/index.php?url=Slavic/Polish/Weist-Jarosz/

and downloaded in CHAT format here:

http://childes.psy.cmu.edu/data/Slavic/Polish/
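For readers who want to work with the downloaded files programmatically, the following is a minimal sketch of how CHAT transcripts can be read in Python. It relies only on general CHAT conventions: header lines begin with `@`, main speaker tiers with `*`, dependent tiers with `%`, and continuation lines with a tab. The function name `parse_chat`, the sample text, and the assumption that phonetic transcriptions sit on a `%pho` tier are illustrative and not taken from the corpus itself, so check them against the actual files.

```python
from typing import Any, Dict, List


def parse_chat(text: str) -> List[Dict[str, Any]]:
    """Group each CHAT main tier (*SPK:) with its dependent tiers (%xxx:)."""
    utterances: List[Dict[str, Any]] = []
    last = None  # (dict, key) that a tab-indented continuation line extends
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw.startswith("\t"):
            # Continuation of the previous main or dependent tier.
            if last is not None:
                target, key = last
                target[key] += " " + raw.strip()
            continue
        if raw.startswith("@"):
            # Header lines (@Begin, @Participants, ...) are skipped here.
            last = None
            continue
        if raw.startswith("*") and ":" in raw:
            speaker, _, content = raw[1:].partition(":")
            utt = {"speaker": speaker, "text": content.strip(), "tiers": {}}
            utterances.append(utt)
            last = (utt, "text")
        elif raw.startswith("%") and ":" in raw and utterances:
            name, _, content = raw[1:].partition(":")
            tiers = utterances[-1]["tiers"]
            tiers[name] = content.strip()
            last = (tiers, name)
    return utterances


# Invented sample for illustration only; not an excerpt from the corpus.
sample = (
    "@Begin\n"
    "*CHI:\tmama .\n"
    "%pho:\tmama\n"
    "*MOT:\ttak\n"
    "\ttak .\n"
    "@End\n"
)

for utt in parse_chat(sample):
    print(utt["speaker"], utt["text"], utt["tiers"])
```

A dedicated CHAT reader will handle more of the format than this sketch does; the point here is only to show how utterances and their phonetic tiers line up in the files.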

The corpus is also part of the PhonBank project on child phonology and is available in Phon format here:

http://childes.talkbank.org/phon/phoncorpora.html

Please let me know if you use the data for any projects; I would love to hear how it is being used. If you use this corpus in published work, please cite the following two papers for the phonological component of the corpus:

  • Gaja Jarosz, Shira Calamaro, and Jason Zentz. 2013. Input Frequency and the Acquisition of Syllable Structure in Polish. Yale University Manuscript.
  • Gaja Jarosz. 2010. Implicational Markedness and Frequency in Constraint-Based Computational Models of Phonological Learning. Journal of Child Language 37(3), Special Issue on Computational Models of Child Language Learning, 565–606. Cambridge: Cambridge University Press.