BasiScript: a corpus of written language output produced by elementary school children in the Netherlands, annotated for spelling, word frequencies and word properties, and a 20,000-word lexicon annotated for word senses.
05 / 2012 - 10 / 2015
Written text comprehension and production are indispensable skills in the information society. Information on the written language children encounter (written input) and produce (written output) is vital to researchers and to authors of school methods, of assessment tests, of other child-directed texts, and of automatic spelling and writing aid programmes. While the NWO/GW-funded BasiLex project is directed at the compilation of a corpus of children's written input, BasiScript, the facility applied for here, is directed at a corpus of children's written output. BasiScript will comprise 12 million words of written language output, based on three years of data collection in a large, representative sample of Dutch elementary school children. The corpus will be richly annotated: for each word form, the correct orthographic form, the associated lemma and the part of speech will be provided. In addition, the 20,000 most frequent words will be annotated for word meaning, while all words in the corpus will be annotated for several word properties (including corpus frequency and distribution, length, and family size). Earlier research has shown that these word properties influence children's performance on various language tasks. BasiScript is of crucial importance to all research into the development of children's written language production skills. It enables, for instance, research into the development (over a period of three years) of spelling, written vocabulary, and the use of morpho-syntactic structures. Because the corpus comprises written output from a large number of children with different geographical, social, linguistic, and school backgrounds, it will be possible to investigate the impact of these backgrounds. Moreover, the annotation for diagnosed handicaps (e.g. dyslexia) will benefit studies into the writing abilities of dyslexic children, while the annotation for home language will make BasiScript very informative for any teacher who teaches Dutch as a second language. Through comparison with JASMIN-CGN, a corpus that includes Dutch children's spoken output, it will also be possible to address research questions about how the development of written language skills is influenced by the spoken language skills acquired earlier. Furthermore, comparing the two derived lists of the 20,000 most frequently used words, in BasiLex and in BasiScript respectively, will give insight into the commonalities and differences between children's written word input and written word output, which allows the effect of school methods to be investigated. For instance, are words that are presented explicitly in spelling lessons spelled better than words of similar difficulty that are not presented explicitly? Or, to give another example, does input of abstract versus concrete words predict output to a similar degree? In other words, do children need to have read abstract words like "ambivalence" more often than concrete words like "elephant" in order to learn their meaning? BasiScript is of immediate relevance to researchers from various disciplines (including linguistics, psycholinguistics, sociolinguistics, language development, pedagogy and language technology) and to professionals (school practitioners and developers of school and test materials), and will serve the needs of both groups.