Linguistic Information Processing



Title Linguistic Information Processing
Period 01 / 2006 - 12 / 2009
Status Completed
Research number OND1331429


Mission statement
The aim of this programme is to develop computational and mathematical models of the processing of spoken and written language.

Focus
Essential for reaching this goal are language resources such as large corpora of naturally produced texts (both written and spoken). Moreover, advanced tools and techniques are needed for extracting and modeling the linguistic structure of, and the information carried by, naturally produced stretches of discourse. Corpus data are complemented by phonetic and psycholinguistic experiments in the laboratory. On the one hand, this helps overcome the inescapable problem of data sparseness; on the other hand, it makes it possible to test and validate the predictions of our models against actual human behaviour.

This programme, which combines theory building, application, and the creation of language resources, is structured in the form of four closely related sub-programmes: (1) Models for human and automatic speech recognition, (2) Lexical information processing, (3) Cross-fertilisation of rule-based and probabilistic language models, and (4) Corpus building and exploration.

Sub-programmes

1. Models for human and automatic speech recognition
The sub-programme Models for human and automatic speech recognition combines three closely connected research lines. The first line of research is directed towards improving automatic speech recognition systems, guided by insights from human speech recognition. The second line of research applies insights from speech technology to the development of comprehensive computational models of human speech recognition. Finally, the third line applies the insights obtained from modeling automatic and human speech recognition to multimodal human-machine interaction, computer-assisted language learning and augmented communication. The main focus of this sub-programme is on developing novel models of automatic and human speech recognition.
We are building improved models for automatic speech recognition by integrating knowledge about essential aspects of human speech processing. At the same time, we are applying successful techniques from the field of automatic speech recognition to develop comprehensive computational models of human speech recognition. We plan to continue and extend research in modeling pronunciation variation and to further enhance our existing procedures for the creation of automatic phonetic transcriptions. We are also continuing our research in the field of robust automatic speech recognition, informed by the theory of active perception. A new line of research addresses the contribution that non-verbal, mainly prosodic, information in the speech signal can make to the recognition and interpretation of the verbal message. The long-term goal of this research is to create a novel architecture for speech recognition that can be used both as an operational recognition engine and as a realistic model of human speech recognition.

Application-directed research is concerned with multimodal human-machine interaction, computer-assisted language learning and augmented communication. In multimodal interaction the emphasis is on combining speech and pen input in services designed for use with small mobile terminals. In addition, we intend to investigate dialog models based on layered perception-action loops. Language learning applications address two different groups of users: users who are learning Dutch as their second language, and users with communicative disorders. For both user groups we are designing and testing computer-assisted systems for training and testing oral proficiency. The research aimed at relieving communication impairments is conducted in close collaboration with the Sint Maartenskliniek, in the framework of the Knowledge Centre for Language and Speech Technology in Rehabilitation.
Although the research in this sub-programme is concerned primarily with Dutch as object language, we will also use English benchmark corpora when and where these are available, to facilitate comparison with the results obtained in the international community. Not surprisingly, the Spoken Dutch Corpus and its emerging extensions play an important role in this sub-programme.

2. Lexical Information Processing
Any tool in the arena of speech recognition and language technology depends on a lexicon that defines the words of the language or specialised sublanguage it subserves. In this sub-programme, we are investigating the details of the kinds of information stored in the lexicon. To do so, we combine methods from lexical statistics applied to large data sets extracted from corpora with phonetic and psycholinguistic experiments with human subjects. For the coming years, we envisage a continuation of our corpus-based surveys of the fine phonetic detail of morphologically complex words, which we have found to reflect to a remarkable degree the information load carried by a word in its lexical neighbourhood. We will also expand our investigations of the distribution of semantic properties of words by using state-of-the-art vector space technology.

In parallel, our line of experimental research addressing human morphological processing will be continued. We have been able to establish that the human lexicon is extremely rich, in that it contains memory traces for almost any word, simple or complex, that it has encountered, and that it is highly sensitive to fine phonetic detail, both in speech comprehension and in speech production. Our experimental research will not only help constrain psycholinguistic models of lexical representation and processing, but will also help inform models of automatic speech recognition.

3. Cross-fertilisation of rule-based and probabilistic language models
The field of research traditionally known as computational linguistics has focused nearly exclusively on the processing of written language. This sub-programme intends to apply the experience and expertise in this field to mainstream language technology. Our key interest is how structural linguistic information can contribute to improving the performance of applications such as information retrieval and question answering. The present generation of applications in language technology is almost exclusively based on probabilistic techniques. The aim of this research is to better understand the reasons for the limitations in the performance of existing technology and how the performance of these applications can be improved by including linguistic knowledge provided by the powerful rule-based parsers that we have developed for English and Dutch. We expect that, as a spin-off, this endeavor will result in more comprehensive descriptions (grammars), as well as in improved parsers. We aim to develop principled methods for adapting wide-coverage parsers to the characteristics of language use in specific domains, such as telecommunication, and the medical and legal professions.

The first field in which we will test our approach is question answering. Specifically, we will address the issue of how principled linguistic processing can be brought to bear on the as yet unsolved problem of answering why-questions.

In the past, little attention has been paid to the usability of products based on language technology. Specifically, the human factor in information extraction and information retrieval is under-researched. This sub-programme addresses the human factor explicitly with respect to the actual usability of language technology products, and specifically those applications designed for information extraction.

4. Development and evaluation of language resources and tools
Each of the three preceding sub-programmes crucially depends on the availability of large corpora of spoken and written language as well as on suitable tools for processing the corpus data. In the past, this sub-programme was heavily involved in the creation and maintenance of the Spoken Dutch Corpus, and the accumulated expertise in corpus building is now applied to the development of further corpora. We intend to strengthen our position in the STEVIN Programme by taking a leading role in the creation of a very large and richly annotated corpus of contemporary written Dutch. In addition, we want to play a pivotal role in enlarging the Spoken Dutch Corpus, and in building processing tools for spoken language, such as orthographic and automatic phonetic transcription.

In parallel, this sub-programme is developing a range of corpus-based tools for novel and scientifically interesting applications of language technology, for example, stylometry, authorship attribution, and the profiling of socio-geographic language variation. This will allow us to extend the domain in which our expertise in language resources and language technology tools and techniques can be deployed to other fields of inquiry, such as literary and historical studies.

During the last decade the issue of how applications of language and speech technology can be evaluated has become ever more urgent. We will respond to this development by developing suitable evaluation procedures for the applications mentioned in the previous paragraph, as well as for the evaluation of language technology components in more conventional applications such as question answering.

In the coming years we will intensify our already close working relations with the MPI, and especially its Technical Support Group, in order to build an e-Humanities research infrastructure dedicated to the needs of linguistics and literary and cultural studies.
For the development of tools for speech processing we will closely collaborate with the Katholieke Universiteit Leuven.
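
As a rough illustration of the authorship attribution mentioned under sub-programme 4, the sketch below compares function-word frequency profiles with cosine similarity. It is not the programme's own method. The word list, the function names (`profile`, `cosine`, `attribute`) and the whitespace tokenisation are assumptions made purely for this example; a real stylometric study would use a much larger word list, proper tokenisation, and far more text per author.

```python
from collections import Counter
import math

# Hypothetical, deliberately tiny list of Dutch function words (assumption).
FUNCTION_WORDS = ["de", "het", "een", "en", "van", "in", "dat", "niet"]


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = text.lower().split()  # naive whitespace tokenisation
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(unknown_text, candidates):
    """Return the candidate author whose sample text is most similar in profile.

    `candidates` maps an author name to a sample of that author's writing.
    """
    target = profile(unknown_text)
    return max(candidates, key=lambda name: cosine(target, profile(candidates[name])))


if __name__ == "__main__":
    samples = {
        "author_a": "de de de het een en",
        "author_b": "niet niet dat dat van in",
    }
    print(attribute("de de het", samples))
```

Function words are a common choice for this kind of comparison because their frequencies depend less on topic than content words do, but the choice of features and distance measure here is only one of many possibilities.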
