Learning computational grammars using inductive logic programming (ILP)
10 / 1998 - 11 / 2003
The approach to the task of machine learning of computational grammars to be explored is that of Inductive Logic Programming (ILP). ILP is a general machine learning methodology based on the operation of induction. Induction can be seen as the reverse of deduction: it generalises from background knowledge and a set of examples (data) to a logic program that fits both the background and the data. The data is derived from annotated linguistic corpora, such as the Penn Treebank.

One domain-independent ILP algorithm is Progol. Its Prolog implementation (P-Progol) has been successfully employed in various domains, but not in natural language processing. An attempt to use it to induce a Definite Clause Grammar (DCG) for English noun phrases was not particularly successful. The experience gained will nevertheless be valuable for developing an ILP learner specifically designed for inducing natural language grammars, with the hope of overcoming the limitations of a general-purpose machine learner. The main issues to be tackled are the adaptation of the generic algorithm to the task and the design of a background knowledge declaration formalism suitable for describing known cross-linguistic generalisations as well as facts specific to the language being learned.

Accurate and detailed syntactically annotated corpora are difficult to build, and their number and size are very limited. It would, therefore, be of interest to investigate the extent to which annotation is necessary for the learning process. A learner with a transparent, symbolic representation of background knowledge can be used to try to establish the relation between background knowledge and level of annotation. In particular, the aim is to investigate whether cross-linguistic generalisations known to hold in Germanic syntax can be used to reduce the detail and/or consistency requirements.
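To make the learning target concrete, the following is a minimal sketch of the kind of noun-phrase grammar an ILP learner would be expected to induce. The native formalism would be a Prolog Definite Clause Grammar; here the same rules are encoded as context-free productions in Python and parsed top-down. The grammar rules and lexicon are hypothetical illustrations, not rules induced by the project.

```python
# Hypothetical DCG-style rules for English noun phrases,
# e.g. NP -> Det Nbar, Nbar -> Adj Nbar | N.
GRAMMAR = {
    "NP": [["Det", "Nbar"], ["Nbar"]],
    "Nbar": [["Adj", "Nbar"], ["N"]],
}

# Toy lexicon mapping pre-terminal categories to words.
LEXICON = {
    "Det": {"the", "a"},
    "Adj": {"big", "red"},
    "N": {"dog", "ball"},
}

def parse(symbol, words):
    """Return the list of remainders left after matching `symbol`
    against a prefix of `words` (empty list = no parse)."""
    if symbol in LEXICON:
        if words and words[0] in LEXICON[symbol]:
            return [words[1:]]
        return []
    remainders = []
    for production in GRAMMAR.get(symbol, []):
        # Thread the word list through each symbol of the production.
        partial = [words]
        for sub in production:
            partial = [rest for p in partial for rest in parse(sub, p)]
        remainders.extend(partial)
    return remainders

def is_np(words):
    """A word sequence is an NP if some parse consumes all of it."""
    return any(rest == [] for rest in parse("NP", list(words)))
```

An induced grammar of this shape can be checked directly against corpus-derived examples, e.g. `is_np(["the", "big", "red", "dog"])` holds while `is_np(["dog", "the"])` does not; the ILP task is to find the rule set from such positive and negative examples plus background knowledge.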