Text-to-text generation by monolingual machine translation
08 / 2008 - 06 / 2013
The aim of the MEMPHIX (memory based paraphrasing using implicit and explicit semantics) project is to develop a method for generating paraphrases automatically. Paraphrasing can serve to explain something or to provide feedback in dialogue. Shorter paraphrases are useful for subtitles or news feeds, and paraphrasing can also change the register of a text. It may also help to improve the performance of question answering, dialogue systems and machine translation.

In the MEMPHIX project, a system is built that learns to generate paraphrases from examples by treating paraphrasing as a monolingual machine translation task. Although paraphrase generation can be driven primarily by surface similarities, leaving semantics completely implicit, explicit semantic information may also play a role. Such information can be computed automatically (parsing, semantic role labeling, co-reference resolution). The project compares the direct, implicit route with the use of explicitly computed semantics.

The project has access to a Dutch corpus of over a million words developed in the DAESO project, consisting of pairs of texts from various domains that express paraphrased or at least comparable information. However, the DAESO data alone might not be enough. The first step in the project has therefore been to collect more aligned paraphrases automatically by mining the web: headline clusters are acquired from Google News, and within each cluster paraphrase candidates are selected using surface similarities. These aligned paraphrases can then be used to train an MT system.
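
Selecting paraphrase candidates within a headline cluster by surface similarity could be sketched roughly as follows. This is a minimal illustration and not the project's actual implementation: the word-overlap (Jaccard) measure, the `low` and `high` thresholds, and all function names are assumptions made for the example, and the sample headlines are invented.

```python
from itertools import combinations


def tokens(headline):
    """Lowercase a headline and return its set of word tokens."""
    words = (w.strip(".,:;!?\"'()") for w in headline.lower().split())
    return {w for w in words if w}


def jaccard(a, b):
    """Word-overlap similarity: shared tokens divided by all tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def paraphrase_pairs(cluster, low=0.3, high=0.9):
    """Return headline pairs from one cluster whose similarity lies in [low, high].

    Assumption: pairs below `low` are treated as unrelated, and pairs above
    `high` as near-duplicates that add little training signal.
    """
    pairs = []
    for a, b in combinations(cluster, 2):
        score = jaccard(a, b)
        if low <= score <= high:
            pairs.append((a, b, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    # Invented headlines standing in for one news cluster.
    cluster = [
        "Police arrest suspect in bank robbery",
        "Suspect arrested after bank robbery",
        "Stock markets fall sharply",
    ]
    for source, target, score in paraphrase_pairs(cluster):
        print(f"{score:.3f}\t{source}\t{target}")
```

Each selected pair can then be written out as a source line and a target line, the parallel-text form in which an MT system is usually trained. The thresholds would need tuning on real data.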