AMICUS: Automated motif discovery in cultural heritage and scientific communication texts
09 / 2009 - unknown
We identify a common need in the processing of texts from both the cultural heritage (CH) and scientific communication (SC) domains: to perform automated, large-scale higher-order text analytics, i.e., to reach a level of text understanding advanced enough that structured knowledge can be extracted from unstructured text. The four research groups applying for this grant propose to tackle an important aspect of this complex issue by investigating how linguistic elements convey motifs in texts from the CH and SC domains. The applicants share a working hypothesis that the identity of higher-order content-bearing elements, i.e., textual units that are typically used for document indexing, classification, enrichment, and the like, strongly depends on community perception. One prominent yet little-investigated content-bearing unit is the motif: an element that keeps recurring in an artifact (e.g., in film or music, but also in folklore or scientific texts) and that often conveys a narrative theme. For example, the victory of the youngest son against all odds is a motif in folktales. In bioinformatics, the motif of a gene array study forms the mold of countless articles. In the newly developed area of web science, a common rhetorical motif is to refer to the threat that information overload poses to people. In all of these fields, insiders are familiar with these motifs, while outsiders are not; motifs constitute a kind of high-level jargon.
The four research teams in the project plan to explore how their approaches to these questions, each rooted in its own professional means and philosophies, can be combined, integrated, and reused across their respective research domains: (1) Tilburg University: automated text analytics, metadata extraction and creation; (2) Utrecht University: discourse structure, discourse semantics; (3) Swedish School of Library and Information Science, University of Gothenburg/University of Borås (Sweden): indexing, visualization, user studies; (4) University of Szeged (Hungary): synchronic and diachronic aspects of network emergence. By creating a common platform for the four research groups, we aim to lay the foundations of a transdisciplinary expert body with access to extensive international cooperation networks, one that creates, investigates, and disseminates best practices for text analytics solutions applied to semantic and discourse structure in textual material from the CH and SC domains. The tasks of the proposed network implicitly address knowledge sharing between largely disconnected professional communities, namely memory institutions (libraries, archives, museums) as custodians of cultural heritage; linguists and information scientists, who currently cooperate only to a limited extent on the safeguarding of CH; and disciplinary experts in text-based SC and CH, as well as in discourse analysis. The teams bring with them expertise in three methodological dimensions of the CH and SC domains, namely (i) linguistic analysis, (ii) semantically motivated annotation, and (iii) user studies.