The TextCoop Project
Objectives
The TextCoop project aims to investigate in depth the syntax and semantics of a large variety of procedural texts, from professional texts (maintenance, security, didactics, etc.) to texts for the general public (cooking recipes, health guidance, video game solutions, etc.). The project has a strong conceptual and linguistic dimension.
The first step of the project, which is now fully realized via an ANR-RNTL project, aims at annotating the different discourse parts of procedures, most notably: titles (goals) and title hierarchy, instructions, instructional compounds, prerequisites, warnings, advice, and a large variety of explanation devices (examples, elaborations, reformulations, etc.). A prototype, fully implemented in Perl, has been built and evaluated. It is now being developed as software based on Java tools, with the support of the Région Midi-Pyrénées. Some additional work has been carried out in conjunction with EADS.
The second step of the project is the investigation of advanced uses of this environment, among which:
- Customization of the software for concrete applications (in industry or for the general public); in principle, few resources need to be integrated,
- Development of patterns for various languages (English, Spanish, Asian languages),
- Development of oral dialogue and multimedia facilities, e.g. for help desks,
- Procedural text annotation and enrichment via the definition of additional patterns. This may include adding annotations to make tasks more precise, or analyzing argument structures (e.g. tools, durations of instructions) as well as temporal and conditional expressions,
- Development of relatively feasible tasks: identifying prerequisites from instructions (tools, consumables, etc.), identifying the number of required participants, the approximate duration of the task (and possible idle periods), etc. This is achieved mainly through lexical inference and a close analysis of instructions and their connectors,
- Development of editorial tools for improving the writing quality of procedures,
- Development of advanced functions, among which: coherence and cohesion detection, procedure fusion, procedure simplification, constructing larger procedures out of simpler ones, zooming in on difficult instructions, development of additional warnings and advice, analysis of the difficulty and the risks of a task, etc. Most of these tools involve both linguistic and reasoning aspects, and must take domain specificities into account.
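As an illustration of the "lexical inference" idea mentioned in the list above, the following Python sketch derives a list of required tools and an approximate duration from a few instructions. It is only a toy under stated assumptions: the lexicon `VERB_REQUIREMENTS`, the duration pattern and both function names are hypothetical and are not taken from the TextCoop software.

```python
import re

# Hypothetical toy lexicon: a verb found in an instruction implies
# the tools or consumables listed for it (not TextCoop data).
VERB_REQUIREMENTS = {
    "screw": {"screwdriver"},
    "drill": {"drill"},
    "glue": {"glue"},
    "paint": {"paint", "paintbrush"},
    "sand": {"sandpaper"},
}

# Explicit durations such as "for 10 minutes" or "for 2 hours".
DURATION = re.compile(r"\bfor (\d+) (minute|hour)s?\b", re.IGNORECASE)


def infer_prerequisites(instructions):
    """Return the sorted tools/consumables implied by the verbs used."""
    required = set()
    for instruction in instructions:
        for word in re.findall(r"[a-z]+", instruction.lower()):
            required |= VERB_REQUIREMENTS.get(word, set())
    return sorted(required)


def estimate_duration_minutes(instructions):
    """Sum the explicit durations found in the instructions, in minutes."""
    total = 0
    for instruction in instructions:
        for amount, unit in DURATION.findall(instruction):
            total += int(amount) * (60 if unit.lower() == "hour" else 1)
    return total


procedure = [
    "Sand the surface for 10 minutes.",
    "Paint the door and let it dry for 2 hours.",
    "Screw the handle back on.",
]
print(infer_prerequisites(procedure))
print(estimate_duration_minutes(procedure))
```

A real system would need morphological analysis and word-sense information (here "paint" the verb and "paint" the noun are not distinguished), which is why the project describes these tasks as combining linguistic analysis and reasoning.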
The TextCoop Environment and Software
The TextCoop environment and software are developed by the ILPL group at IRIT as a result of an ANR-RNTL project. TextCoop develops annotation technologies for any natural language document (from the Web or from textual databases) based on patterns or grammars. These grammars or patterns may include typographic, morphological, syntactic and semantic factors. It also introduces indexing techniques for procedural texts, for information retrieval or question answering. Annotations are defined at the pattern or rule level and can be customized for dedicated tasks. TextCoop focuses on procedural text annotation; it can be used to enrich such documents. At the moment, it correctly identifies and annotates:
- Relevant titles, and title hierarchy (crucial in Web texts),
- Instructions and instructional compounds,
- Prerequisites of various forms,
- Warnings and advice, goal expressions, temporal marks,
- Various forms of explanation: illustrations, reformulations, definitions, etc.
This software has been developed from a corpus of 8000 procedural texts ranging over 24 domains, including maintenance and do-it-yourself. At the moment, our tests, carried out on 1200 texts, show a precision of 0.97 (the factor that has been favored) and a recall of 0.80. The prototype is implemented in Perl and can currently process 300 MB of text per hour (including the cleaning of Web texts). The portability of the system to various domains is good since the patterns are essentially based on general-purpose criteria.
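To make the idea of pattern-based annotation more concrete, here is a minimal Python sketch that tags a sentence with one of the discourse categories listed above. The patterns, labels and the `annotate` function are hypothetical simplifications written for this illustration; they are not the TextCoop patterns, which also rely on typographic, morphological and semantic factors.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical patterns, tried in order; the first one that matches wins.
PATTERNS = [
    ("warning", re.compile(r"^(warning|caution|never|do not|be careful)\b", re.I)),
    ("advice", re.compile(r"^(we recommend|it is advisable|preferably|if possible)\b", re.I)),
    ("prerequisite", re.compile(r"^(you will need|you need|required|ingredients)\b", re.I)),
    ("instruction", re.compile(r"^(then |next |first |finally )?(\w+) (the|a|an|your)\b", re.I)),
]


def annotate(sentence):
    """Wrap the sentence in an XML-like tag, or leave it untagged."""
    text = sentence.strip()
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return f"<{label}>{escape(text)}</{label}>"
    # No pattern matched: leave the sentence unannotated rather than guess.
    return escape(text)


for line in [
    "You will need a screwdriver.",
    "Do not touch the blade.",
    "Remove the cover.",
    "This step is optional.",
]:
    print(annotate(line))
```

Leaving a sentence untagged when no pattern matches is one simple way to favor precision over recall, in the spirit of the evaluation figures reported above.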

Illustration: TextCoop architecture, integration into UIMA standards/platform
From a software point of view, the TextCoop prototype is now evolving into a software component implemented in Java (with the support of the AVAMIP and the Région Midi-Pyrénées). It is being designed as follows (see the architecture diagram):
- An engine, based on the well-established JFLEX and JCUP Java tools, forms the system kernel, with additional parameters to manage rule or pattern priorities, rule or pattern selection (for customization or for producing views), etc.
- The input documents can in principle be any type of Web page (cleaned by a parameterized cleaning tool) or any kind of XML document from textual databases. The output is the original document augmented with the required annotations,
- TextCoop is designed to accept as input modules a large variety of lexicon and ontology formats (including OWL and variants) when required by the patterns or grammars. These resources are automatically compiled into JFLEX format.
- It will have an administrator interface and a user interface so that the system parameters can be managed and extended, and so that new data (rules, patterns, lexical entries, ontological data, etc.) can be added and tested in a principled and reliable way. Similarly, a non-regression test bed is being introduced to facilitate evaluation and control, for example in development or customization contexts. The rule format, close to logical expressions, will allow for the integration of inference rules (common-sense or based on a domain ontology).
To allow for easy integration into large industrial systems, it will be embedded into the UIMA framework and its I/O parameters will be made UIMA compliant.
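The role of rule priorities and rule selection in the engine can be sketched as follows. This is a hedged Python illustration only: the actual kernel relies on JFLEX and JCUP, and the `Rule` class, the three rules and the `apply_rules` function below are hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str      # identifier of the rule
    label: str     # annotation produced when the rule fires
    pattern: str   # regular expression tested against the sentence
    priority: int  # the highest priority wins when several rules match


# Hypothetical rule set, for illustration only.
RULES = [
    Rule("generic-imperative", "instruction", r"^(\w+ ){1,3}(the|a|an)\b", 1),
    Rule("need-statement", "prerequisite", r"^you (will )?need\b", 5),
    Rule("negative-imperative", "warning", r"^(do not|never)\b", 10),
]


def apply_rules(sentence, rules=RULES, enabled_labels=None):
    """Return the label of the highest-priority matching rule, or None.

    enabled_labels restricts the rules to a subset of labels, which
    mimics rule selection for a customized view of the annotations.
    """
    candidates = [
        rule for rule in rules
        if (enabled_labels is None or rule.label in enabled_labels)
        and re.search(rule.pattern, sentence, re.IGNORECASE)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda rule: rule.priority).label


print(apply_rules("Never open the casing."))
print(apply_rules("Never open the casing.", enabled_labels={"instruction"}))
```

In this sketch "Never open the casing." matches both the generic imperative rule and the negative imperative rule; the priority decides between them, while `enabled_labels` shows how a view restricted to instructions would see the same sentence.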
Schedule
- May 2005 - Sept 2010
Support
- ANR-RNTL
Partners
- Sinequa, LIPN
- EADS
Software
Some components will be freeware, while others may be commercial.