The SIG team has four components:
This field of activity concerns the description of document content and structure, using specification or elicitation mechanisms, possibly supported by annotation with metadata.
Three important reasons make the description of documents a complex task:
For these reasons, a multimedia document is considered semi-structured: its structure is a priori unknown, irregular, and has no generic definition. However, content elicitation and analysis tools make it possible to identify fundamental elements of the structure, which can be deduced from the generic structures of the collection.
The events, objects and other elements composing the document can thus be identified from the structure generated automatically by content elicitation and analysis tools.
The second part of the problem is to study mechanisms for manipulating and exploring these collections according to the profiles analysed and used, integrating the expression of preferences and nuances into the information requirements analysis step.
Our annotation work addresses the elicitation and annotation of the documents concerned. The semi-structured character of documents and the heterogeneity of their contents and formats require a preliminary treatment to homogenize the representation structure and the description of these documents. We also apply various generic rewriting, indexing and segmentation cores developed within the team. These steps principally consist of recognizing possible elements of structure and some information describing the format and the content of the document. The strong points of this approach are, first, that it does not impose a level or a vocabulary and, second, that it specifies semantic tags in a standard way without imposing a priori a particular level of granularity.
Introducing fuzzy techniques into spatial or temporal operators makes it possible to manage flexibility when exploring collections of documents, partial graphs or queries, in order to avoid empty answer sets. We have proposed a flexible query treatment model adapted to semi-structured documents and to qualitative human reasoning, which takes into account not only the content but also the structure of these documents. The originality of the implementation of the similarity functions lies in its capacity to integrate the requirements of the exploratory analysis of huge document databases, based on a multidimensional principle of description that integrates the resulting multi-structurality (logical, semantic, temporal, spatial...).
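As an illustration of how a fuzzy temporal operator can avoid empty answer sets, here is a minimal sketch. It is not the team's model: the `Segment` type, the trapezoidal membership function `fuzzy_around` and the `core` and `support` tolerances are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A hypothetical document segment with a start time in seconds."""
    label: str
    start: float


def fuzzy_around(value: float, target: float, core: float, support: float) -> float:
    """Trapezoidal membership degree of `value` in "around `target`".

    The degree is 1.0 within `core` of the target, decreases linearly to 0.0
    at distance `support`, and is 0.0 beyond that.
    """
    if not 0 <= core < support:
        raise ValueError("expected 0 <= core < support")
    distance = abs(value - target)
    if distance <= core:
        return 1.0
    if distance >= support:
        return 0.0
    return (support - distance) / (support - core)


def flexible_query(segments, target, core=2.0, support=10.0):
    """Rank segments by how well they satisfy "starts around `target`"."""
    scored = [(fuzzy_around(s.start, target, core, support), s) for s in segments]
    matches = [(degree, s) for degree, s in scored if degree > 0.0]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)


segments = [Segment("intro", 0.0), Segment("interview", 57.0), Segment("credits", 66.0)]

# A crisp query "starts at exactly 60 s" returns nothing.
crisp = [s for s in segments if s.start == 60.0]

# The fuzzy query returns the near matches, ordered by satisfaction degree.
ranked = flexible_query(segments, target=60.0)
for degree, segment in ranked:
    print(f"{segment.label}: {degree:.3f}")
```

The crisp predicate yields an empty list, whereas the fuzzy operator returns the two nearby segments with graded degrees, which is the behaviour the paragraph above describes in general terms.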
The design of decision support systems in companies is a very complex task: the suitability of decisional databases for the analytical needs of decision-makers is called into question by new challenges for decisional information systems. The main axes that we study are the specification of new decisional database designs and the definition of decisional languages adapted to decision-makers.
New multidimensional models for decisional data representation have to be formalized in order to support database evolution with regard to changes that occur both at the source level and at the user level: sources may change completely, and may even disappear or appear, while user analyses also change, and multidimensional structures must support these changes efficiently. These structural and data changes occur linearly along the timeline, but they may also occur as multiple alternative versions, which are time-stamped with variable time intervals. These needs require the definition of “multi-versionable” star schemas, which allow optional management scenarios to be analysed in advance. The models provided must support the “historisation” of both data and multidimensional schemas. Storing changes is crucial for exploiting completely known past data, approximating current data and/or partially known future data. The data “historisation” process in a multidimensional database context generates massive complex data, organised as a set of temporal materialised views. Flexible mechanisms must be defined to support multi-granular “historisation”, along with new computation algorithms over these sets of views. Finally, models must support time-variant atypical data, whose structures are sometimes not well defined, and at the same time the extraction processes must be refreshed incrementally and dynamically.
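To make the idea of time-stamped and alternative versions more concrete, the following sketch models a dimension hierarchy whose versions carry validity intervals. It is only an illustration under simple assumptions: `DimensionVersion`, `version_at`, `total_by_parent` and the sample store and region names are hypothetical and do not come from the team's models.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class DimensionVersion:
    """One time-stamped version of a dimension hierarchy (hypothetical model)."""
    name: str
    valid_from: date
    valid_to: Optional[date]      # None means "still valid"
    parent_of: Dict[str, str]     # member -> parent member

    def covers(self, day: date) -> bool:
        """True if `day` falls in the half-open interval [valid_from, valid_to)."""
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def version_at(versions, day, name=None):
    """Return the first version valid on `day`; `name` selects an alternative."""
    for version in versions:
        if version.covers(day) and (name is None or version.name == name):
            return version
    raise LookupError(f"no version covers {day}")


def total_by_parent(facts, version):
    """Aggregate (member, amount) facts to the parent level of one version."""
    totals = {}
    for member, amount in facts:
        parent = version.parent_of[member]
        totals[parent] = totals.get(parent, 0) + amount
    return totals


versions = [
    DimensionVersion("v1", date(2020, 1, 1), date(2022, 1, 1),
                     {"Toulouse": "South", "Lyon": "South"}),
    DimensionVersion("v2", date(2022, 1, 1), None,
                     {"Toulouse": "South", "Lyon": "East"}),
    # An alternative scenario that overlaps v2 in time.
    DimensionVersion("what-if", date(2022, 1, 1), None,
                     {"Toulouse": "France", "Lyon": "France"}),
]
facts = [("Toulouse", 100), ("Lyon", 50)]

past = total_by_parent(facts, version_at(versions, date(2021, 6, 1)))
current = total_by_parent(facts, version_at(versions, date(2023, 1, 1)))
scenario = total_by_parent(facts, version_at(versions, date(2023, 1, 1), name="what-if"))
```

The same facts give different aggregates depending on which version of the hierarchy is selected, whether along the timeline (`past`, `current`) or across an alternative version (`scenario`).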
A multidimensional query algebra must be defined to specify precisely the OLAP manipulation operations that apply to multidimensional structures. This algebra has to be given formal mathematical specifications in order to define a closed minimal core of OLAP operators. It must then be used to develop languages that provide a set of operations for classical multidimensional analysis activities (rotate, drill-down, roll-up...). It is also important to define new operations supporting multidimensional analyses of data defined with time-variant structures and of “multi-versionalised” data time-stamped with multi-granular times. These operations have to be based on formalisms adapted to decision-makers.
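The closure property mentioned above can be illustrated with a small sketch in which every operator takes a cube and returns a cube, so operators compose freely. The `Cube` representation (a dictionary keyed by two coordinates), the function names and the sample data are assumptions made for the example, not the algebra itself; drill-down is omitted because it needs access to the detailed data.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

# Hypothetical two-axis cube: (row member, column member) -> measure value.
Cube = Dict[Tuple[str, str], float]


def roll_up(cube: Cube, axis: int, parent: Callable[[str], str]) -> Cube:
    """Aggregate the measure to a coarser level along one axis (0 or 1)."""
    result = defaultdict(float)
    for key, value in cube.items():
        coords = list(key)
        coords[axis] = parent(coords[axis])   # map the member to its parent
        result[tuple(coords)] += value        # sum values that now collide
    return dict(result)


def rotate(cube: Cube) -> Cube:
    """Swap the two analysis axes."""
    return {(col, row): value for (row, col), value in cube.items()}


sales: Cube = {
    ("Toulouse", "2023-01"): 10.0,
    ("Toulouse", "2023-02"): 5.0,
    ("Lyon", "2023-01"): 7.0,
}

# Roll the time axis up from month to year, then rotate the result.
by_year = roll_up(sales, axis=1, parent=lambda month: month[:4])
rotated = rotate(by_year)
```

Because both functions return a value of the same `Cube` type they accept, an expression such as `rotate(roll_up(...))` is itself a valid cube, which is what a closed core of operators requires.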
The research carried out in this group aims at proposing models of textual information retrieval, where texts may be structured, semi-structured or unstructured. This work is based on the notion of context. Contextual information retrieval refers to tacit or explicit knowledge about the intentions of the user, the user’s environment and the system itself. Our hypothesis is that making certain components of the context explicit could improve system performance. Our aim is thus to define models that characterize search contexts, recognize a context when a user interacts with the system, and define methods to retrieve the documents best suited to a given context. More specifically, we integrate into our models resources (reference corpus, thesaurus, ontology), their format (XML, notion of paragraph or sentence, metadata) as well as search tasks (document novelty detection, passage retrieval, science monitoring). To evaluate our models and systems, we participate in the TREC, CLEF and INEX evaluation campaigns.
Research topics are:
This group aims at developing scalable Information Retrieval (IR) systems. Our main challenge is how to manage, effectively and efficiently, large amounts of information that is heterogeneous in both content and structure, in order to provide the most “appropriate” answer to a user’s information needs. Our investigations are based on both theory and experiment, and are aimed at developing effective and efficient retrieval approaches for all types of information. The group’s interests include areas such as:
The group takes part in a number of worldwide IR evaluation exercises, including TREC, CLEF and INEX.