Paper 5 – TLDKS Journal

The HiLeX System for Semantic Information Extraction

Authors: Marco Manna, Ermelinda Oro, Massimo Ruffolo, Mario Alviano, and Nicola Leone

Volume 5 (2012)

Abstract

The explosive growth and popularity of the Web has resulted in a huge amount of digital information sources on the Internet. Unfortunately, such sources only manage data, rather than the knowledge they carry. Recognizing, extracting, and structuring relevant information according to their semantics is a crucial task. Several approaches in the field of Information Extraction (IE) have been proposed to support the translation of semi-structured/unstructured documents into structured data or knowledge. Most of them have a high precision but, since they are mainly syntactic, they often have a low recall, are dependent on the document format, and ignore the semantics of information they extract. In this paper, we describe a new approach for semantic information extraction that could represent the basis for automatically extracting highly structured data from unstructured web sources without any undesirable trade-off between precision and recall. In short, the approach (i) is ontology driven, (ii) is based on a unified representation of documents, (iii) integrates existing IE techniques, (iv) implements semantic regular expressions, (v) has been implemented through Answer Set Programming, (vi) is employed in real-world applications, and (vii) is having a positive feedback from business customers.