Paper 8

Improving Cross-Document Knowledge Discovery through Content and Link Analysis of Wikipedia Knowledge

Authors: Peng Yan and Wei Jin

Volume 21 (2015)

Abstract

The Vector Space Model (VSM) has been widely used in Natural Language Processing (NLP) for representing text documents as a Bag of Words (BOW). However, only document-level statistical information is recorded (e.g., document frequency, inverse document frequency) and word semantics cannot be captured. Improvement towards understanding the meaning of words in texts is a challenging task and sufficient background knowledge may need to be incorporated to provide a better semantic representation of texts. In this paper, we present a text mining model that can automatically discover semantic relationships between concepts across multiple documents (where the traditional search paradigm such as search engines cannot help much) and effectively integrate various evidences mined from Wikipedia knowledge. We propose this integration may effectively complement existing information contained in text corpus and facilitate the construction of a more comprehensive representation and retrieval framework. The experimental results demonstrate the search performance has been significantly enhanced against two competitive baselines.