Paper 5

DiNer – On Building Multilingual Disease-News Profiler

Authors: Sajal Rustagi, Dhaval Patel

Volume 43 (2020)

Abstract

A disease-news profiler gathers online news articles that contain information related to diseases. The need for such a profiler arises in epidemic intelligence, where it acts as an information system for diseases. Health agencies and researchers can use it to track an epidemic or to develop a knowledge base for diseases. Most existing profiling techniques target specific languages such as English, Arabic, Chinese, Spanish or Russian, and largely ignore many Asian and resource-poor languages. A multilingual disease-news profiler offers clear advantages in coverage, timeliness, quality and information enrichment. In this paper we propose a novel system, DiNer, for filtering and indexing disease news. We have developed a language-agnostic, low-resource filtering technique that uses a Support Vector Machine (SVM) based classifier to identify instances of disease news in any given news corpus. We describe our novel approach to feature engineering and the development of a disease-related corpus for training the SVM classifier. We have tested the filtering module on four languages: English, Hindi, Punjabi and Gujarati. Our filtering technique performs significantly better than the baseline approach in terms of both F-score (>5%) and recall (>50%) across languages.

Keywords: Disease recognition, Surveillance systems, Machine learning, Classification.
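
The abstract reports its results as F-score and recall for a binary filter (disease news versus other news). As a minimal sketch of how these two measures are computed from gold and predicted labels, the Python below may help; the function name `precision_recall_f1` and the toy labels are illustrative assumptions, not the authors' evaluation code or data from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the `positive` class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels (made up): 1 = disease news, 0 = other news.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall is the measure that matters most for a surveillance filter, since a missed disease report (a false negative) is typically costlier than an irrelevant article passed through; F1 balances that against precision.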