NLP for Positive Impact Workshop, ACL 2025

Health Sentinel

An AI pipeline for real-time disease outbreak detection, deployed with India's National Centre for Disease Control across 13 languages.

400M+ articles processed (from 500K+ unique domains)
95K+ health events identified (across India since Apr 2022)
3,500+ outbreaks verified (shortlisted by NCDC experts)
13 Indian languages (including 12 Indic languages)
96% of published events in 2024 (extracted by Health Sentinel)

The Problem

Why traditional disease surveillance falls short

Traditional disease surveillance is systematic but slow. It collects data from healthcare workers, hospitals, and public health facilities through official reporting chains. Reporting delays are common, and under-reporting in rural and remote areas means many outbreaks are caught late or missed entirely.

Event-based surveillance offers an earlier signal. It monitors informal sources like news articles, social media posts, and online media for unusual health events. But India's media landscape spans dozens of languages, hundreds of regional outlets, and millions of daily articles. No human team can cover it at scale.

Before Health Sentinel, India's NCDC Media Scanning and Verification Cell relied on public health experts reading through articles by hand. This approach was limited in scope, slow to respond, and could not realistically monitor more than a fraction of available sources for a country of 1.4 billion people.

The task was to build a fully automated, multilingual pipeline capable of scanning the entire public internet for outbreak signals in real time, extracting structured health events from noisy text, and surfacing verified alerts for expert review within hours of a signal first appearing.

System Design

An 8-stage detection pipeline

Health Sentinel chains ML and rule-based components sequentially, each stage narrowing the data volume while preserving recall.

01
Ingestion | 3 sources

Data Ingestion

Common Crawl, Google Alerts configured across 13 Indian languages, and custom crawlers continuously fetch articles from 500K+ unique domains.

02
Processing | 96% F1

Article Classification

Fine-tuned BERT-family classifiers filter out the ~87% of articles that carry no health event information. A separate best-performing model is selected for each language.

03
Processing | IndicTrans2

Translation

IndicTrans2 translates non-English articles to English, enabling consistent downstream ML across all 13 supported languages.

04
Processing | 122 diseases

Disease & Location Filtering

BioBERT NER combined with curated keyword lists confirms each article references at least one of 122 monitored diseases and an Indian location.

05
Processing | F1: 0.68

Event Extraction

GPT-4o-Mini extracts structured events via few-shot prompting, capturing Disease, Location, Incident, Incident Type, and Count from article text.

06
Processing | Standardised

Disease & Location Mapping

An LLM maps extracted names to standardised disease codes and administrative locations (state, district, sub-district).

07
Processing | ARI: 0.89

Clustering

Sentence transformer embeddings, paired with a depth-first search (DFS) over the resulting similarity graph, group duplicate reports of the same event into unique clusters. The average Adjusted Rand Index (ARI) is 0.89.

08
Human Review | NCDC MSVC

Human Review

NCDC epidemiologists at the Media Scanning and Verification Cell review clustered events against on-ground indicators before publishing.
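The idea of chaining stages that each narrow the data volume can be illustrated with a minimal sketch. It is not the deployed implementation: the keyword lists below are hypothetical and heavily truncated, the `Article` shape and the function names are assumptions, and the real stage 04 combines BioBERT NER with curated lists rather than plain keyword matching.

```python
import re
from typing import Callable

# Hypothetical, truncated keyword lists (assumptions for illustration only).
# The deployed system monitors 122 diseases and also uses BioBERT NER.
DISEASE_KEYWORDS = {"dengue", "chikungunya", "cholera", "malaria"}
LOCATION_KEYWORDS = {"assam", "manipur", "mizoram", "kerala"}

Article = dict  # assumed shape: {"id": str, "text": str}
Stage = Callable[[list], list]


def mentions_any(text: str, keywords: set) -> bool:
    """Case-insensitive whole-word check; handles single-word keywords only."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return not tokens.isdisjoint(keywords)


def disease_and_location_filter(articles: list) -> list:
    """Keep articles that mention both a monitored disease and a location."""
    return [
        article
        for article in articles
        if mentions_any(article["text"], DISEASE_KEYWORDS)
        and mentions_any(article["text"], LOCATION_KEYWORDS)
    ]


def run_pipeline(articles: list, stages: list) -> tuple:
    """Apply stages in order and record how many articles survive each one."""
    counts = [len(articles)]
    for stage in stages:
        articles = stage(articles)
        counts.append(len(articles))
    return articles, counts


sample_articles = [
    {"id": "a", "text": "Dengue cases rise in Assam"},
    {"id": "b", "text": "Cricket final held in Assam"},
    {"id": "c", "text": "Dengue outbreak reported in Brazil"},
]
kept, counts = run_pipeline(sample_articles, [disease_and_location_filter])
```

Here only article "a" survives, and `counts` records the volume before and after the stage. Additional stages (classification, translation, extraction) would slot into the same list as further callables.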

Core Capability

Structured event extraction from noisy text

A single article can report multiple distinct health events across diseases, locations, and incident types. The pipeline identifies and separates each one.

Input: News Article

"Two die of dengue in Mizoram, 1 in Manipur. Meanwhile fortysix cases of Chikungunya have been detected so far in Assam taking the total number of infections to 70"

Multi-disease | Multi-location | Mixed incident types

Output: Extracted Events (GPT-4o-Mini)

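Disease | Location | Incident | Incident Type | Count
Dengue | Mizoram | Death | New | 2
Dengue | Manipur | Death | New | 1
Chikungunya | Assam | Case | New | 46
Chikungunya | Assam | Case | Total | 70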

Each extracted event is a structured record with five fields: Disease, Location, Incident (case or death), Incident Type (new or total), and Count. GPT-4o-Mini handles both numbered and numberless events; for the latter, it uses NLI-style hypothesis testing to surface implicit signals.
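One way to picture the five-field record is as a small typed structure. The sketch below is an assumption-laden illustration: the class name `HealthEvent`, the JSON key names, and the idea that the model returns a JSON array are all hypothetical, since the page does not specify the output format.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthEvent:
    disease: str
    location: str
    incident: str          # "case" or "death"
    incident_type: str     # "new" or "total"
    count: Optional[int]   # None for numberless events


def parse_events(raw_json: str) -> list:
    """Parse a JSON array of event objects (an assumed LLM output format)."""
    events = []
    for item in json.loads(raw_json):
        incident = item["incident"].lower()
        incident_type = item["incident_type"].lower()
        # Skip records that fall outside the two allowed value sets.
        if incident not in {"case", "death"}:
            continue
        if incident_type not in {"new", "total"}:
            continue
        count = item.get("count")
        events.append(
            HealthEvent(
                disease=item["disease"],
                location=item["location"],
                incident=incident,
                incident_type=incident_type,
                count=int(count) if count is not None else None,
            )
        )
    return events


# Hypothetical model output mirroring the example article above.
EXAMPLE_OUTPUT = """[
  {"disease": "Dengue", "location": "Mizoram", "incident": "Death", "incident_type": "New", "count": 2},
  {"disease": "Dengue", "location": "Manipur", "incident": "Death", "incident_type": "New", "count": 1},
  {"disease": "Chikungunya", "location": "Assam", "incident": "Case", "incident_type": "New", "count": 46},
  {"disease": "Chikungunya", "location": "Assam", "incident": "Case", "incident_type": "Total", "count": 70}
]"""

events = parse_events(EXAMPLE_OUTPUT)
```

Making the count optional leaves room for numberless events, and validating the two categorical fields at parse time keeps malformed records out of the later mapping and clustering stages.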

Evaluation

LLMs vs. traditional extraction methods

We evaluated five approaches across precision, recall, F1, exact match accuracy, and detection rate. LLMs significantly outperform QA+NLI methods.

Model | F1 | Exact Match | Detection Rate
QA + NLI (baseline) | 0.40 | 0.37 | 0.70
Llama 3.1-8b | 0.50 | 0.43 | 0.95
Gemma2-9b | 0.52 | 0.45 | 0.96
GPT-3.5-Turbo | 0.61 | 0.54 | 0.95
GPT-4o-Mini (best) | 0.68 | 0.61 | 0.92

Source: Table 2, Health Sentinel (ACL 2025)

GPT-4o-Mini achieves the best overall F1 of 0.68, correctly filtering international events, identifying diseases missed by rule-based methods, and handling implicit health signals without numerical data. Open-source models Llama 3.1-8b and Gemma2-9b show competitive performance, suggesting viable paths to cost reduction.
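For readers unfamiliar with how an F1 score can be computed over extracted events, here is a minimal sketch. It assumes events are compared as whole tuples with strict equality, which may differ from the matching criteria used in the paper; the function name and the sample data are illustrative only.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Set-based precision, recall, and F1 over event tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold events for the example article: (disease, location, incident, type, count).
gold_events = {
    ("Dengue", "Mizoram", "death", "new", 2),
    ("Dengue", "Manipur", "death", "new", 1),
    ("Chikungunya", "Assam", "case", "new", 46),
    ("Chikungunya", "Assam", "case", "total", 70),
}

# A hypothetical system output: two correct events and one with a wrong count.
predicted_events = {
    ("Dengue", "Mizoram", "death", "new", 2),
    ("Dengue", "Manipur", "death", "new", 1),
    ("Chikungunya", "Assam", "case", "new", 70),
}

precision, recall, f1 = precision_recall_f1(predicted_events, gold_events)
```

With two of three predictions correct and two of four gold events recovered, precision is 2/3, recall is 1/2, and F1 is 4/7 (about 0.57).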

Multilingual Coverage

13 Indian languages, each with a dedicated classifier

Health Sentinel is the first disease surveillance system to support media scanning across 13 Indian languages. A best-performing model was selected per language based on validation set recall. All classifiers achieve 95-97% F1.

Language | Model | F1
English | RoBERTa-base | 97%
Hindi | MuRIL-base | 96%
Telugu | XLM-RoBERTa | 96%
Kannada | MuRIL-base | 97%
Gujarati | MuRIL-base | 96%
Tamil | MuRIL-base | 97%
Punjabi | XLM-RoBERTa | 96%
Bengali | XLM-RoBERTa | 96%
Marathi | XLM-RoBERTa | 96%
Malayalam | MuRIL-base | 95%
Oriya | XLM-RoBERTa | 95%
Assamese | MuRIL-base | 95%
Urdu | XLM-RoBERTa | 96%

Non-English articles are translated to English via IndicTrans2 before event extraction, preserving named entities like disease names and locations.

Deduplication

Clustering to isolate unique events

A single outbreak is often reported by dozens of media outlets simultaneously. Without deduplication, the same event would flood the review queue as hundreds of separate alerts.

Health Sentinel uses sentence transformer embeddings to compute pairwise similarity between all articles from a given day. A curated rule set converts similarity scores to a binary match matrix, and Depth First Search over the resulting graph identifies disjoint clusters of unique events.
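The match-matrix-plus-DFS step can be sketched as follows. This is an illustration under stated assumptions: the similarity matrix is taken as precomputed (the sentence transformer is not reproduced here), and a single hypothetical `threshold` stands in for the curated rule set that the real system uses to decide matches.

```python
def cluster_by_similarity(similarity: list, threshold: float = 0.8) -> list:
    """Group items into connected components of the thresholded match graph."""
    n = len(similarity)
    # Binary match matrix. Assumption: one fixed threshold replaces the
    # curated rule set described in the text.
    matches = [
        [i != j and similarity[i][j] >= threshold for j in range(n)]
        for i in range(n)
    ]
    visited = [False] * n
    clusters = []
    for start in range(n):
        if visited[start]:
            continue
        # Iterative depth-first search from an unvisited article.
        stack = [start]
        visited[start] = True
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in range(n):
                if matches[node][neighbour] and not visited[neighbour]:
                    visited[neighbour] = True
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Toy pairwise similarities for five articles (symmetric, made-up values).
toy_similarity = [
    [1.00, 0.90, 0.40, 0.10, 0.05],
    [0.90, 1.00, 0.85, 0.10, 0.05],
    [0.40, 0.85, 1.00, 0.10, 0.05],
    [0.10, 0.10, 0.10, 1.00, 0.95],
    [0.05, 0.05, 0.05, 0.95, 1.00],
]

clusters = cluster_by_similarity(toy_similarity)
# clusters == [[0, 1, 2], [3, 4]]
```

Note that articles 0 and 2 land in the same cluster even though their direct similarity (0.40) is below the threshold, because both match article 1. Connected-component clustering is transitive in this way, which helps when reports of one event are worded quite differently.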

Average ARI 0.89
Average NMI 0.98
Average V-Measure 0.98
Evaluation dates 7
Events evaluated 869

Source: Table 4, Health Sentinel (ACL 2025)
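For context on the headline metric, the Adjusted Rand Index compares a predicted clustering against reference labels while correcting for chance agreement. The sketch below implements the standard pair-counting formula in plain Python; it is a generic illustration, not the evaluation code used for the reported numbers, and the function name is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true: list, labels_pred: list) -> float:
    """Adjusted Rand Index from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the clusterings trivially agree

    # Pairs that share a cluster in both labelings, in the reference
    # labeling, and in the predicted labeling respectively.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything together
    return (sum_cells - expected) / (maximum - expected)


# Same partition under different label names scores 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Splitting one reference cluster in two lowers the score.
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

A score of 1.0 means the two clusterings are identical up to label names, values near 0 indicate agreement no better than chance, and negative values are possible for worse-than-chance agreement.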

Deployment

Live at India's NCDC since April 2022

Health Sentinel launched with English and Hindi support in April 2022 and has expanded to 13 languages over two years. It runs continuously, processing around 375,000 articles per day and identifying approximately 150 unique health events daily.

To date, Health Sentinel has processed over 300 million articles from 500,000+ unique domains and identified over 95,000 unusual health events. Of these, more than 3,500 were shortlisted by public health experts at NCDC as potential outbreaks warranting further investigation.

The impact on surveillance capability has been significant. The number of published events rose by 150% compared with the years that relied solely on manual surveillance. In 2024, 96% of all health events published by the surveillance system were extracted by Health Sentinel, with only 4% found through manual media scanning.

The number of media sources covered has also grown substantially, driven by automated scanning and multilingual support across 13 languages.

150% Increase in published events vs. human-only surveillance
96% Of 2024 surveillance events extracted automatically by Health Sentinel
375K Articles processed every day, identifying ~150 unique health events

Read the full paper

Technical details on model architecture, training, evaluation benchmarks, and deployment are available in the published paper at ACL 2025.
