Health Sentinel
An AI pipeline for real-time disease outbreak detection, deployed with India's National Centre for Disease Control across 13 languages.
The Problem
Why traditional disease surveillance falls short
Traditional disease surveillance is systematic but slow. It collects data from healthcare workers, hospitals, and public health facilities through official reporting chains. Reporting delays are common, and under-reporting in rural and remote areas means many outbreaks are caught late or missed entirely.
Event-based surveillance offers an earlier signal. It monitors informal sources like news articles, social media posts, and online media for unusual health events. But India's media landscape spans dozens of languages, hundreds of regional outlets, and millions of daily articles. No human team can cover it at scale.
Before Health Sentinel, India's NCDC Media Scanning and Verification Cell relied on public health experts reading through articles by hand. This approach was limited in scope, slow to respond, and could not realistically monitor more than a fraction of available sources for a country of 1.4 billion people.
The task was to build a fully automated, multilingual pipeline capable of scanning the entire public internet for outbreak signals in real time, extracting structured health events from noisy text, and surfacing verified alerts for expert review within hours of a signal first appearing.
System Design
An 8-stage detection pipeline
Health Sentinel chains ML and rule-based components sequentially, each stage narrowing the data volume while preserving recall.
Data Ingestion
Common Crawl, Google Alerts configured across 13 Indian languages, and custom crawlers continuously fetch articles from 500K+ unique domains.
Article Classification
Fine-tuned BERT-family classifiers filter out ~87% of articles that carry no health event information. A separate best-performing model is selected for each language.
Translation
IndicTrans2 translates non-English articles to English, enabling consistent downstream ML across all 13 supported languages.
Disease & Location Filtering
BioBERT NER combined with curated keyword lists confirms each article references at least one of 122 monitored diseases and an Indian location.
Event Extraction
GPT-4o-Mini extracts structured events via few-shot prompting, capturing Disease, Location, Incident, Incident Type, and Count from article text.
Disease & Location Mapping
An LLM maps extracted names to standardised disease codes and administrative locations (state, district, sub-district).
Clustering
Sentence transformer embeddings paired with DFS graph search group duplicate reports of the same event into unique clusters, reaching an average Adjusted Rand Index (ARI) of 0.89.
Human Review
NCDC epidemiologists at the Media Scanning and Verification Cell review clustered events against on-ground indicators before publishing.
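The narrowing effect of chaining stages can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Article` type, the `run_filters` helper, and the toy word lists are hypothetical stand-ins for the fine-tuned classifiers, BioBERT NER, and curated keyword lists used in the deployed system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Article:
    text: str
    language: str = "en"


# Toy word lists (assumptions). The real system uses trained models and
# curated lists covering 122 diseases and Indian locations.
HEALTH_TERMS = {"cases", "outbreak", "die", "deaths", "infections"}
DISEASES = {"dengue", "chikungunya", "cholera"}
LOCATIONS = {"assam", "mizoram", "manipur"}


def _words(article: Article) -> set:
    """Lower-case the text and strip trailing punctuation from each token."""
    return {w.strip(".,").lower() for w in article.text.split()}


def is_health_event(article: Article) -> bool:
    """Stand-in for the article classification stage."""
    return bool(_words(article) & HEALTH_TERMS)


def mentions_disease_and_location(article: Article) -> bool:
    """Stand-in for the disease and location filtering stage."""
    words = _words(article)
    return bool(words & DISEASES) and bool(words & LOCATIONS)


Stage = Callable[[Article], bool]


def run_filters(
    articles: Iterable[Article], stages: List[Stage]
) -> Tuple[List[Article], List[int]]:
    """Apply filter stages in order and record how many articles survive each."""
    remaining = list(articles)
    counts = [len(remaining)]
    for stage in stages:
        remaining = [a for a in remaining if stage(a)]
        counts.append(len(remaining))
    return remaining, counts


articles = [
    Article("Two die of dengue in Mizoram, 1 in Manipur."),
    Article("Local cricket team wins the state final."),
    Article("Flu cases rise across Europe this winter."),
]
kept, counts = run_filters(articles, [is_health_event, mentions_disease_and_location])
print(counts)  # [3, 2, 1]
```

Each stage sees only what the previous one let through, so the expensive later stages (translation, LLM extraction) run on a small share of the ingested volume.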
Core Capability
Structured event extraction from noisy text
A single article can report multiple distinct health events across diseases, locations, and incident types. The pipeline identifies and separates each one.
Input: News Article
"Two die of dengue in Mizoram, 1 in Manipur. Meanwhile fortysix cases of Chikungunya have been detected so far in Assam taking the total number of infections to 70"
Output: Extracted Events (GPT-4o-Mini)
Each extracted event is a structured record capturing five fields: Disease, Location, Incident (case or death), Incident Type (new or total), and Number. GPT-4o-Mini handles both numbered and numberless events; for the latter, NLI-style hypothesis testing surfaces implicit signals.
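The five-field record could be represented as in the sketch below. This is a hand-written illustration, not actual model output: the `HealthEvent` class, the `parse_events` helper, and the JSON reply format are assumptions, and the four events are one plausible reading of the example article above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HealthEvent:
    disease: str
    location: str
    incident: str           # "case" or "death"
    incident_type: str      # "new" or "total"
    number: Optional[int]   # None for numberless events


# One plausible reading of the example article (written by hand).
example_events = [
    HealthEvent("Dengue", "Mizoram", "death", "new", 2),
    HealthEvent("Dengue", "Manipur", "death", "new", 1),
    HealthEvent("Chikungunya", "Assam", "case", "new", 46),
    HealthEvent("Chikungunya", "Assam", "case", "total", 70),
]


def parse_events(raw_json: str) -> List[HealthEvent]:
    """Validate a JSON reply (assumed format) and build HealthEvent records."""
    events = []
    for item in json.loads(raw_json):
        if item["incident"] not in {"case", "death"}:
            raise ValueError(f"unexpected incident: {item['incident']!r}")
        if item["incident_type"] not in {"new", "total"}:
            raise ValueError(f"unexpected incident type: {item['incident_type']!r}")
        number = item.get("number")
        events.append(
            HealthEvent(
                disease=item["disease"],
                location=item["location"],
                incident=item["incident"],
                incident_type=item["incident_type"],
                number=None if number is None else int(number),
            )
        )
    return events


# Round trip: serialise the example and parse it back.
reply = json.dumps([asdict(e) for e in example_events])
parsed = parse_events(reply)
```

Validating the reply before it enters the pipeline keeps malformed or out-of-vocabulary fields from reaching the mapping and clustering stages.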
Evaluation
LLMs vs. traditional extraction methods
We evaluated five approaches across precision, recall, F1, exact match accuracy, and detection rate. LLMs significantly outperform QA+NLI methods.
Source: Table 2, Health Sentinel (ACL 2025)
GPT-4o-Mini achieves the best overall F1 of 0.68, correctly filtering international events, identifying diseases missed by rule-based methods, and handling implicit health signals without numerical data. Open-source models Llama 3.1-8b and Gemma2-9b show competitive performance, suggesting viable paths to cost reduction.
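For readers less familiar with the metrics, the sketch below shows how precision, recall, and F1 can be computed over sets of extracted events. It assumes an event counts as correct only when all five fields match exactly; the matching criteria used in the paper may differ, and the gold and predicted events here are invented for illustration.

```python
from typing import Set, Tuple

Event = Tuple[str, str, str, str, int]


def precision_recall_f1(predicted: Set[Event], gold: Set[Event]) -> Tuple[float, float, float]:
    """Score predictions against gold events using exact tuple matching (an assumption)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: two of three predictions match, two of three gold events are found.
gold = {
    ("dengue", "mizoram", "death", "new", 2),
    ("dengue", "manipur", "death", "new", 1),
    ("chikungunya", "assam", "case", "total", 70),
}
predicted = {
    ("dengue", "mizoram", "death", "new", 2),
    ("chikungunya", "assam", "case", "total", 70),
    ("chikungunya", "assam", "case", "new", 70),  # wrong incident type
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```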
Multilingual Coverage
13 Indian languages, each with a dedicated classifier
Health Sentinel is the first disease surveillance system to support media scanning across 13 Indian languages. A best-performing model was selected per language based on validation set recall. All classifiers achieve 95-97% F1.
English: RoBERTa-base
Hindi: MuRIL-base
Telugu: XLM-RoBERTa
Kannada: MuRIL-base
Gujarati: MuRIL-base
Tamil: MuRIL-base
Punjabi: XLM-RoBERTa
Bengali: XLM-RoBERTa
Marathi: XLM-RoBERTa
Malayalam: MuRIL-base
Oriya: XLM-RoBERTa
Assamese: MuRIL-base
Urdu: XLM-RoBERTa
Non-English articles are translated to English via IndicTrans2 before event extraction, preserving named entities like disease names and locations.
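Choosing a classifier per language by validation set recall, as described at the top of this section, amounts to an argmax per language. The sketch below shows this with placeholder scores: the recall values are hypothetical and are not results from the paper, and `select_best_models` is an illustrative name.

```python
from typing import Dict

# Hypothetical validation recalls, for illustration only.
validation_recall: Dict[str, Dict[str, float]] = {
    "Hindi": {"MuRIL-base": 0.97, "XLM-RoBERTa": 0.96},
    "Telugu": {"MuRIL-base": 0.95, "XLM-RoBERTa": 0.96},
}


def select_best_models(recall_by_language: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each language, the candidate model with the highest recall."""
    return {
        language: max(scores, key=scores.get)
        for language, scores in recall_by_language.items()
    }


best = select_best_models(validation_recall)
print(best)  # {'Hindi': 'MuRIL-base', 'Telugu': 'XLM-RoBERTa'}
```

Selecting on recall rather than accuracy fits a filtering stage, where a missed health article is lost for good while a false positive can still be discarded downstream.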
Deduplication
Clustering to isolate unique events
A single outbreak is often reported by dozens of media outlets simultaneously. Without deduplication, the same event would flood the review queue as hundreds of separate alerts.
Health Sentinel uses sentence transformer embeddings to compute pairwise similarity between all articles from a given day. A curated rule set converts similarity scores to a binary match matrix, and Depth First Search over the resulting graph identifies disjoint clusters of unique events.
Source: Table 4, Health Sentinel (ACL 2025)
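The clustering procedure described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single cosine-similarity threshold stands in for the curated rule set, the three-dimensional vectors stand in for sentence transformer embeddings, and the `cosine` and `cluster` names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(embeddings: List[Sequence[float]], threshold: float = 0.8) -> List[List[int]]:
    """Group article indices into connected components of the match graph."""
    n = len(embeddings)
    # Binary match matrix stored as adjacency lists. The fixed threshold is an
    # assumption standing in for the curated rule set.
    neighbours = [
        [j for j in range(n) if j != i and cosine(embeddings[i], embeddings[j]) >= threshold]
        for i in range(n)
    ]
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited article.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Toy embeddings: articles 0-1 and 2-3 report the same events, article 4 is unrelated.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.2],
    [0.0, 0.0, 1.0],
]
print(cluster(embeddings))  # [[0, 1], [2, 3], [4]]
```

Because clusters are connected components, two articles can land in the same cluster without matching each other directly, as long as a chain of matching articles links them.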
Deployment
Live at India's NCDC since April 2022
Health Sentinel launched with English and Hindi support in April 2022 and has expanded to 13 languages over two years. It runs continuously, processing around 375,000 articles per day and identifying approximately 150 unique health events daily.
To date, Health Sentinel has processed over 300 million articles from 500,000+ unique domains and identified over 95,000 unusual health events. Of these, more than 3,500 were shortlisted by public health experts at NCDC as potential outbreaks warranting further investigation.
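A back-of-the-envelope calculation puts these figures in proportion. The sketch below uses only numbers quoted on this page; it assumes the ~87% classification filter applies to the same daily volume, and it treats "over" and "more than" figures as point values, so the results are rough estimates.

```python
# Figures quoted on this page, treated as approximate point values.
articles_per_day = 375_000
events_per_day = 150
total_events = 95_000
shortlisted = 3_500
filtered_share = 0.87  # share of articles removed by the classification stage

# Articles left per day after classification (assumes the 87% applies to this volume).
after_classification = round(articles_per_day * (1 - filtered_share))

# Articles ingested for every unique health event identified.
articles_per_event = articles_per_day / events_per_day

# Share of identified events that experts shortlisted as potential outbreaks.
shortlist_rate = shortlisted / total_events

print(after_classification)      # 48750
print(articles_per_event)        # 2500.0
print(f"{shortlist_rate:.1%}")   # 3.7%
```

On these rough numbers, about one unique event emerges per 2,500 ingested articles, and fewer than 4 in 100 events reach the shortlist.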
The impact on surveillance capability has been significant. The number of published events rose by 150% compared to years that relied on manual surveillance alone. In 2024, 96% of all health events published by the surveillance system were extracted by Health Sentinel, with only 4% found through manual media scanning.
The number of media sources covered has grown substantially thanks to automated scanning and multilingual support across 13 languages.
Read the full paper
Technical details on model architecture, training, evaluation benchmarks, and deployment are available in the published paper at ACL 2025.