Health Sentinel
An AI Pipeline for Real-time Disease Outbreak Detection
The Problem
Why real-time outbreak detection matters
India's National Centre for Disease Control (NCDC) operates a Media Scanning and Verification Cell that monitors news and social media for early signs of disease outbreaks. Before Health Sentinel, this process was largely manual — epidemiologists read through thousands of articles daily across dozens of languages, inevitably missing signals and reacting late.
The challenge: build an automated, multi-lingual pipeline that can scan millions of media sources in real time, classify potential outbreak reports with high precision, and surface verified alerts for rapid human review — all while minimising false positives that waste epidemiologists' time.
Approach
The detection pipeline
A multi-stage architecture that progressively filters and verifies outbreak signals.
Articles flow through each stage in order.
How it works
Stage-by-stage breakdown
1. Rule-based Filters
Keyword and pattern-based filters in multiple Indian languages scan 400M+ media sources, discarding clearly irrelevant content and forwarding candidate articles. This stage reduces the data volume by over 99%, making downstream ML tractable.
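As a rough illustration of this kind of first-pass filter, the sketch below matches a small keyword list against article text. The keyword lists, the `is_candidate` helper, and the sample articles are all hypothetical; the deployed filters are presumably far larger and cover many more languages and patterns.

```python
import re
import unicodedata

# Hypothetical keyword lists for illustration only; a production system
# would use much larger, curated vocabularies per language.
KEYWORDS = {
    "en": ["outbreak", "dengue", "cholera", "malaria"],
    "hi": ["प्रकोप", "डेंगू", "हैजा", "मलेरिया"],
}


def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation and case folding before matching."""
    return unicodedata.normalize("NFC", text).casefold()


# One alternation over every keyword in every language (substring match).
PATTERN = re.compile(
    "|".join(
        re.escape(normalise(word))
        for words in KEYWORDS.values()
        for word in words
    )
)


def is_candidate(text: str) -> bool:
    """Return True if the text mentions at least one keyword."""
    return PATTERN.search(normalise(text)) is not None


articles = [
    "Dengue cases rise in Pune after heavy rains",
    "शहर में डेंगू के मामले बढ़े",
    "Cricket team wins the series",
]
candidates = [a for a in articles if is_candidate(a)]
```

Substring matching is used here rather than word-boundary matching because `\b` can behave unexpectedly around combining marks in Indic scripts; that trade-off favours recall, which suits a stage whose job is only to discard clearly irrelevant content.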
2. BERT Classification
A fine-tuned multilingual BERT model classifies surviving articles by disease type and outbreak likelihood. The model handles noisy web data, short social media posts, and regional language variations across Hindi, Bengali, Tamil, Telugu, and more.
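One way the classifier's raw outputs could be turned into a forwarding decision is sketched below. It assumes a single softmax head over a hypothetical label set that includes a `no_outbreak` class, and a hypothetical threshold of 0.5; the actual model heads and thresholds are not described here.

```python
import math

# Hypothetical label set; index 0 is the "not an outbreak report" class.
LABELS = ["no_outbreak", "dengue", "cholera", "malaria"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Map classifier logits to a label, an outbreak likelihood, and a
    decision on whether to forward the article to the next stage."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Likelihood that the article reports any outbreak at all.
    outbreak_likelihood = 1.0 - probs[LABELS.index("no_outbreak")]
    forward = LABELS[best] != "no_outbreak" and outbreak_likelihood >= threshold
    return {
        "label": LABELS[best],
        "outbreak_likelihood": outbreak_likelihood,
        "forward": forward,
    }
```

Raising the threshold trades recall for precision, which matters when each forwarded article incurs a more expensive verification step downstream.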
3. LLM Verification
Large language models perform structured extraction — disease name, location, date, severity, and source credibility — and verify outbreak signals against known patterns, significantly reducing the false positive rate.
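Structured extraction of this kind usually needs a validation layer, since model output is not guaranteed to be well-formed. The sketch below assumes the LLM is prompted to return a JSON object with the five fields listed above; the field names, the `OutbreakReport` type, and the `parse_llm_output` helper are hypothetical, and no model call is shown.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical JSON keys the LLM is asked to produce.
REQUIRED_FIELDS = ("disease", "location", "date", "severity", "source_credibility")


@dataclass
class OutbreakReport:
    disease: str
    location: str
    report_date: date
    severity: str
    source_credibility: str


def parse_llm_output(raw: str) -> Optional[OutbreakReport]:
    """Return a report if `raw` is a JSON object with every required
    field present and a valid ISO date; otherwise return None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(not data.get(key) for key in REQUIRED_FIELDS):
        return None
    try:
        report_date = date.fromisoformat(data["date"])
    except (TypeError, ValueError):
        return None
    return OutbreakReport(
        disease=data["disease"],
        location=data["location"],
        report_date=report_date,
        severity=data["severity"],
        source_credibility=data["source_credibility"],
    )
```

Rejecting malformed or incomplete output outright, rather than guessing missing values, keeps unverifiable reports out of the later stages.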
4. Clustering
Verified reports are clustered by disease, geography, and time window. Multiple articles about the same event are grouped into a single coherent outbreak signal, with a confidence score reflecting the breadth and reliability of sources.
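A minimal sketch of this grouping step follows. It assumes exact matching on disease and region, a hypothetical 7-day gap as the time window, and a toy confidence score based only on the number of distinct sources; the pipeline's actual clustering method and scoring formula are not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Report:
    disease: str
    region: str
    day: date
    source: str


def cluster_reports(reports, window_days=7):
    """Group reports by (disease, region). Within each group, start a new
    cluster whenever the gap since the previous report exceeds window_days."""
    by_key = defaultdict(list)
    for report in reports:
        by_key[(report.disease.lower(), report.region.lower())].append(report)

    clusters = []
    for group in by_key.values():
        group.sort(key=lambda r: r.day)
        current = [group[0]]
        for report in group[1:]:
            if (report.day - current[-1].day).days <= window_days:
                current.append(report)
            else:
                clusters.append(current)
                current = [report]
        clusters.append(current)
    return clusters


def confidence(cluster, saturation=10):
    """Toy score (an assumption, not the real formula): distinct sources
    divided by a saturation count, capped at 1.0."""
    distinct_sources = len({r.source for r in cluster})
    return min(1.0, distinct_sources / saturation)
```

Counting distinct sources rather than raw articles prevents one outlet republishing the same story from inflating the score.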
5. Alert Issued
Consolidated alerts are pushed to NCDC's Media Scanning and Verification Cell. Each alert includes disease, region, confidence score, source count, and links to original articles, enabling epidemiologists to act within hours of an emerging outbreak.
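The alert fields listed above map naturally onto a small record type. The sketch below is one possible shape, with a hypothetical `Alert` class, hypothetical field names, and placeholder values; how alerts are actually transmitted to the Cell is not described here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Alert:
    alert_id: str
    disease: str
    region: str
    confidence: float  # between 0.0 and 1.0
    source_count: int
    article_urls: List[str] = field(default_factory=list)


def to_json(alert: Alert) -> str:
    """Serialise an alert to JSON, keeping non-ASCII text readable."""
    return json.dumps(asdict(alert), ensure_ascii=False, sort_keys=True)


# Placeholder values for illustration only.
example = Alert(
    alert_id="ALT-EXAMPLE-0001",
    disease="Dengue",
    region="Maharashtra",
    confidence=0.9,
    source_count=2,
    article_urls=[
        "https://example.org/article-1",
        "https://example.org/article-2",
    ],
)
payload = to_json(example)
```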
In Action
Alert dashboard
A sample of the alerts surfaced by the pipeline (illustrative data).
| Alert ID | Disease | Region | Confidence | Sources | Status |
|---|---|---|---|---|---|
| ALT-2024-4821 | Dengue | Maharashtra | 94% | 37 | Verified |
| ALT-2024-4819 | Cholera | West Bengal | 88% | 22 | Verified |
| ALT-2024-4815 | Malaria | Odisha | 91% | 45 | Under Review |
| ALT-2024-4812 | Typhoid | Bihar | 76% | 14 | Verified |
| ALT-2024-4808 | Leptospirosis | Kerala | 83% | 19 | Under Review |
| ALT-2024-4805 | Dengue | Tamil Nadu | 97% | 52 | Verified |
Impact
Real-world deployment
Health Sentinel is deployed and operational within India's NCDC. The pipeline runs continuously, scanning media in real time. To date it has processed over 400 million media sources, issued more than 100,000 alerts, and contributed to the verification of over 5,000 disease outbreaks. By automating initial screening and verification, it has dramatically reduced the time between an outbreak signal appearing in the media and a public health response being initiated.
The work was published at the NLP for Positive Impact Workshop at ACL 2025.
Read the full paper
For technical details on model architecture, training, evaluation, and deployment, see the published paper.