NLP for Positive Impact Workshop, ACL 2025

Health Sentinel

An AI Pipeline for Real-time Disease Outbreak Detection

400M+ Media sources scanned
100K+ Alerts issued
5K+ Verified outbreaks

The Problem

Why real-time outbreak detection matters

India's National Centre for Disease Control (NCDC) operates a Media Scanning and Verification Cell that monitors news and social media for early signs of disease outbreaks. Before Health Sentinel, this process was largely manual: epidemiologists read through thousands of articles a day across dozens of languages, so some signals were inevitably missed and responses came late.

The challenge: build an automated, multilingual pipeline that can scan millions of media sources in real time, classify potential outbreak reports with high precision, and surface verified alerts for rapid human review, while minimising the false positives that waste epidemiologists' time.

Approach

The detection pipeline

A multi-stage architecture that progressively filters and verifies outbreak signals.

Articles pass through the stages in order: Rule-based Filters → BERT Classification → LLM Verification → Clustering → Alert Issued.

How it works

Stage-by-stage breakdown

1. Rule-based Filters

Keyword and pattern-based filters in multiple Indian languages scan 400M+ media sources, discarding clearly irrelevant content and forwarding candidate articles. This stage reduces the data volume by over 99%, making downstream ML tractable.
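As a rough illustration of this stage, the sketch below keeps an article only when it mentions both a disease term and an outbreak cue. The keyword lists, the function name is_candidate, and the sample articles are hypothetical; the source does not describe the actual rules, and this sketch uses plain substring matching for simplicity.

```python
import re

# Hypothetical keyword lists (English and Hindi); a real deployment would
# hold far larger, curated vocabularies for each supported language.
DISEASE_TERMS = ["dengue", "cholera", "malaria", "डेंगू"]
OUTBREAK_TERMS = ["outbreak", "cases", "deaths", "प्रकोप", "मामले"]


def _compile(terms):
    # One alternation of escaped terms, matched case-insensitively.
    # Substring matching is a simplification: it can over-match in English
    # (for example "cases" inside "showcases").
    return re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)


DISEASE_RE = _compile(DISEASE_TERMS)
OUTBREAK_RE = _compile(OUTBREAK_TERMS)


def is_candidate(text: str) -> bool:
    """Keep an article only if it mentions a disease and an outbreak cue."""
    return bool(DISEASE_RE.search(text)) and bool(OUTBREAK_RE.search(text))


articles = [
    "Dengue outbreak reported in Pune, 40 cases confirmed",
    "Cricket team wins the series final",
    "पटना में डेंगू के मामले बढ़े",
]
candidates = [a for a in articles if is_candidate(a)]
```

A cheap check of this kind is what lets a filter stage discard most content before any model is run; the precision work is left to the later stages.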

2. BERT Classification

A fine-tuned multilingual BERT model classifies surviving articles by disease type and outbreak likelihood. The model handles noisy web data, short social media posts, and regional language variations across Hindi, Bengali, Tamil, Telugu, and more.
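One way to picture how classifier scores could feed the next stage is sketched below: raw logits are converted to probabilities and an article is forwarded only when a disease label clears a threshold. The label set, the 0.8 threshold, and the function route are assumptions made for illustration; the model itself is not reproduced here.

```python
import math

# Hypothetical label set; the source says only that the model predicts
# disease type and outbreak likelihood.
LABELS = ["not_outbreak", "dengue", "cholera", "malaria"]


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, threshold=0.8):
    """Return (label, probability, forward) for one article's logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label, p = LABELS[best], probs[best]
    # Forward only confident, disease-bearing predictions (assumed policy).
    forward = label != "not_outbreak" and p >= threshold
    return label, p, forward


label, p, forward = route([0.1, 4.2, 0.3, 0.5])
```

Raising the threshold trades recall for precision, which matters when each forwarded article costs a more expensive verification step.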

3. LLM Verification

Large language models perform structured extraction — disease name, location, date, severity, and source credibility — and verify outbreak signals against known patterns, significantly reducing the false positive rate.
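The structured-extraction step can be pictured as asking the model for a JSON object and validating it before use. The sketch below covers only the validation side and makes no model call; the field names, the ISO date format, and the Extraction class are assumptions, not the schema used by Health Sentinel.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed JSON keys, mirroring the five attributes named in the text.
REQUIRED = ("disease", "location", "date", "severity", "source_credibility")


@dataclass
class Extraction:
    disease: str
    location: str
    event_date: date
    severity: str
    source_credibility: str


def parse_extraction(raw: str) -> Optional[Extraction]:
    """Parse a model's JSON reply; return None if it is malformed."""
    try:
        data = json.loads(raw)
        # Reject non-objects and replies with missing or empty fields.
        if not isinstance(data, dict) or any(not data.get(k) for k in REQUIRED):
            return None
        return Extraction(
            disease=str(data["disease"]).strip().lower(),
            location=str(data["location"]).strip(),
            event_date=date.fromisoformat(data["date"]),
            severity=str(data["severity"]).strip().lower(),
            source_credibility=str(data["source_credibility"]).strip().lower(),
        )
    except (ValueError, TypeError):
        # Covers invalid JSON, bad date strings, and non-string dates.
        return None


reply = (
    '{"disease": "Dengue", "location": "Pune, Maharashtra", '
    '"date": "2024-08-14", "severity": "moderate", '
    '"source_credibility": "high"}'
)
record = parse_extraction(reply)
```

Rejecting malformed replies outright, rather than repairing them, keeps unreliable extractions out of the later stages.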

4. Clustering

Verified reports are clustered by disease, geography, and time window. Multiple articles about the same event are grouped into a single coherent outbreak signal, with a confidence score reflecting the breadth and reliability of sources.
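A minimal sketch of this grouping, assuming a fixed 7-day gap rule and a toy confidence score, might look like the following. The Report class, the window length, and the scoring formula are all hypothetical; the source does not specify the clustering algorithm or how confidence is computed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import groupby


@dataclass(frozen=True)
class Report:
    disease: str
    region: str
    day: date
    source: str


def cluster_reports(reports, window_days=7):
    """Group by (disease, region); start a new cluster after a long gap."""
    ordered = sorted(reports, key=lambda r: (r.disease, r.region, r.day))
    clusters = []
    for _, group in groupby(ordered, key=lambda r: (r.disease, r.region)):
        current = []
        for r in group:
            # A gap longer than the window is treated as a separate event.
            if current and r.day - current[-1].day > timedelta(days=window_days):
                clusters.append(current)
                current = []
            current.append(r)
        clusters.append(current)
    return clusters


def confidence(cluster):
    # Hypothetical score: rises with the number of distinct sources and
    # saturates towards 1. Not the formula used by Health Sentinel.
    distinct = len({r.source for r in cluster})
    return 1 - 0.5 ** distinct


reports = [
    Report("dengue", "Maharashtra", date(2024, 8, 1), "paper-a"),
    Report("dengue", "Maharashtra", date(2024, 8, 3), "paper-b"),
    Report("dengue", "Maharashtra", date(2024, 9, 20), "paper-a"),
    Report("cholera", "West Bengal", date(2024, 8, 2), "paper-c"),
]
clusters = cluster_reports(reports)
```

Here the two early-August dengue reports merge into one signal, while the late-September report forms its own cluster because it falls outside the window.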

5. Alert Issued

Consolidated alerts are pushed to NCDC's Media Scanning and Verification Cell. Each alert includes disease, region, confidence score, source count, and links to original articles, enabling epidemiologists to act within hours of an emerging outbreak.
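The alert contents listed above could be represented as a small record that serialises to JSON. The Alert class, its field names, the identifier, and the example.org links below are hypothetical placeholders; the actual payload format is not given in the source.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Alert:
    alert_id: str
    disease: str
    region: str
    confidence: float          # assumed to be a value between 0 and 1
    source_count: int
    article_urls: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # ensure_ascii=False keeps regional-language text readable.
        return json.dumps(asdict(self), ensure_ascii=False)


alert = Alert(
    alert_id="ALT-EXAMPLE-0001",  # placeholder identifier
    disease="Dengue",
    region="Maharashtra",
    confidence=0.94,
    source_count=2,
    article_urls=[
        "https://example.org/article-1",
        "https://example.org/article-2",
    ],
)
payload = alert.to_json()
```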

In Action

Alert dashboard

A sample of the alerts surfaced by the pipeline (illustrative data).

Alert ID       Disease        Region       Confidence  Sources  Status
ALT-2024-4821  Dengue         Maharashtra  94%         37       Verified
ALT-2024-4819  Cholera        West Bengal  88%         22       Verified
ALT-2024-4815  Malaria        Odisha       91%         45       Under Review
ALT-2024-4812  Typhoid        Bihar        76%         14       Verified
ALT-2024-4808  Leptospirosis  Kerala       83%         19       Under Review
ALT-2024-4805  Dengue         Tamil Nadu   97%         52       Verified

Impact

Real-world deployment

Health Sentinel is deployed and operational within India's NCDC. The pipeline runs continuously, scanning media in real time. To date it has processed over 400 million media sources, issued more than 100,000 alerts, and contributed to the verification of over 5,000 disease outbreaks. By automating initial screening and verification, it has shortened the time between an outbreak signal appearing in the media and the start of a public health response.

The work was published at the NLP for Positive Impact Workshop at ACL 2025.

Read the full paper

For technical details on model architecture, training, evaluation, and deployment, see the published paper in the ACL Anthology.