Best Paper, Machine Learning for Health (ML4H), co-located with NeurIPS 2022

TB Treatment Adherence

Predicting which tuberculosis patients will drop off treatment before it happens, enabling targeted intervention across India's national TB programme.


678K+ patients in dataset (four Indian states, 2020 Nikshay data)

0.627 best Recall@20 (average ensemble on passive evaluation)

214% lift over random patient selection at k = 20%

5,587 patients saved from LFU over 6 months (vs. 1,808 with random targeting)

0.924 Recall@20 on drug-resistant TB (DRTB), the best performance, on the highest-risk cohort

The Problem

Why TB treatment drop-off is a public health crisis

Tuberculosis is one of the deadliest infectious diseases in the world. In 2020, it caused 1.5 million deaths, with 10 million new cases globally. India carries the world's largest TB burden, with an estimated 2.59 million cases in 2020 alone, making treatment adherence a national priority.

Drug-sensitive TB is curable through a six-month drug regimen, but treatment only works if patients complete it. Non-adherence is a leading cause of drug-resistant TB, a far harder and more expensive disease to treat. India's programme classifies a patient as loss-to-follow-up (LFU) when they miss 30 or more consecutive days of treatment.

In 2020, roughly 3% of drug-sensitive TB patients and up to 13% of drug-resistant patients became LFU. These patients often continue transmitting disease in their communities while the window for treatment closes. The downstream cost is higher mortality, drug resistance, and longer, more expensive care.

Field health workers are responsible for monitoring every patient under India's DOTS protocol, but cannot give intensive attention to every person on their caseload. The goal was to identify, early in treatment, which patients are most at risk of dropping off, so limited health worker bandwidth can be directed where it will have the greatest impact.

Data

678,952 patients across four Indian states

We used anonymised 2020 Nikshay data from Karnataka, Maharashtra, Uttar Pradesh, and West Bengal. Nikshay is India's national TB patient management system, tracking each patient longitudinally from diagnosis through treatment completion.

Uttar Pradesh: 376,028 patients, LFU rate 3.39%
Maharashtra: 157,997 patients, LFU rate 2.51%
West Bengal: 79,807 patients, LFU rate 2.05%
Karnataka: 65,120 patients, LFU rate 2.67%

Data was forward-split chronologically: the last six months of 2020 (325,190 patients) were held out as a passive evaluation split, used only for final reporting. This mimics the real deployment scenario where the model is trained on past data and scores future patients.
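The forward split described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the paper: the record layout, the `diagnosis_date` field, and the choice of diagnosis date as the split key are all assumptions made for the example. The cutoff of 1 July 2020 follows from holding out the last six months of 2020.

```python
from datetime import date

def forward_split(records, cutoff):
    """Split records chronologically into (modelling, passive_eval).

    Records dated before `cutoff` go to the modelling split; records on or
    after it are held out. `diagnosis_date` is a hypothetical field name.
    """
    modelling = [r for r in records if r["diagnosis_date"] < cutoff]
    passive_eval = [r for r in records if r["diagnosis_date"] >= cutoff]
    return modelling, passive_eval

# Toy records for illustration only; not the Nikshay schema.
patients = [
    {"id": 1, "diagnosis_date": date(2020, 2, 14)},
    {"id": 2, "diagnosis_date": date(2020, 6, 30)},
    {"id": 3, "diagnosis_date": date(2020, 7, 1)},
    {"id": 4, "diagnosis_date": date(2020, 11, 9)},
]

train, held_out = forward_split(patients, cutoff=date(2020, 7, 1))
```

Splitting on time rather than at random keeps any information from the held-out months out of training, which is what makes the evaluation resemble scoring future patients.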

System Design

From diagnosis to differentiated care

The ML model slots into the existing Nikshay workflow. Risk scores are surfaced directly in patient profiles, requiring no change to how field staff interact with the system.

TB Diagnosis (Nikshay)

A patient is diagnosed TB-positive at a health facility. Treatment is initiated and the patient is registered in Nikshay, India's national TB patient management system.

Data Collection (7 registers)

Patient demographics, facility details, comorbidities, lab results, and adherence history are recorded across seven Nikshay registers.

Risk Scoring (Recall@20: 0.62)

The AI model computes a loss-to-follow-up risk score for each patient. High-cardinality categorical features such as district and facility type are handled via similarity encoding.

Risk Stratification (top 20%)

Patients in the top 20% of risk scores are flagged as high-risk. This threshold is calibrated to match the realistic intervention bandwidth of field health workers.

Differentiated Care (intervention)

High-risk patients receive intensified monitoring: home visits, phone follow-ups, and counselling sessions. Low-risk patients continue routine care.
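The risk stratification step amounts to ranking patients by score and flagging a fixed share of them. The sketch below is a hypothetical illustration of that idea; the function name, the patient identifiers, and the scores are invented for the example and do not come from the deployed system.

```python
def flag_high_risk(scores, percent=20):
    """Return the ids of patients whose risk score is in the top `percent`.

    `scores` maps a patient id to a risk score (higher means riskier).
    Integer arithmetic avoids floating-point surprises when computing
    how many patients to flag.
    """
    n_flag = len(scores) * percent // 100
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_flag])

# Illustrative scores only.
risk_scores = {
    "p1": 0.91, "p2": 0.12, "p3": 0.47, "p4": 0.08, "p5": 0.33,
    "p6": 0.05, "p7": 0.76, "p8": 0.21, "p9": 0.02, "p10": 0.15,
}

high_risk = flag_high_risk(risk_scores)  # two of the ten patients
```

Flagging a fixed share rather than everyone above a score cutoff keeps the flagged caseload proportional to the patient list, which is the constraint the 20% threshold is meant to respect.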

Modelling

Key technical decisions

Encoding

Similarity Encoding for high-cardinality categoricals

Nearly all features are categorical (for example district, facility type, and drug regimen). Similarity encoding, which captures character-level proximity between category strings, outperformed all alternatives including target encoding, entity embeddings, and MinHash encoding, achieving a 98.73% lift over the best rule-based baseline in the modelling split.
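To make the idea of character-level proximity concrete, here is a minimal sketch of one common form of similarity encoding: each category string is represented by its character 3-gram overlap (Jaccard similarity) with every string in a reference vocabulary. The n-gram size, the Jaccard measure, and the facility names are assumptions for illustration; the paper's exact variant may differ.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def similarity_encode(value, vocabulary, n=3):
    """Encode `value` as its similarity to each reference category."""
    return [ngram_similarity(value, ref, n) for ref in vocabulary]

# Hypothetical category values, not taken from Nikshay.
vocabulary = ["Public Health Centre", "Private Clinic", "District Hospital"]

# A spelling variant lands close to its canonical category instead of
# becoming an unrelated one-hot column.
encoded = similarity_encode("Public Health Center", vocabulary)
```

Unlike one-hot encoding, near-duplicate spellings and related category names receive similar vectors, which helps when a field has many distinct and noisy values.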

Metric

Recall@k as the evaluation objective

Standard classification metrics are ill-suited to this problem: target prevalence is only 3-4%, and field workers can only act on a fixed fraction of the patient list. We use Recall@20 (the share of eventual LFU patients captured when targeting the riskiest 20% of patients) and AvRecall(10,40) (recall averaged over targeting levels from 10% to 40%) as primary metrics, matching real-world deployment constraints.
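Both metrics are straightforward to compute from labels and scores. The sketch below is a minimal version under stated assumptions: labels are 1 for LFU and 0 otherwise, ties are broken by input order, and AvRecall is averaged over a grid of 10, 20, 30, and 40 percent. The grid spacing is an assumption, since the text only gives the 10-40% range.

```python
def recall_at_k(y_true, scores, k_percent):
    """Share of positives found among the top `k_percent` highest scores."""
    n_target = len(scores) * k_percent // 100
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(y_true[i] for i in order[:n_target])
    total_positives = sum(y_true)
    return hits / total_positives if total_positives else 0.0

def av_recall(y_true, scores, lo=10, hi=40, step=10):
    """Mean of Recall@k over k = lo, lo + step, ..., hi (assumed grid)."""
    ks = range(lo, hi + 1, step)
    return sum(recall_at_k(y_true, scores, k) for k in ks) / len(ks)

# Toy example: 10 patients, 2 of whom become LFU.
y_true = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.1, 0.3, 0.2, 0.05, 0.4, 0.15, 0.6, 0.02]

r20 = recall_at_k(y_true, scores, 20)  # one of two LFU patients in top 2
avr = av_recall(y_true, scores)
```

A random ranking has an expected Recall@k of k itself (0.20 at k = 20%), which is why random targeting serves as the natural reference point.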

Models

Ensemble of gradient-boosted models

We evaluated 11 model classes including XGBoost, LightGBM, CatBoost, EBM, TabNet, and deep MLP. Gradient-boosted trees outperformed deep learning on this large real-world tabular dataset. The final average ensemble of the top 5 models achieves the best overall performance, with EBM providing interpretability alongside XGBoost.
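An average ensemble can be sketched as the per-patient mean of the individual models' scores. The example below assumes a plain arithmetic mean of predicted risk scores; whether the paper averages raw scores or ranks is not stated here, and the model names and numbers are illustrative (three models instead of five, for brevity).

```python
def average_ensemble(model_scores):
    """Average aligned per-patient scores across models.

    `model_scores` maps a model name to a list of scores, with every list
    ordered by the same patients.
    """
    columns = list(model_scores.values())
    n_patients = len(columns[0])
    if any(len(col) != n_patients for col in columns):
        raise ValueError("all models must score the same patients")
    return [sum(vals) / len(columns) for vals in zip(*columns)]

# Illustrative scores for three patients from three hypothetical models.
model_scores = {
    "xgboost": [0.8, 0.2, 0.5],
    "catboost": [0.6, 0.4, 0.5],
    "ebm": [0.7, 0.3, 0.2],
}

ensemble = average_ensemble(model_scores)
```

Averaging helps most when the member models make different mistakes, so a less accurate but weakly correlated model can still improve the combined ranking.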

Fairness

Algorithmic fairness for underserved geographies

The model's performance varied across districts and gender cohorts. We investigated two approaches: data augmentation (oversampling underperforming districts) and post-hoc fairness score reweighting (adjusting scores to equalise Recall@20 across cohorts). Both substantially raised recall in the lowest-performing districts.

Results

Model performance: Recall@20 on passive evaluation

All GBDT models significantly outperform the best rule-based baseline. The average ensemble edges out individual models by leveraging low output correlation between EBM and the tree models.

Avg. Ensemble (best): 0.627
CatBoost: 0.625
XGBoost: 0.624
LightGBM: 0.620
Random Forest: 0.613
EBM (GAM): 0.606
Regularised MLP: 0.566
TabNet: 0.564
Decision Tree: 0.469
Best Rule Baseline: 0.314
Random (k = 20%): 0.200

Recall@20 on passive evaluation split. Source: Table 3, Kulkarni et al. (ML4H 2022)
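The headline lift over random targeting follows from the values reported above. A minimal sketch, assuming "lift" means relative improvement over the reference value; the function name is illustrative.

```python
def relative_lift(value, reference):
    """Relative improvement of `value` over `reference`, in percent."""
    return (value - reference) / reference * 100

# Average ensemble Recall@20 (0.627) vs. random targeting at k = 20% (0.200).
lift_vs_random = relative_lift(0.627, 0.200)
```

This comes to about 213.5%, consistent with the reported 214% lift after rounding.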

AUC-ROC: 0.797 (33.1% better than the best baseline)

AUC-PR lift: 204% (average precision vs. the best rule baseline)

DRTB Recall@20: 0.924 (highest-risk, highest-mortality cohort)

Fairness

Equitable recall across geographies and cohorts

The base model showed performance variation across districts and gender. We addressed this with data augmentation and post-hoc fairness score reweighting, evaluated across three cohort types.

Recall@20 by cohort, before intervention, after data augmentation, and after fairness reweighting:

Underperforming districts (Recall@20 < 0.3): 0.22 before, 0.41 after augmentation, 0.51 after fairness reweighting
Gender, female cohort: 0.51 before, 0.54 after augmentation, 0.56 after fairness reweighting
Drug-resistant TB (DRTB): 0.92 before, 0.93 after augmentation, 0.92 after fairness reweighting

Source: Section 5.6 and Appendix I, Kulkarni et al. (ML4H 2022)

Impact

Deployed with India's Central TB Division

For 325,190 patients in the passive evaluation split, the average ensemble model predicts LFU early enough to enable timely intervention. At 20% patient targeting, this corresponds to saving 5,587 patients from becoming LFU over a six-month period across four states, compared to 1,808 with random targeting and 2,974 with the best rule-based baseline.

The model generalises well across time, with no significant reduction in Recall@20 month-on-month across the passive evaluation split, from July through December 2020. This robustness to distributional shift, including the heightened COVID-19 burden on the healthcare system, is a key deployment requirement.

As the official AI partner of India's Central TB Division, Wadhwani AI is working on pilots across multiple cities and states with the goal of pan-India deployment across all 780 districts. At that scale, the solution would impact over one million TB patients annually.

Risk scores are integrated directly into Nikshay patient profiles. High-risk patients are flagged for intensified care (home visits, phone follow-ups, counselling) while low-risk patients continue routine monitoring, ensuring that existing health worker bandwidth is directed where it matters most.

5,587 patients saved from LFU in 6 months across 4 states

214% lift over random targeting at 20% patient coverage

780 districts targeted for pan-India deployment

Read the full paper

Technical details on similarity encoding, fairness analysis, interpretability (PFI and LIME), and cohort-level evaluation are in the Best Paper award-winning publication at ML4H @ NeurIPS 2022.
