TB Treatment Adherence
Predicting which tuberculosis patients will drop off treatment before it happens, enabling targeted intervention across India's national TB programme.
The Problem
Why TB treatment drop-off is a public health crisis
Tuberculosis is one of the deadliest infectious diseases in the world. In 2020, it caused 1.5 million deaths from 10 million new cases globally. India carries the world's largest TB burden, with 2.59 million estimated cases in 2020 alone, making treatment adherence a national priority.
Drug-sensitive TB is curable through a six-month drug regimen, but treatment only works if patients complete it. Non-adherence is a leading cause of drug-resistant TB, a far harder and more expensive disease to treat. India's programme classifies a patient as loss-to-follow-up (LFU) when they miss 30 or more consecutive days of treatment.
In 2020, roughly 3% of drug-sensitive TB patients and up to 13% of drug-resistant patients became LFU. These patients often continue transmitting disease in their communities while the window for treatment closes. The downstream cost is higher mortality, drug resistance, and longer, more expensive care.
Field health workers are responsible for monitoring every patient under India's DOTS protocol, but cannot give intensive attention to every person on their caseload. The goal was to identify, early in treatment, which patients are most at risk of dropping off, so limited health worker bandwidth can be directed where it will have the greatest impact.
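The LFU definition above (30 or more consecutive missed days) can be sketched as a simple rule. This is a minimal illustration, not the programme's implementation: it assumes a hypothetical per-patient daily record `doses_taken`, where each entry says whether that day's dose was taken. Nikshay's actual adherence format is not described here.

```python
from typing import Sequence

# From the programme definition: 30 or more consecutive missed days.
LFU_THRESHOLD_DAYS = 30


def longest_missed_streak(doses_taken: Sequence[bool]) -> int:
    """Length of the longest run of consecutive missed days."""
    longest = current = 0
    for taken in doses_taken:
        # A taken dose resets the streak; a missed dose extends it.
        current = 0 if taken else current + 1
        longest = max(longest, current)
    return longest


def is_lfu(doses_taken: Sequence[bool]) -> bool:
    """True if the record contains a missed streak at or above the threshold."""
    return longest_missed_streak(doses_taken) >= LFU_THRESHOLD_DAYS
```

The model's task is to predict this outcome early in treatment, before the 30-day streak has had a chance to occur.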
Data
678,952 patients across four Indian states
We used anonymised 2020 Nikshay data from Karnataka, Maharashtra, Uttar Pradesh, and West Bengal. Nikshay is India's national TB patient management system, tracking each patient longitudinally from diagnosis through treatment completion.
Uttar Pradesh: 376,028 patients
Maharashtra: 157,997 patients
West Bengal: 79,807 patients
Karnataka: 65,120 patients
Data was forward-split chronologically: the last six months of 2020 (325,190 patients) were held out as a passive evaluation split, used only for final reporting. This mimics the real deployment scenario where the model is trained on past data and scores future patients.
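A forward split of this kind might look like the sketch below. The `Patient` record and its field names are hypothetical, and splitting on a registration date is an assumption; the page does not say which date field defines the split. The 1 July 2020 cutoff follows from holding out the last six months of 2020.

```python
from datetime import date
from typing import Iterable, NamedTuple


class Patient(NamedTuple):
    patient_id: str          # hypothetical field name
    registration_date: date  # hypothetical field name


# July to December 2020 is held out for passive evaluation.
EVAL_START = date(2020, 7, 1)


def forward_split(patients: Iterable[Patient], cutoff: date = EVAL_START):
    """Split by time: records before `cutoff` go to modelling, the rest to evaluation."""
    modelling, evaluation = [], []
    for p in patients:
        (modelling if p.registration_date < cutoff else evaluation).append(p)
    return modelling, evaluation
```

Unlike a random split, no patient in the evaluation set predates any patient in the modelling set, so the evaluation never benefits from information about the future.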
System Design
From diagnosis to differentiated care
The ML model slots into the existing Nikshay workflow. Risk scores are surfaced directly in patient profiles, requiring no change to how field staff interact with the system.
TB Diagnosis
Patient diagnosed TB-positive at a health facility. Treatment initiated and patient registered in Nikshay, India's national TB patient management system.
Data Collection
Patient demographics, facility details, comorbidities, lab results, and adherence history are recorded across seven Nikshay registers.
Risk Scoring
The model computes a loss-to-follow-up risk score for each patient. High-cardinality categorical features such as district and facility type are similarity-encoded before scoring.
Risk Stratification
Patients in the top 20% of risk scores are flagged as high-risk. This threshold is calibrated to match the realistic intervention bandwidth of field health workers.
Differentiated Care
High-risk patients receive intensified monitoring: home visits, phone follow-ups, and counselling sessions. Low-risk patients continue routine care.
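The risk stratification step above can be sketched as a top-fraction cut over the scored patient list. This is an illustrative version only: the function name is ours, and how the deployed system handles ties or small caseloads is not described here.

```python
import math
from typing import Sequence


def flag_high_risk(scores: Sequence[float], fraction: float = 0.20) -> list[bool]:
    """Flag the `fraction` of patients with the highest risk scores."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n_flag = math.floor(len(scores) * fraction)
    # Rank patient indices from highest to lowest score.
    # sorted() is stable, so tied scores keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flag])
    return [i in flagged for i in range(len(scores))]
```

Cutting on rank rather than on a fixed score value keeps the number of flagged patients proportional to the caseload, which is what a bandwidth-limited field team needs.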
Modelling
Key technical decisions
Encoding
Similarity Encoding for high-cardinality categoricals
Nearly all features are categorical (district, facility type, drug regimen). Similarity encoding, which captures character-level proximity between category strings, outperformed all alternatives tested, including target encoding, entity embeddings, and MinHash encoding. It achieved a 98.73% lift over the best rule-based baseline on the modelling split.
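To make the idea concrete, here is a minimal sketch of similarity encoding. It assumes a Jaccard overlap over character 3-gram sets, which is a simplified stand-in; the exact similarity function used in the paper is not stated on this page. The function names are ours.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the n-gram sets of two strings, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def similarity_encode(value: str, vocabulary: list[str]) -> list[float]:
    """Encode `value` as its similarity to every category in `vocabulary`."""
    return [ngram_similarity(value, category) for category in vocabulary]
```

Under this sketch, two district names such as "Bengaluru Urban" and "Bengaluru Rural" share most of their 3-grams and so receive nearby encodings, whereas one-hot encoding would treat them as unrelated categories.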
Metric
Recall@k as the evaluation objective
Standard classification metrics are ill-suited to this problem: target prevalence is only 3-4%, and field workers can only act on a fixed fraction of the patient list. We use Recall@20 (recall when targeting the riskiest 20% of patients) and AvRecall(10,40) (recall averaged over targeting levels from 10% to 40%) as primary metrics, matching real-world deployment constraints.
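Both metrics can be sketched in a few lines. The version below is an assumption-laden illustration: in particular, the grid of k values used for AvRecall(10,40) is not given on this page, so the 10-point step is a placeholder.

```python
from typing import Sequence


def recall_at_k(y_true: Sequence[int], scores: Sequence[float], k_percent: float) -> float:
    """Share of all true LFU patients found in the top k% of patients by score."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("recall is undefined without any positive labels")
    n_target = int(len(scores) * k_percent / 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    captured = sum(y_true[i] for i in ranked[:n_target])
    return captured / total_positives


def av_recall(y_true: Sequence[int], scores: Sequence[float],
              k_from: int = 10, k_to: int = 40, step: int = 10) -> float:
    """Mean of Recall@k over k = k_from, k_from + step, ..., k_to.

    The step size is an assumption made for this sketch.
    """
    ks = range(k_from, k_to + 1, step)
    return sum(recall_at_k(y_true, scores, k) for k in ks) / len(ks)
```

Because the denominator is the number of true LFU patients rather than the number of flagged patients, the metric directly answers the operational question: of everyone who will drop off, how many does a fixed outreach budget reach?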
Models
Ensemble of gradient-boosted models
We evaluated 11 model classes including XGBoost, LightGBM, CatBoost, EBM, TabNet, and deep MLP. Gradient-boosted trees outperformed deep learning on this large real-world tabular dataset. The final average ensemble of the top 5 models achieves the best overall performance, with EBM providing interpretability alongside XGBoost.
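An average ensemble can be sketched as a per-patient mean over the member models' scores. This assumes every model outputs scores on a comparable scale (for example, probabilities); whether the paper averages raw scores or ranks is not stated here.

```python
from typing import Sequence


def average_ensemble(model_scores: Sequence[Sequence[float]]) -> list[float]:
    """Per-patient mean of the scores produced by several models.

    `model_scores[m][i]` is model m's score for patient i.
    """
    if not model_scores:
        raise ValueError("need at least one model")
    n_patients = len(model_scores[0])
    if any(len(s) != n_patients for s in model_scores):
        raise ValueError("all models must score the same patients")
    # zip(*...) regroups the scores patient by patient.
    return [sum(per_patient) / len(model_scores) for per_patient in zip(*model_scores)]
```

Averaging helps most when the members make different mistakes, which is why low correlation between model outputs matters more than adding another near-identical model.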
Fairness
Algorithmic fairness for underserved geographies
Model performance varied across districts and gender cohorts. We investigated two mitigation approaches: data augmentation (oversampling underperforming districts) and post-hoc fairness score reweighting (adjusting scores to equalise Recall@20 across cohorts). Both substantially raised recall in the lowest-performing districts.
Results
Model performance: Recall@20 on passive evaluation
All GBDT models significantly outperform the best rule-based baseline. The average ensemble edges out individual models by leveraging low output correlation between EBM and the tree models.
Recall@20 on passive evaluation split. Source: Table 3, Kulkarni et al. (ML4H 2022)
AUC-ROC: 0.797 (33.1% better than best baseline)
AUC-PR lift: 204% (average precision vs. best rule baseline)
DRTB Recall@20: 0.924 (highest-risk, highest-mortality cohort)
Fairness
Equitable recall across geographies and cohorts
The base model showed performance variation across districts and gender. We addressed this with data augmentation and post-hoc fairness score reweighting, evaluated across three cohort types: underperforming districts (Recall@20 < 0.3), the female cohort, and drug-resistant TB (DRTB) patients.
Source: Section 5.6 and Appendix I, Kulkarni et al. (ML4H 2022)
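Identifying underperforming districts requires recall broken down by cohort. The sketch below assumes cohort recall is measured against a single global top-20% flag list; the paper may instead rank within each cohort. The function names are ours, and the 0.3 threshold is the one quoted above.

```python
from collections import defaultdict
from typing import Sequence


def cohort_recall(y_true: Sequence[int], flagged: Sequence[bool],
                  cohorts: Sequence[str]) -> dict[str, float]:
    """Recall per cohort: flagged LFU patients / all LFU patients in that cohort.

    Cohorts with no LFU patients are omitted, since recall is undefined there.
    """
    positives: dict[str, int] = defaultdict(int)
    captured: dict[str, int] = defaultdict(int)
    for label, flag, cohort in zip(y_true, flagged, cohorts):
        if label:
            positives[cohort] += 1
            captured[cohort] += int(flag)
    return {c: captured[c] / positives[c] for c in positives}


def underperforming(recalls: dict[str, float], threshold: float = 0.3) -> list[str]:
    """Cohorts whose recall falls below the threshold, in sorted order."""
    return sorted(c for c, r in recalls.items() if r < threshold)
```

A breakdown like this is the starting point for either mitigation: it tells you which districts to oversample, and which cohorts' scores a reweighting step needs to adjust.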
Impact
Deployed with India's Central TB Division
For the 325,190 patients in the passive evaluation split, the average ensemble model predicts LFU early enough to enable timely intervention. At 20% patient targeting, this corresponds to flagging for intervention 5,587 patients who would otherwise become LFU over a six-month period across the four states, compared to 1,808 with random targeting and 2,974 with the best rule-based baseline.
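The three counts above can be cross-checked with a little arithmetic. The sketch below rests on one assumption: random targeting of 20% of patients finds, in expectation, 20% of all LFU patients. The values it derives are back-of-envelope figures implied by the counts on this page, not numbers reported in the paper.

```python
n_patients = 325_190      # passive evaluation split
target_fraction = 0.20    # share of patients targeted
found_random = 1_808      # LFU patients reached by random targeting
found_baseline = 2_974    # ... by the best rule-based baseline
found_ensemble = 5_587    # ... by the average ensemble

# Assumption: random targeting reaches `target_fraction` of all LFU patients.
implied_total_lfu = found_random / target_fraction

# Implied recall at 20% targeting for each approach.
implied_recall_ensemble = found_ensemble / implied_total_lfu
implied_recall_baseline = found_baseline / implied_total_lfu

# Implied LFU prevalence in the evaluation split.
implied_prevalence = implied_total_lfu / n_patients

print(f"implied LFU total:       {implied_total_lfu:,.0f}")
print(f"implied ensemble recall: {implied_recall_ensemble:.3f}")
print(f"implied baseline recall: {implied_recall_baseline:.3f}")
print(f"implied prevalence:      {implied_prevalence:.2%}")
```

Under that assumption the split contains about 9,040 LFU patients, a prevalence near 2.8%, which is consistent with the roughly 3% figure quoted earlier. The ensemble reaches about 62% of them, against about 33% for the rule-based baseline.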
The model generalises well across time, with no significant reduction in Recall@20 month-on-month across the passive evaluation split, from July through December 2020. This robustness to distributional shift, including the heightened COVID-19 burden on the healthcare system, is a key deployment requirement.
As the official AI partner of India's Central TB Division, Wadhwani AI is working on pilots across multiple cities and states with the goal of pan-India deployment across all 780 districts. At that scale, the solution would impact over one million TB patients annually.
Risk scores are integrated directly into Nikshay patient profiles. High-risk patients are flagged for intensified care (home visits, phone follow-ups, counselling) while low-risk patients continue routine monitoring, ensuring that existing health worker bandwidth is directed where it matters most.
Read the full paper
Technical details on similarity encoding, fairness analysis, interpretability (PFI and LIME), and cohort-level evaluation are in the Best Paper award-winning publication at ML4H @ NeurIPS 2022.