Overview

Abstract

Background: Tuberculosis (TB) is a leading cause of death from infectious disease worldwide. The World Health Organization tracks treatment success rates for countries each year. We asked: what statistical model best captures the variation in these rates?

Objective: Compare three Bayesian models for country-year TB treatment success from 2012-2023: (1) a simple binomial model, (2) a beta-binomial model allowing overdispersion, and (3) a hierarchical beta-binomial model with country random effects.

Methods: We analyzed 1,862 country-years from 180 countries using WHO data. All models were fitted with MCMC via JAGS. We compared models using the Deviance Information Criterion (DIC), computed from the observed-data likelihood, together with posterior predictive checks (PPC). A reduced parameter recovery study (30, 30, and 10 replicates for M1, M2, and M3, respectively) supported inferential calibration.

Results: The hierarchical beta-binomial model (M3) is preferred.

  • DIC: M3 = 24,940 vs M2 = 27,160 vs M1 = 2,666,301
  • The binomial model (M1) severely underestimates variability
  • M3 captures both overdispersion (\(\phi = 42.9\)) and persistent country heterogeneity (\(\sigma_u = 0.72\))
  • Under M3: mortality is negatively associated with success, incidence is positively associated after controlling for country effects
  • Posterior predictive checks show that M3 calibrates the cohort-weighted mean well (Bayesian p = 0.32) and substantially improves variance recovery, although the variance check is still not perfectly centered (p = 0.04) and lower-tail miscalibration remains

Conclusion: TB treatment success varies more than binomial sampling alone can explain. Both overdispersion and persistent country effects are needed. M3 is the best model among those compared, though residual lower-tail miscalibration remains.

Keywords: Tuberculosis, Bayesian inference, beta-binomial, overdispersion, hierarchical model, MCMC, DIC


Main Finding

The hierarchical beta-binomial model (M3) is preferred among the three fitted models. It has the lowest DIC (24,940, versus 27,160 for M2 and about 2.67 million for M1) and the best posterior predictive performance of the three. M3 captures both overdispersion (\(\phi = 42.9\)) and persistent country-level heterogeneity (\(\sigma_u = 0.72\)).


Introduction

This study uses Bayesian methods to determine what drives variation in TB treatment success across countries and years.

Background

Tuberculosis (TB) is one of the leading causes of death from a single infectious agent worldwide. The World Health Organization (WHO) tracks treatment outcomes for countries each year, publishing success rates as a key indicator of national program performance.

Research Gap

WHO provides descriptive statistics, but to our knowledge no formal Bayesian model comparison has examined the statistical structure of this variability. The core question is:

Do differences in TB treatment success arise from simple sampling variability, or do we need models that account for overdispersion and persistent country effects?

Research Question

We compare three Bayesian models to find which best explains and reproduces TB treatment success.

Research question: Which Bayesian model best explains and reproduces country-year TB treatment success in 2012-2023: a binomial logistic model, a beta-binomial model, or a hierarchical beta-binomial model?

This question is:

  • Specific: It compares three well-defined models.
  • Measurable: We answer it using DIC and posterior predictive checks.
  • Fair: All models use the same locked dataset.

Data and EDA

The final dataset contains 1,862 country-years from 180 countries (2012-2023). All three models use this exact locked dataset.

Dataset Inventory

The project uses several WHO CSV tables, but all three Bayesian models are fitted only on a single locked analysis table. This ensures fair model comparison.

Raw WHO Files

File Rows Columns Main Role Key Column Groups Used in Model?
TB_outcomes_2026-04-04.csv 6,399 73 Treatment outcomes country, iso3, year, newrel_coh, newrel_succ, rel_with_new_flg Yes (response)
TB_burden_countries_2026-04-04.csv 5,347 50 Epidemiological burden e_inc_100k, e_mort_100k, c_cdr, e_tbhiv_prct Yes (predictors)
TB_data_dictionary_2026-04-04.csv 699 4 Variable definitions variable_name, definition Documentation
TB_notifications_2026-04-04.csv 9,567 216 Notification counts new_sp, new_sn, c_newinc Background only
TB_provisional_notifications_2026-04-04.csv 647 24 Provisional data m_01-m_12, q_1-q_4 Background only
main_analysis_table_locked.csv 1,862 17 Final modeling table success, cohort, predictors Yes (sole input)

Locked Analysis Table Schema

The locked table main_analysis_table_locked.csv is the sole input for M1, M2, and M3.

  • iso3: Three-letter ISO country code used to identify each country, such as AFG or ITA.
  • year: Calendar year of the observation.
  • country_id: Numeric ID used internally by JAGS to assign one country random effect to each country.
  • region_id: Numeric ID used internally by JAGS to assign the WHO region effect.
  • g_whoregion: WHO region label for the country, used to compare regions relative to the AFR baseline.
  • success: Number of patients in the cohort who completed treatment successfully.
  • cohort: Total number of patients included in the treatment outcome cohort.
  • prop_success: Observed success proportion, calculated as success divided by cohort. Used for summaries and posterior predictive checks, not as the modeled response.
  • e_inc_100k: Estimated TB incidence per 100,000 population before standardization.
  • e_mort_100k: Estimated TB mortality per 100,000 population before standardization.
  • c_cdr: Case detection ratio, measuring the estimated share of TB cases detected by the health system.
  • year_z: Standardized version of year, used so the year effect is on a comparable scale with other predictors.
  • e_inc_100k_z: Standardized TB incidence predictor used in the Bayesian models.
  • e_mort_100k_z: Standardized TB mortality predictor used in the Bayesian models.
  • c_cdr_z: Standardized case detection predictor used in the Bayesian models.
  • e_tbhiv_prct: TB-HIV percentage, kept for sensitivity analysis but not included in the primary M1/M2/M3 models.
  • used_2021_defs_flg: Indicator for observations using the post-2021 WHO treatment outcome definitions, used only in sensitivity checks.

The primary response is built from newrel_succ and newrel_coh, which are renamed in the locked table as success and cohort. The primary model predictors are year_z, e_inc_100k_z, e_mort_100k_z, c_cdr_z, and WHO region. The variable prop_success is retained only for descriptive summaries and posterior predictive checks; the models are fitted to counts, not percentages.
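The standardized predictors in the schema (year_z, e_inc_100k_z, e_mort_100k_z, c_cdr_z) are ordinary z-scores. A minimal pandas sketch with invented toy values (the column names follow the schema above; the real locked table is not reproduced here):

```python
import pandas as pd

# Illustrative z-scoring of the kind used to build year_z and the *_z predictors.
def standardize(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()

# Toy years only; the locked table spans 2012-2023 for 180 countries.
df = pd.DataFrame({"year": [2012, 2015, 2018, 2021, 2023]})
df["year_z"] = standardize(df["year"])
```

Standardizing puts the year effect on the same scale as the other predictors, which is why the models use year_z rather than raw calendar year.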

Locked Table Preview

This preview shows the actual country-year structure used by the Bayesian models.

First five rows of the locked modeling table used by M1, M2, and M3.
iso3 year country_id region_id g_whoregion success cohort prop_success e_inc_100k e_mort_100k c_cdr year_z e_inc_100k_z e_mort_100k_z c_cdr_z e_tbhiv_prct used_2021_defs_flg
AFG 2012 1 3 EMR 25128 28679 0.88 194 46 48 -1.70 0.37 0.50 -1.29 0.03 NA
AFG 2013 1 3 EMR 26733 30507 0.88 194 45 50 -1.40 0.37 0.47 -1.16 0.03 NA
AFG 2014 1 3 EMR 27553 31746 0.87 197 46 49 -1.10 0.39 0.50 -1.23 0.03 NA
AFG 2015 1 3 EMR 31631 36042 0.88 200 44 53 -0.80 0.40 0.45 -0.96 0.03 NA
AFG 2016 1 3 EMR 37418 40287 0.93 204 39 59 -0.51 0.43 0.33 -0.57 0.03 NA

Unit of Analysis

Each row represents one country in one year. The composite key is (iso3, year).

Response variable: Treatment success is modeled as counts (\(Y_{it}\) successes out of \(n_{it}\) cohort members), not percentages. This preserves the denominator for binomial-type likelihoods.

Analysis Goals

  1. Compare three Bayesian models using DIC and posterior predictive checks
  2. Determine whether binomial variance is adequate or overdispersion is required
  3. Quantify persistent country-level heterogeneity in treatment success

Filtering Pipeline

Before fitting the models, the raw merged WHO data were reduced to a clean and comparable country-year dataset. Each filter removes rows that are outside the study window, not comparable under the selected outcome definition, incomplete, invalid, or too small to model reliably.

Read the table from top to bottom. The “rows” column shows how many country-years remain after each step. The final row is the locked dataset used by all three Bayesian models.

Sample attrition at each filtering step.
step filter rows countries years
0 Raw merged rows 5130 217 24
1 Restrict to years 2012-2023 2580 215 12
2 rel_with_new_flg == 1 2138 212 12
3 Drop missing identifiers 2138 212 12
4 Valid outcomes (cohort>0, 0<=success<=cohort) 2094 207 12
5 Drop missing core predictors 2071 206 12
6 cohort >= 50 1862 180 12

What each filter means:

  • Raw merged rows: All country-year rows available after merging the relevant WHO outcome and burden tables.
  • Restrict to years 2012-2023: Keeps only the study period used in the title and analysis.
  • rel_with_new_flg == 1: Keeps rows where the selected treatment outcome definition is available and comparable.
  • Drop missing identifiers: Removes rows without essential country or year identifiers.
  • Valid outcomes: Keeps only rows where the cohort is positive and the number of successes is between 0 and the cohort size.
  • Drop missing core predictors: Removes rows missing one of the main predictors used in all primary models.
  • cohort >= 50: Removes very small cohorts because their success rates can be unstable and overly noisy.
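The filter chain above can be sketched as a sequence of row subsets. This is a minimal pandas illustration with invented rows that exercise each rule; column names follow the locked-table schema, but the data are not the WHO tables:

```python
import pandas as pd

# Toy rows: one outside the study window, one with a non-comparable definition,
# two with cohorts below 50, and one that survives every filter.
raw = pd.DataFrame({
    "iso3":             ["AAA", "AAA", "BBB", "CCC", "DDD"],
    "year":             [2010,  2015,  2015,  2015,  2015],
    "rel_with_new_flg": [1,     1,     0,     1,     1],
    "success":          [90,    85,    40,    30,    10],
    "cohort":           [100,   100,   50,    25,    12],
})

def apply_filters(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["year"].between(2012, 2023)]            # study window
    df = df[df["rel_with_new_flg"] == 1]               # comparable outcome definition
    df = df.dropna(subset=["iso3", "year"])            # identifiers present
    df = df[(df["cohort"] > 0)
            & (df["success"] >= 0)
            & (df["success"] <= df["cohort"])]         # valid outcomes
    df = df[df["cohort"] >= 50]                        # drop very small cohorts
    return df

locked = apply_filters(raw)
```

Applying the rules in this fixed order is what makes the attrition table readable top to bottom: each row count reflects all preceding filters.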

Final Sample

After filtering, the final locked dataset contains 1,862 country-years from 180 countries over 2012-2023. This same dataset is used for M1, M2, and M3, so the model comparison is fair.

The cohort-size filter removes 26 countries entirely. The most affected WHO region is AMR, where 19% of countries are lost because they do not meet the cohort threshold.

Final sample characteristics.
item value
Final number of countries 180
Final year range 2012-2023
Final number of country-years 1862
Countries removed by cohort >= 50 filter 26
Most affected region AMR
Loss percentage in most affected region 19%

Predictor Screening

Before fitting the models, we checked whether the main predictors were too strongly related to each other. This matters because highly correlated predictors can make coefficient estimates unstable and difficult to interpret.

Figure: Correlation heatmap of the main epidemiological predictors. Blue indicates positive correlation, red indicates negative correlation, and stronger color intensity means a stronger relationship. Incidence and mortality have the strongest positive correlation (r = 0.844). Year is not shown in this heatmap but was included in the VIF screening.

Collinearity conclusion: Max |r| = 0.844 and max VIF = 3.59. The incidence-mortality correlation is high but still below the pre-specified exclusion threshold of 0.85. The maximum VIF is below the common warning level of 5. Therefore, year, incidence, mortality, and case detection were retained together in all primary models.
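The VIF screening can be reproduced with ordinary least squares: VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the others. A sketch with simulated standardized predictors whose incidence-mortality correlation is set near the reported r = 0.844 (the values are illustrative, not the WHO data):

```python
import numpy as np

rng = np.random.default_rng(2026)
n = 500

# Simulated standardized predictors; inc and mort share correlation ~0.844.
inc  = rng.normal(size=n)
mort = 0.844 * inc + np.sqrt(1 - 0.844**2) * rng.normal(size=n)
year = rng.normal(size=n)
cdr  = rng.normal(size=n)
X = np.column_stack([year, inc, mort, cdr])

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2) = Var(x_j) / Var(residual_j)."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        out.append(y.var() / resid.var())
    return np.array(out)

vifs = vif(X)
```

With a true correlation of 0.844, the incidence and mortality VIFs land near 1/(1 - 0.844^2) ≈ 3.5, consistent with the reported max VIF of 3.59 staying below the warning level of 5.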

Attrition Flow

Figure: Horizontal bar chart of sample retention after each filtering step. The chart starts from the merged 2012-2023 sample and shows how many country-years remain after each filter. The final bar is the locked analysis sample of 1,862 country-years (72.2% of the 2012-2023 merged sample).

The final dataset is not an arbitrary subset. Each reduction follows a documented rule: keep the intended study years, keep comparable treatment-outcome definitions, remove invalid or incomplete records, and exclude very small cohorts. The same locked sample is used for all three Bayesian models.

Region-Year Coverage

Figure: Number of retained countries in each WHO region and year after filtering. Each cell shows how many countries from that region are available in the locked analysis table for that year. Darker green cells mean more retained countries. All six WHO regions remain represented across the full 2012-2023 period, although coverage is uneven. EUR and AFR have the largest retained counts in most years, while SEA has fewer countries because the region itself contains fewer countries. This supports the use of region fixed effects, but also shows why region-level comparisons should consider different sample sizes.

Region-Year Success Heatmap

Figure: Average treatment success by WHO region and year. Each cell shows the mean success rate for one WHO region in one year. The plot shows that treatment success is generally high, but not equal across regions. SEA, AFR, EMR, and WPR mostly stay in the low-to-high 80% range, while EUR and AMR are lower in many years. This supports including WHO region in the models, because regional differences are visible before modeling.


Exploratory Patterns

Treatment success is generally high, but it is not uniform. Success rates differ by WHO region, change over time, and include a lower tail of country-years with much poorer outcomes. These patterns suggest that a simple binomial model may be too rigid, because the data vary more than ordinary sampling noise alone would explain.

Success Rates by Region

Figure: Distribution of observed treatment success rates by WHO region. Most country-years have high success rates, mainly between about 80% and 95%, but the distributions differ by region. Some regions have wider spreads and lower tails, meaning that a number of country-years have much poorer outcomes. This suggests that ordinary binomial variation alone may be too restrictive and motivates models with overdispersion and country-level heterogeneity.

Bivariate Relationships

Figure: Bivariate relationships between treatment success and the main epidemiological predictors before Bayesian modeling. Each point is a country-year observation, colored by WHO region, and the black curve shows the overall smoothed pattern. The relationships are not purely linear and there is substantial spread around the trends. This supports using a flexible count-based modeling approach rather than relying only on simple descriptive correlations.

Taken together, the EDA shows three important features: high average success, clear regional differences, and extra variation across country-years. This is why the model comparison starts with a simple binomial model and then adds overdispersion and country-level heterogeneity.


Models and Priors

We compare three models of increasing complexity. Each model adds one new feature to test whether that feature improves fit.

Model Roadmap

The models form a deliberate sequence. Each step tests one additional source of variation.

Model Likelihood Linear Predictor Added Feature Question Answered Expected Weakness
M1 \(Y_{it} \sim \text{Binomial}(n_{it}, p_{it})\) \(\text{logit}(p_{it}) = \beta_0 + \mathbf{x}_{it}^\top\boldsymbol{\beta} + \gamma_{r[i]}\) None Is binomial variation enough? Underestimates variance if overdispersion exists
M2 \(Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi)\) \(\text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top\boldsymbol{\beta} + \gamma_{r[i]}\) \(\phi\) Does extra variability improve fit? Ignores persistent country differences
M3 \(Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi)\) \(\text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top\boldsymbol{\beta} + \gamma_{r[i]} + u_i\) \(\phi\), \(u_i \sim N(0, \sigma_u^2)\) Do countries have stable differences? More complex; may miss extreme tails

M1 assumes variation comes only from binomial sampling. M2 adds \(\phi\) to allow extra-binomial variation. M3 adds \(u_i\), a country-specific random effect, so countries can have persistent differences after adjusting for predictors and WHO region.


Common Notation

All three models share the following notation:

  • \(Y_{it}\): Number of successful treatments in country \(i\) and year \(t\)
  • \(n_{it}\): Treatment cohort size in country \(i\) and year \(t\)
  • \(i\): Country index; \(t\): Year index
  • \(\mathbf{x}_{it}\): Standardized predictors (year, incidence, mortality, case detection)
  • \(\boldsymbol{\beta}\): Predictor coefficients
  • \(\beta_0\): Baseline log-odds of treatment success
  • \(\gamma_{r[i]}\): WHO region fixed effect for country \(i\)’s region
  • \(\gamma_1 = 0\): AFR is the reference region; other region effects are relative to AFR
  • \(\text{logit}(p) = \log(p/(1-p))\): Link function connecting probability to linear predictor

M1: Binomial

\[Y_{it} \sim \text{Binomial}(n_{it}, p_{it}), \quad \text{logit}(p_{it}) = \beta_0 + \mathbf{x}_{it}^\top \boldsymbol{\beta} + \gamma_{r[i]}\]

M1 assumes that once predictors and region are included, the remaining variation is only ordinary binomial sampling variation. If M1 fails posterior predictive checks, the data vary more than this simple model allows.

What M1 tests: Whether binomial sampling variation is sufficient. If it fits poorly, the data require extra variation.
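To make the likelihood concrete, here is a minimal sketch simulating one country-year from M1. The coefficient and predictor values are invented for illustration (they loosely echo the magnitudes seen later in the recovery study) and are not posterior estimates:

```python
import math
import random

random.seed(2026)

def inv_logit(x: float) -> float:
    """logit^{-1}(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values: intercept, four predictor coefficients, AFR reference region.
beta0 = 2.0
beta = [0.04, -0.10, -0.10, 0.09]
gamma_r = 0.0                                   # gamma_1 = 0 (AFR baseline)
x = [0.5, 0.4, 0.5, -1.3]                       # standardized predictors for one row

eta = beta0 + sum(b * xi for b, xi in zip(beta, x)) + gamma_r
p = inv_logit(eta)                              # success probability
n = 1000                                        # cohort size
y = sum(random.random() < p for _ in range(n))  # one Binomial(n, p) draw
```

Under M1, all scatter in y around n*p comes from this single binomial draw; there is no extra dispersion parameter to absorb additional variability.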


M2: Beta-Binomial

\[Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi), \quad \text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top \boldsymbol{\beta} + \gamma_{r[i]}\]

M2 uses the same predictors as M1 but changes the likelihood to Beta-Binomial. We use the mean-precision parameterization. Equivalently, the success probability can be viewed as a latent probability:

\[ \theta_{it} \sim \text{Beta}(\mu_{it}\phi, (1-\mu_{it})\phi), \qquad Y_{it} \mid \theta_{it} \sim \text{Binomial}(n_{it}, \theta_{it}) \]

Here, \(\mu_{it}\) is the mean success probability and \(\phi\) is the precision parameter:

  • Smaller \(\phi\) = more extra variability
  • Larger \(\phi\) = less extra variability
  • As \(\phi \to \infty\), M2 approaches M1

Variance formula:

\[ \mathrm{Var}(Y_{it}) \;=\; n_{it} \cdot \mu_{it} \cdot (1-\mu_{it}) \;\times\; \frac{n_{it}+\phi}{1+\phi} \]

The first term, \(n_{it} \cdot \mu_{it} \cdot (1-\mu_{it})\), is the binomial variance. The second term, \((n_{it}+\phi)/(1+\phi)\), is the inflation factor. When this factor exceeds 1, M2 allows more variability than the binomial model.
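The variance formula is easy to evaluate directly. A short sketch using phi near the M3 posterior value (42.9) and an illustrative cohort of 1,000 with mean success 0.86:

```python
# Beta-binomial moments in the mean-precision parameterization used by M2 and M3.
def beta_binomial_var(n: int, mu: float, phi: float) -> float:
    """Var(Y) = n * mu * (1 - mu) * (n + phi) / (1 + phi)."""
    return n * mu * (1 - mu) * (n + phi) / (1 + phi)

def inflation_factor(n: int, phi: float) -> float:
    """The multiplier on the binomial variance; tends to 1 as phi -> infinity."""
    return (n + phi) / (1 + phi)

n, mu, phi = 1000, 0.86, 42.9        # illustrative cohort and mean; phi from M3
binom_var = n * mu * (1 - mu)
bb_var = beta_binomial_var(n, mu, phi)
factor = inflation_factor(n, phi)    # roughly 24 for these values
```

For a cohort of 1,000 the inflation factor is about 24, so even a moderate phi implies far more country-year scatter than binomial sampling alone, which is exactly what M1's failed variance checks show later.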

What M2 tests: Whether extra-binomial variation exists. If M2 fits much better than M1, overdispersion is present.


M3: Hierarchical

\[Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi), \quad \text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top \boldsymbol{\beta} + \gamma_{r[i]} + u_i, \quad u_i \sim N(0, \sigma_u^2)\]

M3 keeps the beta-binomial likelihood from M2 and adds \(u_i\), a country-specific random intercept:

  • \(u_i\): Country random effect. Positive \(u_i\) = higher success than expected; negative \(u_i\) = lower success than expected.
  • \(\sigma_u\): Standard deviation of country effects. Larger \(\sigma_u\) = stronger persistent differences between countries.

M3 separates two sources of extra variation: \(\phi\) captures general overdispersion, while \(\sigma_u\) captures systematic country differences.

What M3 tests: Whether countries have stable differences not explained by predictors or region. If M3 fits better than M2, persistent country-level heterogeneity is needed.


Prior Specification

All priors are weakly informative: they prevent unrealistic values but are wide enough for the data to drive final estimates.

Prior distributions. JAGS uses precision in dnorm(), so dnorm(0, 0.16) = Normal(0, 2.5^2) since precision = 1/2.5^2 = 0.16.
Parameter Prior JAGS syntax Meaning
beta_0, beta_j, gamma_r Normal(0, 2.5^2) dnorm(0, 0.16) Regression coefficients centered at zero; SD=2.5 is wide on logit scale
gamma_1 (AFR) Fixed at 0 gamma[1] <- 0 AFR is reference region
phi Gamma(2, 0.1) dgamma(2, 0.1) Overdispersion; prior mean 20
sigma_u Half-Normal(0, 1) dnorm(0, 1) T(0,) Country RE SD; must be positive
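The precision convention in the table is a common source of bugs: JAGS dnorm(mu, tau) takes precision tau = 1/sd^2, not the SD or the variance. A one-line converter documents the dnorm(0, 0.16) entry:

```python
# Convert a prior SD to the precision JAGS expects in dnorm(mu, tau).
def sd_to_precision(sd: float) -> float:
    return 1.0 / sd ** 2

tau_coef = sd_to_precision(2.5)   # precision for the Normal(0, 2.5^2) priors
```

Passing an SD or variance where JAGS expects a precision silently tightens or loosens every coefficient prior, so the conversion is worth checking explicitly.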

Prior Predictive Checks

A prior predictive check asks: what kind of data would the priors generate before seeing the real data? The goal is to verify that priors allow realistic patterns.

  • Left panels: Simulated mean success rates
  • Right panels: Simulated standard deviations
  • Red dashed line: Observed value from the locked dataset

Figure: Prior predictive checks. The priors generate a wide range of plausible success levels. The observed mean (0.86) falls within the prior predictive range, and the priors allow more variability than observed, which is appropriate for weakly informative priors.

The priors are broad enough to let the data dominate while preventing unrealistic values. This supports the prior choice.
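The check itself is simple to sketch: draw parameters from the stated priors and look at the success rates they imply. A minimal version for the intercept and phi (NumPy parameterizes the Gamma by scale = 1/rate, so Gamma(2, rate 0.1) becomes shape 2, scale 10):

```python
import numpy as np

rng = np.random.default_rng(2026)
S = 2000

# Draws from the stated priors: Normal(0, 2.5^2) intercept, Gamma(2, rate 0.1) phi.
beta0 = rng.normal(0.0, 2.5, size=S)
phi = rng.gamma(shape=2.0, scale=10.0, size=S)      # prior mean = 2 / 0.1 = 20

prior_mean_success = 1.0 / (1.0 + np.exp(-beta0))   # implied mean success rate

lo, hi = np.quantile(prior_mean_success, [0.025, 0.975])
covers_observed = bool(lo < 0.86 < hi)              # observed mean from the EDA
```

The prior predictive interval for the mean success rate spans nearly the full unit interval, so the observed 0.86 is comfortably inside it: broad enough to let the data dominate, as the figure above shows.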


Computation and Recovery

All models were fitted with MCMC in JAGS. M1 and M2 converged easily. M3 required a special parameterization to resolve mixing problems.

MCMC Settings

All three Bayesian models were fitted in JAGS (rjags) on the same locked 1,862-row dataset.

Setting M1 M2 M3
Chains 4 4 4 (parallel)
Adaptation 1,000 1,000 1,000
Burn-in 4,000 4,000 8,000
Post-burn-in 8,000 8,000 10,000
Thinning 1 1 10
Global seed 2026 2026 2026

M3 Fitting Challenge

M1 and M2 converged with standard parameterizations. M3 was difficult.

The problem in plain language: In M3, the intercept (\(\beta_0\)), region effects (\(\gamma_r\)), and the average of country effects within each region can trade off against each other. If the intercept increases, the region effects can decrease, and vice versa. This makes it hard for MCMC to explore the posterior efficiently.

Technical details: The plain centered parameterization had min ESS \(\approx\) 12 and max R-hat \(\approx\) 1.16 on \(\beta_0\) and \(\gamma\). Standard non-centered parameterizations did not resolve the issue because they only separate scale, not location.

The solution: We used a region-centered non-centered parameterization:

\[ u_c = \sigma_u \, (z_c - \bar{z}_{r(c)}), \quad z_c \sim N(0,1) \]

This forces the country effects within each region to sum to zero: \(\sum_{c \in r} u_c = 0\). This constraint separates region effects from country effects and resolves the identifiability problem.

Implementation efficiency: To compute the within-region mean quickly, we permuted country IDs to be contiguous within each region. This allows fast summation instead of dense matrix operations. Four chains ran in parallel with independent random seeds.
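The transform is a few lines once country IDs are contiguous within region. A NumPy sketch with a hypothetical nine-country, three-region layout and sigma_u set to the reported posterior value:

```python
import numpy as np

rng = np.random.default_rng(2026)

# Hypothetical layout: country IDs already permuted to be contiguous within region.
region_of = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])   # region index for each country
sigma_u = 0.72

z = rng.normal(size=region_of.size)                  # z_c ~ N(0, 1)

# Region-centered non-centered transform: u_c = sigma_u * (z_c - zbar_{r(c)})
region_mean = np.array([z[region_of == r].mean() for r in np.unique(region_of)])
u = sigma_u * (z - region_mean[region_of])

# The constraint that resolves identifiability: country effects sum to zero per region.
sums = np.array([u[region_of == r].sum() for r in np.unique(region_of)])
```

Because each region's country effects sum to zero by construction, the intercept and region effects can no longer trade off against the average country effect, which is the location confounding the plain non-centered parameterization left in place.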

MCMC Diagnostics

All three models passed the acceptance criteria: R-hat \(\leq\) 1.05 and ESS \(\geq\) 400 for key parameters.

Compact MCMC diagnostics for each model’s final fit. The Source column shows which diagnostic file was used. Max R-hat should be near 1.0. Min ESS should be at least 400. Status PASS means the fit is acceptable for inference.
Model Source Max_Rhat Min_ESS Status
M1 m1_extended_diagnostics_summary.csv 1.008 557.005 PASS
M2 m2_extended_diagnostics_summary.csv 1.005 484.523 PASS
M3 m3_regioncentered_full_diagnostics_summary.csv 1.007 407.148 PASS

Acceptance rule: Max R-hat \(\leq\) 1.05 and min ESS \(\geq\) 400 for key parameters (\(\beta_0\), \(\beta_{1:4}\), \(\gamma_{2:6}\), \(\phi\), \(\sigma_u\)).

Results:

  • M1: Passes easily with standard parameterization
  • M2: Passes easily with standard parameterization
  • M3: Passes with region-centered parameterization (min ESS = 407.1, max R-hat = 1.0072)
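The acceptance rule is mechanical and can be encoded directly. A sketch applying it to the Max_Rhat and Min_ESS values from the diagnostics table above:

```python
# Acceptance rule from the text: max R-hat <= 1.05 and min ESS >= 400.
def mcmc_pass(max_rhat: float, min_ess: float) -> bool:
    return max_rhat <= 1.05 and min_ess >= 400

# Values from the diagnostics table for each model's final fit.
status = {
    "M1": mcmc_pass(1.008, 557.005),
    "M2": mcmc_pass(1.005, 484.523),
    "M3": mcmc_pass(1.007, 407.148),
}
```

The same rule rejects the original centered M3 run (min ESS around 12, max R-hat around 1.16), which is what forced the reparameterization.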

Figure: M1 trace and density diagnostic for a selected monitored parameter. The trace plot shows that the MCMC samples fluctuate around a stable level without a visible trend, which is what we expect from a well-mixing chain. The density plot summarizes the posterior distribution of the parameter after sampling. Together with the R-hat and ESS results above, this supports acceptable convergence for M1.

Figure: M2 trace and density diagnostics for selected regression and overdispersion parameters. The trace plots show stable sampling with no clear drift over iterations, and the density plots show smooth posterior distributions. This indicates that the beta-binomial model mixed well and that its posterior samples can be used for inference.

Figure: M3 trace and density diagnostics for selected parameters after using the region-centered parameterization. The trace plots show stable sampling for the region effect, overdispersion parameter, and country-level heterogeneity parameter. The density plots are smooth and concentrated, suggesting that the final M3 fit mixes well. This supports the conclusion that the reparameterized hierarchical model is reliable for posterior inference.

Overall, the trace and density plots show stable sampling behavior for the monitored parameters. The visual diagnostics agree with the numerical diagnostics: all final models satisfy the convergence rule of R-hat at or below 1.05 and ESS above 400 for the key monitored parameters. Therefore, the posterior samples are acceptable for the model comparison and substantive interpretation.


Parameter Recovery

We simulated data from each model and refitted it. The 95% credible intervals captured the true values about 93-97% of the time, supporting good calibration under the simulated recovery design.

What is Parameter Recovery?

Parameter recovery tests whether our Bayesian procedure can correctly estimate parameters when we know the true values. We:

  1. Take the posterior mean parameters from the real-data fit
  2. Simulate new response data \(Y\) from the model (keeping the design fixed)
  3. Refit the model to the simulated data
  4. Check whether the new posterior intervals contain the true values

If coverage is near 95%, the inference is well-calibrated.
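The four steps above can be sketched in miniature for a single success probability. Here the conjugate Beta posterior stands in for MCMC so the loop runs quickly; the real M1-M3 study runs the same loop with JAGS refits:

```python
import numpy as np

rng = np.random.default_rng(2026)

true_p = 0.86                  # step 1: treat a fitted value as the "truth"
n_cohort, n_reps = 500, 200
covered = 0
for _ in range(n_reps):
    y = rng.binomial(n_cohort, true_p)                    # step 2: simulate data
    post = rng.beta(1 + y, 1 + n_cohort - y, size=4000)   # step 3: "refit" (conjugate draws)
    lo, hi = np.quantile(post, [0.025, 0.975])
    covered += int(lo <= true_p <= hi)                    # step 4: does the interval cover?

coverage = covered / n_reps    # should sit near the nominal 0.95
```

Coverage near 0.95 over many replicates is the calibration signature the 30/30/10 study checks for each parameter of M1, M2, and M3.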

Executed Design

Recovery replicates completed. The original target was 50 replicates per model. Due to computational cost (especially for M3), we ran 30, 30, and 10 replicates for M1, M2, and M3. This is a practical accommodation, not a methodological failure. All executed replicates converged.
model total_reps n_ok mean_runtime_s total_runtime_s failure_rate
M1 30 30 140.697 4220.897 0
M2 30 30 3084.431 92532.918 0
M3 10 10 5196.993 51969.931 0
Recovery summary. The ‘executed’ column shows successful/total replicates. Mean coverage shows how often 95 percent credible intervals contain the true value. Values near 0.95 indicate good calibration.
model executed failure_rate_pct mean_coverage_95
M1 30/30 0% 0.930
M2 30/30 0% 0.958
M3 10/10 0% 0.975

Full Recovery Results

Recovery performance by parameter. Bias shows systematic over/under-estimation. RMSE combines bias and variance. Coverage 95 shows how often intervals contain the true value.
model parameter true_value n_reps mean_estimate bias abs_bias rmse coverage_95 mean_rhat min_ess
M1 beta0 2.048 30 2.048 0.000 0.001 0.001 0.933 1.011 112.728
M1 beta[1] 0.039 30 0.038 0.000 0.000 0.000 0.867 1.002 895.780
M1 beta[2] -0.103 30 -0.103 0.000 0.000 0.000 0.967 1.004 359.979
M1 beta[3] -0.097 30 -0.097 0.000 0.001 0.001 0.967 1.009 166.757
M1 beta[4] 0.092 30 0.092 0.000 0.000 0.000 0.933 1.002 1030.725
M1 gamma[2] -1.169 30 -1.169 0.000 0.002 0.002 0.933 1.006 178.988
M1 gamma[3] 0.439 30 0.439 0.000 0.001 0.002 0.967 1.006 208.677
M1 gamma[4] -1.110 30 -1.110 0.000 0.002 0.002 0.967 1.004 180.538
M1 gamma[5] -0.288 30 -0.289 0.000 0.001 0.001 0.900 1.011 104.031
M1 gamma[6] 0.121 30 0.120 0.000 0.001 0.002 0.867 1.010 120.294
M2 beta0 1.472 30 1.472 -0.001 0.029 0.035 0.967 1.008 176.681
M2 beta[1] 0.017 30 0.021 0.004 0.013 0.016 0.933 1.001 1702.397
M2 beta[2] 0.244 30 0.235 -0.009 0.033 0.041 0.933 1.006 252.378
M2 beta[3] -0.280 30 -0.269 0.011 0.031 0.035 1.000 1.006 187.475
M2 beta[4] -0.056 30 -0.052 0.004 0.018 0.023 0.933 1.002 866.026
M2 gamma[2] -0.399 30 -0.403 -0.004 0.037 0.049 1.000 1.005 248.227
M2 gamma[3] 0.042 30 0.038 -0.004 0.049 0.059 1.000 1.005 327.746
M2 gamma[4] -0.397 30 -0.394 0.003 0.040 0.052 0.933 1.006 207.651
M2 gamma[5] 0.129 30 0.133 0.004 0.066 0.085 0.900 1.002 507.378
M2 gamma[6] -0.092 30 -0.093 -0.001 0.038 0.050 0.967 1.004 283.578
M2 phi 10.452 30 10.457 0.005 0.244 0.316 0.967 1.001 1494.996
M3 beta0 1.768 10 1.768 0.000 0.016 0.017 1.000 1.022 66.844
M3 beta[1] -0.012 10 -0.005 0.007 0.007 0.008 1.000 1.004 609.049
M3 beta[2] 0.283 10 0.301 0.018 0.051 0.060 0.900 1.043 42.105
M3 beta[3] -0.387 10 -0.392 -0.005 0.036 0.048 0.900 1.050 48.056
M3 beta[4] 0.024 10 0.015 -0.009 0.021 0.023 1.000 1.011 180.275
M3 gamma[2] -0.652 10 -0.637 0.015 0.025 0.029 1.000 1.013 83.277
M3 gamma[3] -0.152 10 -0.158 -0.006 0.033 0.045 1.000 1.010 109.472
M3 gamma[4] -0.741 10 -0.720 0.021 0.044 0.049 1.000 1.019 79.101
M3 gamma[5] 0.062 10 0.077 0.015 0.052 0.068 0.900 1.013 314.000
M3 gamma[6] -0.303 10 -0.288 0.016 0.030 0.037 1.000 1.029 112.944
M3 phi 42.861 10 42.564 -0.296 0.981 1.251 1.000 1.002 1625.500
M3 sigma_u 0.717 10 0.677 -0.040 0.040 0.043 1.000 1.015 68.918

Figure: Recovery performance measured by 95% credible-interval coverage for each parameter in M1, M2, and M3. The dashed horizontal line marks the nominal 95% target. Bars close to this line indicate that the fitted model usually recovers the true parameter value at the expected rate. Overall, coverage is close to the target for most parameters, which supports good calibration under the recovery design.

Recovery Conclusions

  • M1: Mean coverage 0.93 (10 parameters)
  • M2: Mean coverage 0.96 (11 parameters including \(\phi\))
  • M3: Mean coverage 0.97 (12 parameters including \(\phi\) and \(\sigma_u\))

All models show good calibration. Biases are small relative to posterior standard deviations. The reduced M3 design (10 replicates) provides less precision but still supports inferential credibility.


Results and Model Checking

M3 is preferred among the three fitted models. It has the lowest DIC and the best posterior predictive performance among the three models, though some residual miscalibration remains.

Model Comparison

Which Model Wins?

M3 (hierarchical beta-binomial) is preferred. The evidence is decisive.

DIC comparison. Lower DIC is better. M3 has the lowest DIC by a large margin. The difference of 2,220 between M3 and M2 is strong evidence in favor of M3.
Model D_bar D_theta_bar p_D DIC Delta_DIC Rank Evidence
M1 2666291.11 2666281.03 10.074 2666301.18 2641361.130 3 Strong evidence against
M2 27149.69 27138.85 10.844 27160.54 2220.488 2 Strong evidence against
M3 24761.35 24582.64 178.703 24940.05 0.000 1 Best model

How to interpret DIC differences:

ΔDIC Evidence
> 10 Strong evidence for lower-DIC model
5-10 Moderate evidence
< 5 Interpret cautiously

Here, ΔDIC between M3 and M2 is 2,220, which is overwhelming evidence. The gap between M2 and M1 is about 2.6 million: M1 is severely inadequate for these data.
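The quantities in the comparison table are linked by two identities: p_D = D_bar - D(theta_bar) and DIC = D_bar + p_D. A sketch reproducing the M3 row from its deviance summaries (table values are rounded, so tiny discrepancies remain):

```python
# DIC from posterior deviance summaries.
def dic_from_deviance(d_bar: float, d_theta_bar: float) -> tuple[float, float]:
    """Return (DIC, p_D) where p_D = D_bar - D(theta_bar), DIC = D_bar + p_D."""
    p_d = d_bar - d_theta_bar
    return d_bar + p_d, p_d

# M3 row of the comparison table: D_bar = 24761.35, D(theta_bar) = 24582.64.
dic_m3, p_d_m3 = dic_from_deviance(24761.35, 24582.64)
delta_m2_m3 = 27160.54 - dic_m3   # the reported gap of about 2,220
```

Note how M3's effective parameter count (p_D around 179) dwarfs M1's and M2's (around 10 each): the country random effects add many effective parameters, yet DIC still falls sharply, so the added flexibility is earning its keep.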


Posterior Predictive Checks

Posterior predictive checks (PPC) test whether the model can generate data that looks like what we observed. M3 calibrates the mean well and approximately captures the variance, while lower-tail and within-region variance miscalibration remains. M1 fails badly.

How to Read This Table

What are Bayesian p-values? For each test quantity, the Bayesian p-value reports where the observed statistic falls relative to the posterior predictive distribution. Values near 0.5 indicate good calibration; values near 0 or 1 indicate systematic mismatch.

  • p near 0.5: The model reproduces this feature well
  • p near 0 or 1: The model systematically over- or under-predicts this feature

Important: These are not frequentist p-values. We are not testing a null hypothesis. We are checking calibration.
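Operationally, each Bayesian p-value is just the share of posterior-predictive replicated statistics at or above the observed one. A sketch with synthetic stand-in draws (the replicated values here are invented, not M3 output):

```python
import numpy as np

# Posterior predictive p-value: P(T(Y_rep) >= T(Y_obs)) estimated over draws.
def bayes_p(observed: float, replicated: np.ndarray) -> float:
    return float(np.mean(np.asarray(replicated) >= observed))

rng = np.random.default_rng(2026)
reps = rng.normal(0.019, 0.001, size=4000)   # hypothetical replicated variances
p = bayes_p(0.020, reps)                     # observed statistic in the upper tail
```

Under this convention, p near 1 means the replicated statistic almost always exceeds the observed one (the model over-predicts the feature), p near 0 means it almost never does, and p near 0.5 means the observed value sits in the middle of the predictive distribution.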

Posterior predictive check results. T1a/T1b test central tendency. T2 tests variance. T3 tests the lower tail. T4a/T4b test within-region variance. Values near 0.5 indicate good calibration.
model test_quantity observed rep_mean rep_sd p_value
M1 T1a: Unweighted mean success 0.791 0.821 0.000 1.000
M1 T1b: Cohort-weighted aggregate 0.856 0.856 0.000 0.498
M1 T2: Variance of success rates 0.020 0.006 0.000 0.000
M1 T3: Count below 0.70 309.000 50.370 4.965 0.000
M1 T4a: Within-region var (equal wt) 0.016 0.001 0.000 0.000
M1 T4b: Within-region var (size wt) 0.018 0.001 0.000 0.000
M2 T1a: Unweighted mean success 0.791 0.783 0.004 0.022
M2 T1b: Cohort-weighted aggregate 0.856 0.828 0.012 0.005
M2 T2: Variance of success rates 0.020 0.017 0.001 0.000
M2 T3: Count below 0.70 309.000 455.888 22.794 1.000
M2 T4a: Within-region var (equal wt) 0.016 0.015 0.001 0.060
M2 T4b: Within-region var (size wt) 0.018 0.015 0.001 0.000
M3 T1a: Unweighted mean success 0.791 0.788 0.002 0.061
M3 T1b: Cohort-weighted aggregate 0.856 0.852 0.008 0.321
M3 T2: Variance of success rates 0.020 0.019 0.001 0.042
M3 T3: Count below 0.70 309.000 360.075 13.410 1.000
M3 T4a: Within-region var (equal wt) 0.016 0.015 0.001 0.004
M3 T4b: Within-region var (size wt) 0.018 0.017 0.001 0.019

PPC Test Statistics

The following posterior predictive plots compare the observed test statistics with their posterior predictive distributions for each model. They make the model comparison visually clear: M1 shows severe misfit, M2 improves the fit but remains imperfect, and M3 provides the best overall calibration among the three models.

Figure: Posterior predictive checks for M1. In each panel, the blue histogram and curve show the model’s replicated values, while the red vertical line shows the observed value from the real data. M1 matches the cohort-weighted mean reasonably well, but it clearly misses the variance, the number of low-success observations, and the within-region variability. This shows that the simple binomial model is too rigid for these data.

Figure: Posterior predictive checks for M2. Compared with M1, the beta-binomial model produces replicated data that are closer to the observed data for several test statistics, especially the overall spread. However, some mismatch remains, particularly for the lower-tail count and, to a lesser extent, the mean and within-region variability. This indicates that adding overdispersion improves fit, but does not fully capture the structure in the data.

Figure: Posterior predictive checks for M3. The replicated distributions are generally closer to the observed values than in M1 and M2, especially for the cohort-weighted mean and the overall variance. This makes M3 the best-fitting model among the three. Even so, the figure still shows some remaining mismatch for the lower-tail count and within-region variance, so the fit is improved but not perfect.

PPC Results by Model

M1 (Binomial): Severe misfit

  • Variance test (T2): p ≈ 0 - the binomial predicts far too little variance
  • Lower-tail test (T3): p ≈ 0 - it predicts far fewer low-success country-years
  • M1 cannot reproduce the observed data features

M2 (Beta-Binomial): Better but imperfect

  • Variance test (T2): Improved
  • Mean test (T1b): p = 0.005 - underpredicts the cohort-weighted mean (observed 0.856 vs replicated mean 0.828)
  • Lower-tail test (T3): Overshoots

M3 (Hierarchical): Best among the three

  • Mean test (T1b): p = 0.32 — well calibrated
  • Variance test (T2): p = 0.04 — much improved relative to M1 and M2, but the observed variance still falls in the upper tail of the predictive distribution rather than near its center
  • Lower-tail test (T3): p = 1.0 — M3 still tends to predict too many country-years below 0.70 success
  • Within-region tests (T4a/T4b): Some residual issues

Conclusion: M3 is the best model, but not perfect. Remaining lower-tail and within-region variance miscalibration are targets for future model extensions.

Cohort Calibration

The cohort calibration plots check whether each model behaves consistently across different cohort sizes. This is important because the response is modeled as counts, and country-years with larger cohorts should strongly influence the likelihood. These plots provide an additional visual check that complements the PPC test statistics above.

Figure: M1 cohort calibration by cohort size. Each point compares the observed success rate with the mean posterior predictive success rate for one country-year, separately for large and small cohorts. The dashed diagonal line represents perfect calibration: points close to the line mean predicted and observed values agree. Many points are far from the line, showing that the binomial model does not reproduce the observed variability well, especially across cohort-size groups.

Figure: M2 cohort calibration by cohort size. The beta-binomial model allows extra variability, so the points move closer to the diagonal line compared with M1. This reflects improved calibration: the extra dispersion parameter makes M2 less rigid than the simple binomial model. However, several points still deviate from the line, meaning that overdispersion alone does not fully explain the observed country-year differences.

Figure: M3 cohort calibration by cohort size. The hierarchical beta-binomial model gives the closest agreement between observed and predicted success rates, with many points lying near the diagonal line. This indicates that adding country random effects improves calibration for both large and small cohorts. Some dispersion remains, especially among smaller cohorts, but M3 provides the best cohort-level calibration among the three models.

M3 Posterior Predictive Density Overlay

This density overlay compares the observed distribution of treatment success with replicated datasets generated from the posterior predictive distribution of M3. It provides an intuitive final visual check of whether the preferred model reproduces the overall outcome distribution.

Figure: M3 posterior predictive density overlay. The dark red curve shows the observed distribution of treatment success rates, while the light blue curves show 100 datasets simulated from the M3 posterior predictive distribution. The model reproduces the main high-success peak reasonably well, but the simulated curves are slightly more spread out than the observed curve. This supports M3 as a good overall fit, while still showing some remaining mismatch in the lower-success part of the distribution.

PPC Visual: Variance Recovery Across Models

Figure: Variance recovery across models. The red vertical line shows the observed variance of treatment success rates, and each colored density shows the variance generated by posterior predictive simulations from one model. M1 is far to the left, meaning it strongly underestimates the observed variability. M2 and M3 are much closer to the observed variance, with M3 closest overall, although the observed variance is still near the upper edge of the M3 predictive distribution.


M3 Posterior Inference

Under M3, mortality is negatively associated with success, incidence is positively associated (after controlling for country effects), and there is substantial country-level heterogeneity.

M3 posterior summaries. Mean is the posterior mean. CrI = credible interval (Bayesian). Read signs and intervals together to interpret effects.
parameter mean median sd 95% CrI lower 95% CrI upper hpd_lower hpd_upper model label
beta0 1.768 1.767 0.034 1.702 1.835 1.702 1.835 M3 (Hierarchical) Intercept
beta[1] -0.012 -0.012 0.010 -0.032 0.009 -0.032 0.008 M3 (Hierarchical) Year (standardized)
beta[2] 0.283 0.284 0.053 0.180 0.385 0.179 0.385 M3 (Hierarchical) Incidence (standardized)
beta[3] -0.387 -0.387 0.044 -0.473 -0.300 -0.474 -0.301 M3 (Hierarchical) Mortality (standardized)
beta[4] 0.024 0.024 0.024 -0.022 0.070 -0.021 0.071 M3 (Hierarchical) Case Detection (standardized)
gamma[2] -0.652 -0.652 0.054 -0.756 -0.546 -0.757 -0.548 M3 (Hierarchical) AMR
gamma[3] -0.152 -0.151 0.048 -0.245 -0.058 -0.245 -0.058 M3 (Hierarchical) EMR
gamma[4] -0.741 -0.741 0.057 -0.853 -0.629 -0.854 -0.630 M3 (Hierarchical) EUR
gamma[5] 0.062 0.061 0.054 -0.043 0.168 -0.043 0.167 M3 (Hierarchical) SEA
gamma[6] -0.303 -0.303 0.048 -0.397 -0.209 -0.395 -0.208 M3 (Hierarchical) WPR
phi 42.861 42.842 1.690 39.621 46.258 39.555 46.169 M3 (Hierarchical) Overdispersion (phi)
sigma_u 0.717 0.715 0.041 0.639 0.804 0.634 0.798 M3 (Hierarchical) Country RE SD (sigma_u)

Key Findings

Intercept (\(\beta_0\) = 1.77): The baseline log-odds corresponds to roughly 85% success probability at reference predictor values.
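The conversion from log-odds to probability is the inverse-logit; a short Python sketch (illustrative) confirms the roughly 85% figure:

```python
import math

def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

p0 = inv_logit(1.768)  # baseline success probability, approximately 0.854
```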

Mortality (\(\beta\) = −0.39, 95% CrI: −0.47 to −0.30): Higher mortality is associated with lower treatment success. The entire posterior is below zero (P < 0 = 1.000).

Incidence (\(\beta\) = +0.28, 95% CrI: 0.18 to 0.39): Higher incidence is associated with higher success after controlling for country effects. This sign reversal suggests strong between-country confounding. Persistent low-performing countries tend to have high incidence; after country-level differences are absorbed, the conditional incidence association changes direction. P > 0 = 1.000.

Year and case detection: Both posteriors are concentrated near zero, and both 95% credible intervals include zero; neither effect is clearly distinguishable from zero.

Overdispersion (\(\phi\) = 42.9, 95% CrI: 39.6-46.3): As \(\phi \to \infty\) the beta-binomial reduces to the binomial, so a finite, precisely estimated \(\phi\) confirms overdispersion is present: the data have more variance than binomial sampling alone would predict.

Country heterogeneity (\(\sigma_u\) = 0.72, 95% CrI: 0.64-0.80): Substantial between-country variation on the logit scale. Countries differ persistently in treatment success beyond what covariates explain.
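To see how much extra variance a \(\phi\) of 42.9 implies, a sketch assuming the common mean-precision parameterization of the beta-binomial (\(\alpha = \phi p\), \(\beta = \phi(1-p)\), so \(\phi = \alpha + \beta\)) can be useful; the exact parameterization used here lives in the JAGS model files, so treat this as illustrative:

```python
# Under the mean-precision beta-binomial, Var(Y) = n p (1-p) (phi + n)/(phi + 1).
# The last factor is the inflation over plain binomial sampling.
def variance_inflation(n, phi):
    return (phi + n) / (phi + 1)

# For a cohort of n = 500 with phi = 42.9 the variance is inflated ~12x
# relative to a binomial with the same mean (n and p are hypothetical here):
f = variance_inflation(500, 42.9)
```

This is why M1's predictive variance is far too small: for large cohorts, binomial sampling noise is an order of magnitude smaller than what the data show.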


Country Random Effects

M3 estimates a random effect for each country. Countries with positive effects have persistently higher success than expected; those with negative effects have persistently lower success.

Figure: Country random effects from the M3 hierarchical model. Each point is the posterior mean country effect, and each horizontal line is its 95% credible interval on the logit scale. Countries to the right of zero have higher treatment success than expected after adjusting for region and predictors, while countries to the left of zero have lower success than expected. This plot shows persistent country-level differences, but it should be read as descriptive benchmarking, not as a causal ranking.

Highest and Lowest Country Effects

Countries with the highest positive random effects. These countries have persistently higher treatment success than expected after adjusting for region and covariates. CrI = credible interval.
iso3 mean 95% CrI lower 95% CrI upper g_whoregion
MLT 1.753 1.044 2.550 EUR
SLB 1.069 0.776 1.386 WPR
MNE 1.028 0.701 1.384 EUR
CHN 1.023 0.727 1.351 WPR
TJK 0.997 0.741 1.279 EUR
SVK 0.979 0.718 1.266 EUR
COD 0.962 0.633 1.326 AFR
SLV 0.914 0.663 1.183 AMR
TZA 0.902 0.634 1.199 AFR
KHM 0.899 0.598 1.233 WPR
Countries with the lowest (most negative) random effects. These countries have persistently lower treatment success than expected after adjusting for region and covariates. CrI = credible interval.
iso3 mean 95% CrI lower 95% CrI upper g_whoregion
HRV -1.383 -1.569 -1.197 EUR
JAM -1.461 -1.693 -1.231 AMR
BHR -1.478 -1.747 -1.208 EMR
AGO -1.488 -1.679 -1.297 AFR
DNK -1.509 -1.701 -1.319 EUR
IRL -1.741 -1.944 -1.545 EUR
NCL -1.860 -2.558 -1.161 WPR
FIN -1.926 -2.138 -1.719 EUR
FRA -2.003 -2.238 -1.781 EUR
GRC -3.633 -4.396 -2.958 EUR

Interpretation: These rankings are adjusted for region and covariates. They describe which countries over- or under-perform relative to their predicted baseline. They do not identify causes of good or poor performance.


Robustness Checks

Completed robustness checks provide supplementary support for the main findings. Two prior-sensitivity arms were specified but not executed, so they are not used as evidence for the main conclusion.

Summary

Check Status Main Conclusion
Cohort Threshold Completed (frequentist) Coefficient signs stable
TB-HIV Predictor Completed (frequentist) TB-HIV negative; main effects stable
Post-2021 Definitions Completed (frequentist) Main signs preserved
Phi Prior Specification only Not executed
Sigma Prior Specification only Not executed
Frequentist M1/M2 Completed Coefficients agree with Bayesian
M3 GLMM Failed Matrix/lme4 binary incompatibility

Three sensitivity checks were completed using frequentist refits. The main Bayesian conclusion remains based on the locked dataset and standard priors.

The M3 GLMM failure is due to a known software issue (function 'cholmod_factor_ldetA' not provided by package 'Matrix'), not a model specification problem. Since frequentist checks are supplementary, this is a non-fatal limitation.


Discussion

M3 best answers our research question. Binomial variability alone is insufficient; we need both overdispersion and persistent country effects.

Key Interpretations

Overdispersion (\(\phi = 42.9\)): Treatment success varies more than binomial sampling alone can explain. This extra variation is real and must be modeled.

Country heterogeneity (\(\sigma_u = 0.72\)): Countries have persistent performance differences that covariates and region cannot explain. This is substantively important: some countries consistently outperform or underperform their predicted baseline.

Incidence sign reversal: The incidence coefficient flips from negative in M1 to positive in M3, a classic signature of between-country confounding. Persistently low-performing countries tend to have high incidence; once country-level differences are absorbed by the random effects, the conditional incidence association changes direction.
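This mechanism can be reproduced with a toy simulation (illustrative Python, not the analysis code): two synthetic "countries" where the low-performing one has high incidence, but within each country higher incidence goes with higher success.

```python
import numpy as np

rng = np.random.default_rng(2026)

# Two synthetic "countries": country 1 performs worse AND has higher incidence
# (confounding), while the within-country incidence effect is +0.3.
n = 200
country = np.repeat([0, 1], n)
base    = np.where(country == 0, 2.0, 0.0)                 # country baseline
incid   = np.where(country == 0, 0.0, 2.0) + rng.normal(0, 0.5, 2 * n)
success = base + 0.3 * incid + rng.normal(0, 0.2, 2 * n)   # outcome on a latent scale

# Pooled slope ignores country structure: the confounding dominates
pooled_slope = np.polyfit(incid, success, 1)[0]            # negative

# Average within-country slope recovers the conditional association
within_slope = np.mean([np.polyfit(incid[country == c],
                                   success[country == c], 1)[0]
                        for c in (0, 1)])                  # near +0.3
```

The pooled fit (analogous to M1, which has no country effects) picks up the negative between-country gradient; conditioning on country (as M3's random effects do) recovers the positive within-country association.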

Limitations

  • Ecological fallacy: Country-level data cannot prove individual-level causal claims
  • Association, not causation: All results are associative
  • Residual miscalibration: M3 still misfits the lower tail (T3 p = 1.0) and within-region variance
  • Reporting heterogeneity: Countries differ in data quality
  • Missingness: Filtered observations may not be missing at random
  • Outcome-definition changes: The shift in WHO treatment-outcome definitions over time may introduce discontinuities. The rel_with_new_flg == 1 filter improves comparability, and the post-2021 definition check is used as a sensitivity analysis, but definition changes cannot be ruled out completely.

Future Work

Extensions could include region-varying overdispersion, temporal random effects, mixture models for the lower tail, and out-of-sample prediction.


Conclusion

The hierarchical beta-binomial model (M3) is preferred among the three fitted models for modeling country-year TB treatment success. It captures both overdispersion (\(\phi = 42.9\)) and persistent country heterogeneity (\(\sigma_u = 0.72\)), with the lowest DIC (24,940) and the best posterior predictive performance among the three models.

The evidence is decisive: M1 (binomial) severely underestimates variability (DIC = 2.6 million), and M2 (beta-binomial) improves fit but still lacks country structure (DIC = 27,160). M3’s additional complexity is justified by substantially better fit.

Limitations and future work: M3 is the best among these three models, not a perfect model. Some lower-tail and within-region variance miscalibration remains. Future extensions could include region-varying overdispersion, temporal random effects, or mixture models to better capture the lower tail.


Reproducibility

This analysis can be reproduced by running src/main.R.

Data Provenance

Item Value
Source World Health Organization Global TB Programme
Download date 2026-04-04
Year window 2012-2023
Final sample 1,862 country-years, 180 countries
Cohort threshold cohort >= 50
Global seed 2026

File Structure

Location Contents
src/main.R Main analysis script
src/models/ JAGS model files
src/outputs/tables/ CSV tables
src/outputs/figures/ Figures
data/data_processed/ Locked analysis table

How to Reproduce

Regenerate the HTML report:

Rscript -e "rmarkdown::render('src/report/report.Rmd', output_format = 'html_document', output_file = 'Arman_Feili_FSL2_Final_Report.html', output_dir = 'src/report')"

Run the full analysis:

Rscript src/main.R

Software

  • R: R version 4.3.1 (2023-06-16)
  • Platform: x86_64-apple-darwin20 (64-bit)
  • JAGS: 4.3.2
  • rjags: 4.17
  • coda: 0.19.4.1
  • ggplot2: 3.5.2
  • dplyr: 1.1.4
  • knitr: 1.50
  • rmarkdown: 2.29

References

  1. World Health Organization. Global Tuberculosis Programme data. Downloaded 2026-04-04. Available at: https://www.who.int/teams/global-tuberculosis-programme/data

  2. Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 2003.

  3. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 2002; 64(4):583-639.

  4. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis, 3rd ed. Chapman and Hall/CRC, 2013.