Overview

Abstract

Background: Tuberculosis (TB) is a leading cause of death from infectious disease worldwide. The World Health Organization tracks treatment success rates for countries each year. We asked: what statistical model best captures the variation in these rates?

Objective: Compare three Bayesian models for country-year TB treatment success from 2012-2023: (1) a simple binomial model, (2) a beta-binomial model allowing overdispersion, and (3) a hierarchical beta-binomial model with country random effects.

Methods: We analyzed 1,862 country-years from 180 countries using WHO data. All models were fitted with MCMC via JAGS. We compared models using the Deviance Information Criterion (DIC), computed from the observed-data likelihood, together with posterior predictive checks (PPC). A reduced parameter recovery study (30, 30, and 10 replicates for M1, M2, and M3, respectively) supported inferential calibration.

Results: The hierarchical beta-binomial model (M3) is preferred.

  • DIC: M3 = 24,940 vs M2 = 27,160 vs M1 = 2,666,301
  • The binomial model (M1) severely underestimates variability
  • M3 captures both overdispersion (\(\phi = 42.9\)) and persistent country heterogeneity (\(\sigma_u = 0.72\))
  • Under M3: mortality is negatively associated with success, incidence is positively associated after controlling for country effects
  • Posterior predictive checks show that M3 calibrates the cohort-weighted mean well (Bayesian p = 0.32) and substantially improves variance recovery, although the variance check is still not perfectly centered (p = 0.04) and lower-tail miscalibration remains

Conclusion: TB treatment success varies more than binomial sampling alone can explain. Both overdispersion and persistent country effects are needed. M3 is the best model among those compared, though residual lower-tail miscalibration remains.

Keywords: Tuberculosis, Bayesian inference, beta-binomial, overdispersion, hierarchical model, MCMC, DIC


Main Finding

The hierarchical beta-binomial model (M3) is preferred among the three fitted models. It has the lowest DIC (24,940, versus 27,160 for M2 and about 2.67 million for M1) and the best posterior predictive performance of the three. M3 captures both overdispersion (\(\phi = 42.9\)) and persistent country-level heterogeneity (\(\sigma_u = 0.72\)).


Introduction

This study uses Bayesian methods to determine what drives variation in TB treatment success across countries and years.

Background

Tuberculosis (TB) is one of the leading causes of death from a single infectious agent worldwide. The World Health Organization (WHO) tracks treatment outcomes for countries each year, publishing success rates as a key indicator of national program performance.

Research Gap

WHO provides descriptive statistics, but to our knowledge no formal Bayesian model comparison has examined the statistical structure of this variability. The core question is:

Do differences in TB treatment success arise from simple sampling variability, or do we need models that account for overdispersion and persistent country effects?

Research Question

We compare three Bayesian models to find which best explains and reproduces TB treatment success.

Research question: Which Bayesian model best explains and reproduces country-year TB treatment success in 2012-2023: a binomial logistic model, a beta-binomial model, or a hierarchical beta-binomial model?

This question is:

  • Specific: It compares three well-defined models.
  • Measurable: We answer it using DIC and posterior predictive checks.
  • Fair: All models use the same locked dataset.

Data and EDA

The final dataset contains 1,862 country-years from 180 countries (2012-2023). All three models use this exact locked dataset.

Dataset Inventory

The project uses several WHO CSV tables, but all three Bayesian models are fitted only on a single locked analysis table. This ensures fair model comparison.

Raw WHO Files

File Rows Columns Main Role Key Column Groups Used in Model?
TB_outcomes_2026-04-04.csv 6,399 73 Treatment outcomes country, iso3, year, newrel_coh, newrel_succ, rel_with_new_flg Yes (response)
TB_burden_countries_2026-04-04.csv 5,347 50 Epidemiological burden e_inc_100k, e_mort_100k, c_cdr, e_tbhiv_prct Yes (predictors)
TB_data_dictionary_2026-04-04.csv 699 4 Variable definitions variable_name, definition Documentation
TB_notifications_2026-04-04.csv 9,567 216 Notification counts new_sp, new_sn, c_newinc Background only
TB_provisional_notifications_2026-04-04.csv 647 24 Provisional data m_01-m_12, q_1-q_4 Background only
main_analysis_table_locked.csv 1,862 17 Final modeling table success, cohort, predictors Yes (sole input)

Locked Analysis Table Schema

The locked table main_analysis_table_locked.csv is the sole input for M1, M2, and M3.

  • iso3: Three-letter ISO country code used to identify each country, such as AFG or ITA.
  • year: Calendar year of the observation.
  • country_id: Numeric ID used internally by JAGS to assign one country random effect to each country.
  • region_id: Numeric ID used internally by JAGS to assign the WHO region effect.
  • g_whoregion: WHO region label for the country, used to compare regions relative to the AFR baseline.
  • success: Number of patients in the cohort who completed treatment successfully.
  • cohort: Total number of patients included in the treatment outcome cohort.
  • prop_success: Observed success proportion, calculated as success divided by cohort. Used for summaries and posterior predictive checks, not as the modeled response.
  • e_inc_100k: Estimated TB incidence per 100,000 population before standardization.
  • e_mort_100k: Estimated TB mortality per 100,000 population before standardization.
  • c_cdr: Case detection ratio, measuring the estimated share of TB cases detected by the health system.
  • year_z: Standardized version of year, used so the year effect is on a comparable scale with other predictors.
  • e_inc_100k_z: Standardized TB incidence predictor used in the Bayesian models.
  • e_mort_100k_z: Standardized TB mortality predictor used in the Bayesian models.
  • c_cdr_z: Standardized case detection predictor used in the Bayesian models.
  • e_tbhiv_prct: TB-HIV percentage, kept for sensitivity analysis but not included in the primary M1/M2/M3 models.
  • used_2021_defs_flg: Indicator for observations using the post-2021 WHO treatment outcome definitions, used only in sensitivity checks.

The primary response is built from newrel_succ and newrel_coh, which are renamed in the locked table as success and cohort. The primary model predictors are year_z, e_inc_100k_z, e_mort_100k_z, c_cdr_z, and WHO region. The variable prop_success is retained only for descriptive summaries and posterior predictive checks; the models are fitted to counts, not percentages.
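The standardized predictors in the schema (year_z, e_inc_100k_z, e_mort_100k_z, c_cdr_z) are ordinary z-scores. A minimal pandas sketch with invented toy values (the column names follow the schema above; the real locked table is not reproduced here):

```python
import pandas as pd

# Illustrative z-scoring of the kind used to build year_z and the *_z predictors.
def standardize(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()

# Toy years only; the locked table spans 2012-2023 for 180 countries.
df = pd.DataFrame({"year": [2012, 2015, 2018, 2021, 2023]})
df["year_z"] = standardize(df["year"])
```

Standardizing puts the year effect on the same scale as the other predictors, which is why the models use year_z rather than raw calendar year.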

Locked Table Preview

This preview shows the actual country-year structure used by the Bayesian models.

First five rows of the locked modeling table used by M1, M2, and M3.
iso3 year country_id region_id g_whoregion success cohort prop_success e_inc_100k e_mort_100k c_cdr year_z e_inc_100k_z e_mort_100k_z c_cdr_z e_tbhiv_prct used_2021_defs_flg
AFG 2012 1 3 EMR 25128 28679 0.88 194 46 48 -1.70 0.37 0.50 -1.29 0.03 NA
AFG 2013 1 3 EMR 26733 30507 0.88 194 45 50 -1.40 0.37 0.47 -1.16 0.03 NA
AFG 2014 1 3 EMR 27553 31746 0.87 197 46 49 -1.10 0.39 0.50 -1.23 0.03 NA
AFG 2015 1 3 EMR 31631 36042 0.88 200 44 53 -0.80 0.40 0.45 -0.96 0.03 NA
AFG 2016 1 3 EMR 37418 40287 0.93 204 39 59 -0.51 0.43 0.33 -0.57 0.03 NA

Unit of Analysis

Each row represents one country in one year. The composite key is (iso3, year).

Response variable: Treatment success is modeled as counts (\(Y_{it}\) successes out of \(n_{it}\) cohort members), not percentages. This preserves the denominator for binomial-type likelihoods.

Analysis Goals

  1. Compare three Bayesian models using DIC and posterior predictive checks
  2. Determine whether binomial variance is adequate or overdispersion is required
  3. Quantify persistent country-level heterogeneity in treatment success

Filtering Pipeline

Before fitting the models, the raw merged WHO data were reduced to a clean and comparable country-year dataset. Each filter removes rows that are outside the study window, not comparable under the selected outcome definition, incomplete, invalid, or too small to model reliably.

Read the table from top to bottom. The “rows” column shows how many country-years remain after each step. The final row is the locked dataset used by all three Bayesian models.

Sample attrition at each filtering step.
step filter rows countries years
0 Raw merged rows 5130 217 24
1 Restrict to years 2012-2023 2580 215 12
2 rel_with_new_flg == 1 2138 212 12
3 Drop missing identifiers 2138 212 12
4 Valid outcomes (cohort>0, 0<=success<=cohort) 2094 207 12
5 Drop missing core predictors 2071 206 12
6 cohort >= 50 1862 180 12

What each filter means:

  • Raw merged rows: All country-year rows available after merging the relevant WHO outcome and burden tables.
  • Restrict to years 2012-2023: Keeps only the study period used in the title and analysis.
  • rel_with_new_flg == 1: Keeps rows where the selected treatment outcome definition is available and comparable.
  • Drop missing identifiers: Removes rows without essential country or year identifiers.
  • Valid outcomes: Keeps only rows where the cohort is positive and the number of successes is between 0 and the cohort size.
  • Drop missing core predictors: Removes rows missing one of the main predictors used in all primary models.
  • cohort >= 50: Removes very small cohorts because their success rates can be unstable and overly noisy.
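The filter chain above can be sketched as a sequence of row subsets. This is a minimal pandas illustration with invented rows that exercise each rule; column names follow the locked-table schema, but the data are not the WHO tables:

```python
import pandas as pd

# Toy rows: one outside the study window, one with a non-comparable definition,
# two with cohorts below 50, and one that survives every filter.
raw = pd.DataFrame({
    "iso3":             ["AAA", "AAA", "BBB", "CCC", "DDD"],
    "year":             [2010,  2015,  2015,  2015,  2015],
    "rel_with_new_flg": [1,     1,     0,     1,     1],
    "success":          [90,    85,    40,    30,    10],
    "cohort":           [100,   100,   50,    25,    12],
})

def apply_filters(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["year"].between(2012, 2023)]            # study window
    df = df[df["rel_with_new_flg"] == 1]               # comparable outcome definition
    df = df.dropna(subset=["iso3", "year"])            # identifiers present
    df = df[(df["cohort"] > 0)
            & (df["success"] >= 0)
            & (df["success"] <= df["cohort"])]         # valid outcomes
    df = df[df["cohort"] >= 50]                        # drop very small cohorts
    return df

locked = apply_filters(raw)
```

Applying the rules in this fixed order is what makes the attrition table readable top to bottom: each row count reflects all preceding filters.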

Final Sample

After filtering, the final locked dataset contains 1,862 country-years from 180 countries over 2012-2023. This same dataset is used for M1, M2, and M3, so the model comparison is fair.

The cohort-size filter removes 26 countries entirely. The most affected WHO region is AMR, where 19% of countries are lost because they do not meet the cohort threshold.

Final sample characteristics.
item value
Final number of countries 180
Final year range 2012-2023
Final number of country-years 1862
Countries removed by cohort >= 50 filter 26
Most affected region AMR
Loss percentage in most affected region 19%

Predictor Screening

Before fitting the models, we checked whether the main predictors were too strongly related to each other. This matters because highly correlated predictors can make coefficient estimates unstable and difficult to interpret.

Figure: Correlation heatmap of the main epidemiological predictors. Blue indicates positive correlation, red indicates negative correlation, and stronger color intensity means a stronger relationship. Incidence and mortality have the strongest positive correlation (r = 0.844). Year is not shown in this heatmap but was included in the VIF screening.

Collinearity conclusion: Max |r| = 0.844 and max VIF = 3.59. The incidence-mortality correlation is high but still below the pre-specified exclusion threshold of 0.85. The maximum VIF is below the common warning level of 5. Therefore, year, incidence, mortality, and case detection were retained together in all primary models.
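The VIF screening can be reproduced with ordinary least squares: VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the others. A sketch with simulated standardized predictors whose incidence-mortality correlation is set near the reported r = 0.844 (the values are illustrative, not the WHO data):

```python
import numpy as np

rng = np.random.default_rng(2026)
n = 500

# Simulated standardized predictors; inc and mort share correlation ~0.844.
inc  = rng.normal(size=n)
mort = 0.844 * inc + np.sqrt(1 - 0.844**2) * rng.normal(size=n)
year = rng.normal(size=n)
cdr  = rng.normal(size=n)
X = np.column_stack([year, inc, mort, cdr])

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2) = Var(x_j) / Var(residual_j)."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        out.append(y.var() / resid.var())
    return np.array(out)

vifs = vif(X)
```

With a true correlation of 0.844, the incidence and mortality VIFs land near 1/(1 - 0.844^2) ≈ 3.5, consistent with the reported max VIF of 3.59 staying below the warning level of 5.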

Attrition Flow

Figure: Horizontal bar chart of sample retention after each filtering step. The chart starts from the merged 2012-2023 sample and shows how many country-years remain after each filter. The final bar is the locked analysis sample of 1,862 country-years (72.2% of the 2012-2023 merged sample).

The final dataset is not an arbitrary subset. Each reduction follows a documented rule: keep the intended study years, keep comparable treatment-outcome definitions, remove invalid or incomplete records, and exclude very small cohorts. The same locked sample is used for all three Bayesian models.

Region-Year Coverage

Figure: Number of retained countries in each WHO region and year after filtering. Each cell shows how many countries from that region are available in the locked analysis table for that year. Darker green cells mean more retained countries. All six WHO regions remain represented across the full 2012-2023 period, although coverage is uneven. EUR and AFR have the largest retained counts in most years, while SEA has fewer countries because the region itself contains fewer countries. This supports the use of region fixed effects, but also shows why region-level comparisons should consider different sample sizes.

Region-Year Success Heatmap

Figure: Average treatment success by WHO region and year. Each cell shows the mean success rate for one WHO region in one year. The plot shows that treatment success is generally high, but not equal across regions. SEA, AFR, EMR, and WPR mostly stay in the low-to-high 80% range, while EUR and AMR are lower in many years. This supports including WHO region in the models, because regional differences are visible before modeling.


Exploratory Patterns

Treatment success is generally high, but it is not uniform. Success rates differ by WHO region, change over time, and include a lower tail of country-years with much poorer outcomes. These patterns suggest that a simple binomial model may be too rigid, because the data vary more than ordinary sampling noise alone would explain.

Success Rates by Region

Figure: Distribution of observed treatment success rates by WHO region. Most country-years have high success rates, mainly between about 80% and 95%, but the distributions differ by region. Some regions have wider spreads and lower tails, meaning that a number of country-years have much poorer outcomes. This suggests that ordinary binomial variation alone may be too restrictive and motivates models with overdispersion and country-level heterogeneity.

Bivariate Relationships

Figure: Bivariate relationships between treatment success and the main epidemiological predictors before Bayesian modeling. Each point is a country-year observation, colored by WHO region, and the black curve shows the overall smoothed pattern. The relationships are not purely linear and there is substantial spread around the trends. This supports using a flexible count-based modeling approach rather than relying only on simple descriptive correlations.

Taken together, the EDA shows three important features: high average success, clear regional differences, and extra variation across country-years. This is why the model comparison starts with a simple binomial model and then adds overdispersion and country-level heterogeneity.


Models and Priors

We compare three models of increasing complexity. Each model adds one new feature to test whether that feature improves fit.

Model Roadmap

The models form a deliberate sequence. Each step tests one additional source of variation.

Model Likelihood Linear Predictor Added Feature Question Answered Expected Weakness
M1 \(Y_{it} \sim \text{Binomial}(n_{it}, p_{it})\) \(\text{logit}(p_{it}) = \beta_0 + \mathbf{x}_{it}^\top\boldsymbol{\beta} + \gamma_{r[i]}\) None Is binomial variation enough? Underestimates variance if overdispersion exists
M2 \(Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi)\) \(\text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top\boldsymbol{\beta} + \gamma_{r[i]}\) \(\phi\) Does extra variability improve fit? Ignores persistent country differences
M3 \(Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi)\) \(\text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top\boldsymbol{\beta} + \gamma_{r[i]} + u_i\) \(\phi\), \(u_i \sim N(0, \sigma_u^2)\) Do countries have stable differences? More complex; may miss extreme tails

M1 assumes variation comes only from binomial sampling. M2 adds \(\phi\) to allow extra-binomial variation. M3 adds \(u_i\), a country-specific random effect, so countries can have persistent differences after adjusting for predictors and WHO region.


Common Notation

All three models share the following notation:

  • \(Y_{it}\): Number of successful treatments in country \(i\) and year \(t\)
  • \(n_{it}\): Treatment cohort size in country \(i\) and year \(t\)
  • \(i\): Country index; \(t\): Year index
  • \(\mathbf{x}_{it}\): Standardized predictors (year, incidence, mortality, case detection)
  • \(\boldsymbol{\beta}\): Predictor coefficients
  • \(\beta_0\): Baseline log-odds of treatment success
  • \(\gamma_{r[i]}\): WHO region fixed effect for country \(i\)’s region
  • \(\gamma_1 = 0\): AFR is the reference region; other region effects are relative to AFR
  • \(\text{logit}(p) = \log(p/(1-p))\): Link function connecting probability to linear predictor

M1: Binomial

\[Y_{it} \sim \text{Binomial}(n_{it}, p_{it}), \quad \text{logit}(p_{it}) = \beta_0 + \mathbf{x}_{it}^\top \boldsymbol{\beta} + \gamma_{r[i]}\]

M1 assumes that once predictors and region are included, the remaining variation is only ordinary binomial sampling variation. If M1 fails posterior predictive checks, the data vary more than this simple model allows.

What M1 tests: Whether binomial sampling variation is sufficient. If it fits poorly, the data require extra variation.
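To make the likelihood concrete, here is a minimal sketch simulating one country-year from M1. The coefficient and predictor values are invented for illustration (they loosely echo the magnitudes seen later in the recovery study) and are not posterior estimates:

```python
import math
import random

random.seed(2026)

def inv_logit(x: float) -> float:
    """logit^{-1}(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values: intercept, four predictor coefficients, AFR reference region.
beta0 = 2.0
beta = [0.04, -0.10, -0.10, 0.09]
gamma_r = 0.0                                   # gamma_1 = 0 (AFR baseline)
x = [0.5, 0.4, 0.5, -1.3]                       # standardized predictors for one row

eta = beta0 + sum(b * xi for b, xi in zip(beta, x)) + gamma_r
p = inv_logit(eta)                              # success probability
n = 1000                                        # cohort size
y = sum(random.random() < p for _ in range(n))  # one Binomial(n, p) draw
```

Under M1, all scatter in y around n*p comes from this single binomial draw; there is no extra dispersion parameter to absorb additional variability.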


M2: Beta-Binomial

\[Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi), \quad \text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top \boldsymbol{\beta} + \gamma_{r[i]}\]

M2 uses the same predictors as M1 but changes the likelihood to Beta-Binomial. We use the mean-precision parameterization. Equivalently, the success probability can be viewed as a latent probability:

\[ \theta_{it} \sim \text{Beta}(\mu_{it}\phi, (1-\mu_{it})\phi), \qquad Y_{it} \mid \theta_{it} \sim \text{Binomial}(n_{it}, \theta_{it}) \]

Here, \(\mu_{it}\) is the mean success probability and \(\phi\) is the precision parameter:

  • Smaller \(\phi\) = more extra variability
  • Larger \(\phi\) = less extra variability
  • As \(\phi \to \infty\), M2 approaches M1

Variance formula:

\[ \mathrm{Var}(Y_{it}) \;=\; n_{it} \cdot \mu_{it} \cdot (1-\mu_{it}) \;\times\; \frac{n_{it}+\phi}{1+\phi} \]

The first term, \(n_{it} \cdot \mu_{it} \cdot (1-\mu_{it})\), is the binomial variance. The second term, \((n_{it}+\phi)/(1+\phi)\), is the inflation factor. When this factor exceeds 1, M2 allows more variability than the binomial model.
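The variance formula is easy to evaluate directly. A short sketch using phi near the M3 posterior value (42.9) and an illustrative cohort of 1,000 with mean success 0.86:

```python
# Beta-binomial moments in the mean-precision parameterization used by M2 and M3.
def beta_binomial_var(n: int, mu: float, phi: float) -> float:
    """Var(Y) = n * mu * (1 - mu) * (n + phi) / (1 + phi)."""
    return n * mu * (1 - mu) * (n + phi) / (1 + phi)

def inflation_factor(n: int, phi: float) -> float:
    """The multiplier on the binomial variance; tends to 1 as phi -> infinity."""
    return (n + phi) / (1 + phi)

n, mu, phi = 1000, 0.86, 42.9        # illustrative cohort and mean; phi from M3
binom_var = n * mu * (1 - mu)
bb_var = beta_binomial_var(n, mu, phi)
factor = inflation_factor(n, phi)    # roughly 24 for these values
```

For a cohort of 1,000 the inflation factor is about 24, so even a moderate phi implies far more country-year scatter than binomial sampling alone, which is exactly what M1's failed variance checks show later.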

What M2 tests: Whether extra-binomial variation exists. If M2 fits much better than M1, overdispersion is present.


M3: Hierarchical

\[Y_{it} \sim \text{Beta-Binomial}(n_{it}, \mu_{it}, \phi), \quad \text{logit}(\mu_{it}) = \beta_0 + \mathbf{x}_{it}^\top \boldsymbol{\beta} + \gamma_{r[i]} + u_i, \quad u_i \sim N(0, \sigma_u^2)\]

M3 keeps the beta-binomial likelihood from M2 and adds \(u_i\), a country-specific random intercept:

  • \(u_i\): Country random effect. Positive \(u_i\) = higher success than expected; negative \(u_i\) = lower success than expected.
  • \(\sigma_u\): Standard deviation of country effects. Larger \(\sigma_u\) = stronger persistent differences between countries.

M3 separates two sources of extra variation: \(\phi\) captures general overdispersion, while \(\sigma_u\) captures systematic country differences.

What M3 tests: Whether countries have stable differences not explained by predictors or region. If M3 fits better than M2, persistent country-level heterogeneity is needed.


Prior Specification

All priors are weakly informative: they prevent unrealistic values but are wide enough for the data to drive final estimates.

Prior distributions. JAGS uses precision in dnorm(), so dnorm(0, 0.16) = Normal(0, 2.5^2) since precision = 1/2.5^2 = 0.16.
Parameter Prior JAGS syntax Meaning
beta_0, beta_j, gamma_r Normal(0, 2.5^2) dnorm(0, 0.16) Regression coefficients centered at zero; SD=2.5 is wide on logit scale
gamma_1 (AFR) Fixed at 0 gamma[1] <- 0 AFR is reference region
phi Gamma(2, 0.1) dgamma(2, 0.1) Overdispersion; prior mean 20
sigma_u Half-Normal(0, 1) dnorm(0, 1) T(0,) Country RE SD; must be positive
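The precision convention in the table is a common source of bugs: JAGS dnorm(mu, tau) takes precision tau = 1/sd^2, not the SD or the variance. A one-line converter documents the dnorm(0, 0.16) entry:

```python
# Convert a prior SD to the precision JAGS expects in dnorm(mu, tau).
def sd_to_precision(sd: float) -> float:
    return 1.0 / sd ** 2

tau_coef = sd_to_precision(2.5)   # precision for the Normal(0, 2.5^2) priors
```

Passing an SD or variance where JAGS expects a precision silently tightens or loosens every coefficient prior, so the conversion is worth checking explicitly.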

Prior Predictive Checks

A prior predictive check asks: what kind of data would the priors generate before seeing the real data? The goal is to verify that priors allow realistic patterns.

  • Left panels: Simulated mean success rates
  • Right panels: Simulated standard deviations
  • Red dashed line: Observed value from the locked dataset

Figure: Prior predictive checks. The priors generate a wide range of plausible success levels. The observed mean (0.86) falls within the prior predictive range, and the priors allow more variability than observed, which is appropriate for weakly informative priors.

The priors are broad enough to let the data dominate while preventing unrealistic values. This supports the prior choice.
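The check itself is simple to sketch: draw parameters from the stated priors and look at the success rates they imply. A minimal version for the intercept and phi (NumPy parameterizes the Gamma by scale = 1/rate, so Gamma(2, rate 0.1) becomes shape 2, scale 10):

```python
import numpy as np

rng = np.random.default_rng(2026)
S = 2000

# Draws from the stated priors: Normal(0, 2.5^2) intercept, Gamma(2, rate 0.1) phi.
beta0 = rng.normal(0.0, 2.5, size=S)
phi = rng.gamma(shape=2.0, scale=10.0, size=S)      # prior mean = 2 / 0.1 = 20

prior_mean_success = 1.0 / (1.0 + np.exp(-beta0))   # implied mean success rate

lo, hi = np.quantile(prior_mean_success, [0.025, 0.975])
covers_observed = bool(lo < 0.86 < hi)              # observed mean from the EDA
```

The prior predictive interval for the mean success rate spans nearly the full unit interval, so the observed 0.86 is comfortably inside it: broad enough to let the data dominate, as the figure above shows.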


Computation and Recovery

All models were fitted with MCMC in JAGS. M1 and M2 converged easily. M3 required a special parameterization to resolve mixing problems.

MCMC Settings

All three Bayesian models were fitted in JAGS (rjags) on the same locked 1,862-row dataset.

Setting M1 M2 M3
Chains 4 4 4 (parallel)
Adaptation 1,000 1,000 1,000
Burn-in 4,000 4,000 8,000
Post-burn-in 8,000 8,000 10,000
Thinning 1 1 10
Global seed 2026 2026 2026

M3 Fitting Challenge

M1 and M2 converged with standard parameterizations. M3 was difficult.

The problem in plain language: In M3, the intercept (\(\beta_0\)), region effects (\(\gamma_r\)), and the average of country effects within each region can trade off against each other. If the intercept increases, the region effects can decrease, and vice versa. This makes it hard for MCMC to explore the posterior efficiently.

Technical details: The plain centered parameterization had min ESS \(\approx\) 12 and max R-hat \(\approx\) 1.16 on \(\beta_0\) and \(\gamma\). Standard non-centered parameterizations did not resolve the issue because they only separate scale, not location.

The solution: We used a region-centered non-centered parameterization:

\[ u_c = \sigma_u \, (z_c - \bar{z}_{r(c)}), \quad z_c \sim N(0,1) \]

This forces the country effects within each region to sum to zero: \(\sum_{c \in r} u_c = 0\). This constraint separates region effects from country effects and resolves the identifiability problem.

Implementation efficiency: To compute the within-region mean quickly, we permuted country IDs to be contiguous within each region. This allows fast summation instead of dense matrix operations. Four chains ran in parallel with independent random seeds.
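The transform is a few lines once country IDs are contiguous within region. A NumPy sketch with a hypothetical nine-country, three-region layout and sigma_u set to the reported posterior value:

```python
import numpy as np

rng = np.random.default_rng(2026)

# Hypothetical layout: country IDs already permuted to be contiguous within region.
region_of = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])   # region index for each country
sigma_u = 0.72

z = rng.normal(size=region_of.size)                  # z_c ~ N(0, 1)

# Region-centered non-centered transform: u_c = sigma_u * (z_c - zbar_{r(c)})
region_mean = np.array([z[region_of == r].mean() for r in np.unique(region_of)])
u = sigma_u * (z - region_mean[region_of])

# The constraint that resolves identifiability: country effects sum to zero per region.
sums = np.array([u[region_of == r].sum() for r in np.unique(region_of)])
```

Because each region's country effects sum to zero by construction, the intercept and region effects can no longer trade off against the average country effect, which is the location confounding the plain non-centered parameterization left in place.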

MCMC Diagnostics

All three models passed the acceptance criteria: R-hat \(\leq\) 1.05 and ESS \(\geq\) 400 for key parameters.

Compact MCMC diagnostics for each model’s final fit. The Source column shows which diagnostic file was used. Max R-hat should be near 1.0. Min ESS should be at least 400. Status PASS means the fit is acceptable for inference.
Model Source Max_Rhat Min_ESS Status
M1 m1_extended_diagnostics_summary.csv 1.008 557.005 PASS
M2 m2_extended_diagnostics_summary.csv 1.005 484.523 PASS
M3 m3_regioncentered_full_diagnostics_summary.csv 1.007 407.148 PASS

Acceptance rule: Max R-hat \(\leq\) 1.05 and min ESS \(\geq\) 400 for key parameters (\(\beta_0\), \(\beta_{1:4}\), \(\gamma_{2:6}\), \(\phi\), \(\sigma_u\)).

Results:

  • M1: Passes easily with standard parameterization
  • M2: Passes easily with standard parameterization
  • M3: Passes with region-centered parameterization (min ESS = 407.1, max R-hat = 1.0072)
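The acceptance rule is mechanical and can be encoded directly. A sketch applying it to the Max_Rhat and Min_ESS values from the diagnostics table above:

```python
# Acceptance rule from the text: max R-hat <= 1.05 and min ESS >= 400.
def mcmc_pass(max_rhat: float, min_ess: float) -> bool:
    return max_rhat <= 1.05 and min_ess >= 400

# Values from the diagnostics table for each model's final fit.
status = {
    "M1": mcmc_pass(1.008, 557.005),
    "M2": mcmc_pass(1.005, 484.523),
    "M3": mcmc_pass(1.007, 407.148),
}
```

The same rule rejects the original centered M3 run (min ESS around 12, max R-hat around 1.16), which is what forced the reparameterization.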

Figure: M1 trace and density diagnostic for a selected monitored parameter. The trace plot shows that the MCMC samples fluctuate around a stable level without a visible trend, which is what we expect from a well-mixing chain. The density plot summarizes the posterior distribution of the parameter after sampling. Together with the R-hat and ESS results above, this supports acceptable convergence for M1.

Figure: M2 trace and density diagnostics for selected regression and overdispersion parameters. The trace plots show stable sampling with no clear drift over iterations, and the density plots show smooth posterior distributions. This indicates that the beta-binomial model mixed well and that its posterior samples can be used for inference.

Figure: M3 trace and density diagnostics for selected parameters after using the region-centered parameterization. The trace plots show stable sampling for the region effect, overdispersion parameter, and country-level heterogeneity parameter. The density plots are smooth and concentrated, suggesting that the final M3 fit mixes well. This supports the conclusion that the reparameterized hierarchical model is reliable for posterior inference.

Overall, the trace and density plots show stable sampling behavior for the monitored parameters. The visual diagnostics agree with the numerical diagnostics: all final models satisfy the convergence rule of R-hat at or below 1.05 and ESS above 400 for the key monitored parameters. Therefore, the posterior samples are acceptable for the model comparison and substantive interpretation.


Parameter Recovery

We simulated data from each model and refitted it. The 95% credible intervals captured the true values about 93-97% of the time, supporting good calibration under the simulated recovery design.

What is Parameter Recovery?

Parameter recovery tests whether our Bayesian procedure can correctly estimate parameters when we know the true values. We:

  1. Take the posterior mean parameters from the real-data fit
  2. Simulate new response data \(Y\) from the model (keeping the design fixed)
  3. Refit the model to the simulated data
  4. Check whether the new posterior intervals contain the true values

If coverage is near 95%, the inference is well-calibrated.
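The four steps above can be sketched in miniature for a single success probability. Here the conjugate Beta posterior stands in for MCMC so the loop runs quickly; the real M1-M3 study runs the same loop with JAGS refits:

```python
import numpy as np

rng = np.random.default_rng(2026)

true_p = 0.86                  # step 1: treat a fitted value as the "truth"
n_cohort, n_reps = 500, 200
covered = 0
for _ in range(n_reps):
    y = rng.binomial(n_cohort, true_p)                    # step 2: simulate data
    post = rng.beta(1 + y, 1 + n_cohort - y, size=4000)   # step 3: "refit" (conjugate draws)
    lo, hi = np.quantile(post, [0.025, 0.975])
    covered += int(lo <= true_p <= hi)                    # step 4: does the interval cover?

coverage = covered / n_reps    # should sit near the nominal 0.95
```

Coverage near 0.95 over many replicates is the calibration signature the 30/30/10 study checks for each parameter of M1, M2, and M3.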

Executed Design

Recovery replicates completed. The original target was 50 replicates per model. Due to computational cost (especially for M3), we ran 30, 30, and 10 replicates for M1, M2, and M3. This is a practical accommodation, not a methodological failure. All executed replicates converged.
model total_reps n_ok mean_runtime_s total_runtime_s failure_rate
M1 30 30 140.697 4220.897 0
M2 30 30 3084.431 92532.918 0
M3 10 10 5196.993 51969.931 0
Recovery summary. The ‘executed’ column shows successful/total replicates. Mean coverage shows how often 95 percent credible intervals contain the true value. Values near 0.95 indicate good calibration.
model executed failure_rate_pct mean_coverage_95
M1 30/30 0% 0.930
M2 30/30 0% 0.958
M3 10/10 0% 0.975

Full Recovery Results

Recovery performance by parameter. Bias shows systematic over/under-estimation. RMSE combines bias and variance. Coverage 95 shows how often intervals contain the true value.
model parameter true_value n_reps mean_estimate bias abs_bias rmse coverage_95 mean_rhat min_ess
M1 beta0 2.048 30 2.048 0.000 0.001 0.001 0.933 1.011 112.728
M1 beta[1] 0.039 30 0.038 0.000 0.000 0.000 0.867 1.002 895.780
M1 beta[2] -0.103 30 -0.103 0.000 0.000 0.000 0.967 1.004 359.979
M1 beta[3] -0.097 30 -0.097 0.000 0.001 0.001 0.967 1.009 166.757
M1 beta[4] 0.092 30 0.092 0.000 0.000 0.000 0.933 1.002 1030.725
M1 gamma[2] -1.169 30 -1.169 0.000 0.002 0.002 0.933 1.006 178.988
M1 gamma[3] 0.439 30 0.439 0.000 0.001 0.002 0.967 1.006 208.677
M1 gamma[4] -1.110 30 -1.110 0.000 0.002 0.002 0.967 1.004 180.538
M1 gamma[5] -0.288 30 -0.289 0.000 0.001 0.001 0.900 1.011 104.031
M1 gamma[6] 0.121 30 0.120 0.000 0.001 0.002 0.867 1.010 120.294
M2 beta0 1.472 30 1.472 -0.001 0.029 0.035 0.967 1.008 176.681
M2 beta[1] 0.017 30 0.021 0.004 0.013 0.016 0.933 1.001 1702.397
M2 beta[2] 0.244 30 0.235 -0.009 0.033 0.041 0.933 1.006 252.378
M2 beta[3] -0.280 30 -0.269 0.011 0.031 0.035 1.000 1.006 187.475
M2 beta[4] -0.056 30 -0.052 0.004 0.018 0.023 0.933 1.002 866.026
M2 gamma[2] -0.399 30 -0.403 -0.004 0.037 0.049 1.000 1.005 248.227
M2 gamma[3] 0.042 30 0.038 -0.004 0.049 0.059 1.000 1.005 327.746
M2 gamma[4] -0.397 30 -0.394 0.003 0.040 0.052 0.933 1.006 207.651
M2 gamma[5] 0.129 30 0.133 0.004 0.066 0.085 0.900 1.002 507.378
M2 gamma[6] -0.092 30 -0.093 -0.001 0.038 0.050 0.967 1.004 283.578
M2 phi 10.452 30 10.457 0.005 0.244 0.316 0.967 1.001 1494.996
M3 beta0 1.768 10 1.768 0.000 0.016 0.017 1.000 1.022 66.844
M3 beta[1] -0.012 10 -0.005 0.007 0.007 0.008 1.000 1.004 609.049
M3 beta[2] 0.283 10 0.301 0.018 0.051 0.060 0.900 1.043 42.105
M3 beta[3] -0.387 10 -0.392 -0.005 0.036 0.048 0.900 1.050 48.056
M3 beta[4] 0.024 10 0.015 -0.009 0.021 0.023 1.000 1.011 180.275
M3 gamma[2] -0.652 10 -0.637 0.015 0.025 0.029 1.000 1.013 83.277
M3 gamma[3] -0.152 10 -0.158 -0.006 0.033 0.045 1.000 1.010 109.472
M3 gamma[4] -0.741 10 -0.720 0.021 0.044 0.049 1.000 1.019 79.101
M3 gamma[5] 0.062 10 0.077 0.015 0.052 0.068 0.900 1.013 314.000
M3 gamma[6] -0.303 10 -0.288 0.016 0.030 0.037 1.000 1.029 112.944
M3 phi 42.861 10 42.564 -0.296 0.981 1.251 1.000 1.002 1625.500
M3 sigma_u 0.717 10 0.677 -0.040 0.040 0.043 1.000 1.015 68.918

Figure: Recovery performance measured by 95% credible-interval coverage for each parameter in M1, M2, and M3. The dashed horizontal line marks the nominal 95% target. Bars close to this line indicate that the fitted model usually recovers the true parameter value at the expected rate. Overall, coverage is close to the target for most parameters, which supports good calibration under the recovery design.

Recovery Conclusions

  • M1: Mean coverage 0.93 (10 parameters)
  • M2: Mean coverage 0.96 (11 parameters including \(\phi\))
  • M3: Mean coverage 0.97 (12 parameters including \(\phi\) and \(\sigma_u\))

All models show good calibration. Biases are small relative to posterior standard deviations. The reduced M3 design (10 replicates) provides less precision but still supports inferential credibility.


Results and Model Checking

M3 is preferred among the three fitted models. It has the lowest DIC and the best posterior predictive performance among the three models, though some residual miscalibration remains.

Model Comparison

Which Model Wins?

M3 (hierarchical beta-binomial) is preferred. The evidence is decisive.

DIC comparison. Lower DIC is better. M3 has the lowest DIC by a large margin. The difference of 2,220 between M3 and M2 is strong evidence in favor of M3.
Model D_bar D_theta_bar p_D DIC Delta_DIC Rank Evidence
M1 2666291.11 2666281.03 10.074 2666301.18 2641361.130 3 Strong evidence against
M2 27149.69 27138.85 10.844 27160.54 2220.488 2 Strong evidence against
M3 24761.35 24582.64 178.703 24940.05 0.000 1 Best model

How to interpret DIC differences:

ΔDIC Evidence
> 10 Strong evidence for lower-DIC model
5-10 Moderate evidence
< 5 Interpret cautiously

Here, ΔDIC between M3 and M2 is 2,220, which is overwhelming evidence. The gap between M2 and M1 is about 2.6 million: M1 is severely inadequate for these data.
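The quantities in the comparison table are linked by two identities: p_D = D_bar - D(theta_bar) and DIC = D_bar + p_D. A sketch reproducing the M3 row from its deviance summaries (table values are rounded, so tiny discrepancies remain):

```python
# DIC from posterior deviance summaries.
def dic_from_deviance(d_bar: float, d_theta_bar: float) -> tuple[float, float]:
    """Return (DIC, p_D) where p_D = D_bar - D(theta_bar), DIC = D_bar + p_D."""
    p_d = d_bar - d_theta_bar
    return d_bar + p_d, p_d

# M3 row of the comparison table: D_bar = 24761.35, D(theta_bar) = 24582.64.
dic_m3, p_d_m3 = dic_from_deviance(24761.35, 24582.64)
delta_m2_m3 = 27160.54 - dic_m3   # the reported gap of about 2,220
```

Note how M3's effective parameter count (p_D around 179) dwarfs M1's and M2's (around 10 each): the country random effects add many effective parameters, yet DIC still falls sharply, so the added flexibility is earning its keep.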


Posterior Predictive Checks

Posterior predictive checks (PPC) test whether the model can generate data that looks like what we observed. M3 calibrates the mean well and approximately captures the variance, while lower-tail and within-region variance miscalibration remains. M1 fails badly.

How to Read This Table

What are Bayesian p-values? For each test quantity, the Bayesian p-value reports where the observed statistic falls relative to the posterior predictive distribution. Values near 0.5 indicate good calibration; values near 0 or 1 indicate systematic mismatch.

  • p near 0.5: The model reproduces this feature well
  • p near 0 or 1: The model systematically over- or under-predicts this feature

Important: These are not frequentist p-values. We are not testing a null hypothesis. We are checking calibration.
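Operationally, each Bayesian p-value is just the share of posterior-predictive replicated statistics at or above the observed one. A sketch with synthetic stand-in draws (the replicated values here are invented, not M3 output):

```python
import numpy as np

# Posterior predictive p-value: P(T(Y_rep) >= T(Y_obs)) estimated over draws.
def bayes_p(observed: float, replicated: np.ndarray) -> float:
    return float(np.mean(np.asarray(replicated) >= observed))

rng = np.random.default_rng(2026)
reps = rng.normal(0.019, 0.001, size=4000)   # hypothetical replicated variances
p = bayes_p(0.020, reps)                     # observed statistic in the upper tail
```

Under this convention, p near 1 means the replicated statistic almost always exceeds the observed one (the model over-predicts the feature), p near 0 means it almost never does, and p near 0.5 means the observed value sits in the middle of the predictive distribution.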

Posterior predictive check results. T1a/T1b test central tendency. T2 tests variance. T3 tests the lower tail. T4a/T4b test within-region variance. Values near 0.5 indicate good calibration.
model test_quantity observed rep_mean rep_sd p_value
M1 T1a: Unweighted mean success 0.791 0.821 0.000 1.000
M1 T1b: Cohort-weighted aggregate 0.856 0.856 0.000 0.498
M1 T2: Variance of success rates 0.020 0.006 0.000 0.000
M1 T3: Count below 0.70 309.000 50.370 4.965 0.000
M1 T4a: Within-region var (equal wt) 0.016 0.001 0.000 0.000
M1 T4b: Within-region var (size wt) 0.018 0.001 0.000 0.000
M2 T1a: Unweighted mean success 0.791 0.783 0.004 0.022
M2 T1b: Cohort-weighted aggregate 0.856 0.828 0.012 0.005
M2 T2: Variance of success rates 0.020 0.017 0.001 0.000
M2 T3: Count below 0.70 309.000 455.888 22.794 1.000
M2 T4a: Within-region var (equal wt) 0.016 0.015 0.001 0.060
M2 T4b: Within-region var (size wt) 0.018 0.015 0.001 0.000
M3 T1a: Unweighted mean success 0.791 0.788 0.002 0.061
M3 T1b: Cohort-weighted aggregate 0.856 0.852 0.008 0.321
M3 T2: Variance of success rates 0.020 0.019 0.001 0.042
M3 T3: Count below 0.70 309.000 360.075 13.410 1.000
M3 T4a: Within-region var (equal wt) 0.016 0.015 0.001 0.004
M3 T4b: Within-region var (size wt) 0.018 0.017 0.001 0.019

PPC Test Statistics

The following posterior predictive plots compare the observed test statistics with their posterior predictive distributions for each model. They make the model comparison visually clear: M1 shows severe misfit, M2 improves the fit but remains imperfect, and M3 provides the best overall calibration among the three models.

Figure: Posterior predictive checks for M1. In each panel, the blue histogram and curve show the model’s replicated values, while the red vertical line shows the observed value from the real data. M1 matches the cohort-weighted mean reasonably well, but it clearly misses the variance, the number of low-success observations, and the within-region variability. This shows that the simple binomial model is too rigid for these data.

Figure: Posterior predictive checks for M2. Compared with M1, the beta-binomial model produces replicated data that are closer to the observed data for several test statistics, especially the overall spread. However, some mismatch remains, particularly for the lower-tail count and, to a lesser extent, the mean and within-region variability. This indicates that adding overdispersion improves fit, but does not fully capture the structure in the data.

Figure: Posterior predictive checks for M3. The replicated distributions are generally closer to the observed values than in M1 and M2, especially for the cohort-weighted mean and the overall variance. This makes M3 the best-fitting model among the three. Even so, the figure still shows some remaining mismatch for the lower-tail count and within-region variance, so the fit is improved but not perfect.

PPC Results by Model

M1 (Binomial): Severe misfit

  • Variance test (T2): p ≈ 0 - the binomial predicts far too little variance
  • Lower-tail test (T3): p ≈ 0 - it predicts far fewer low-success country-years
  • M1 cannot reproduce the observed data features

M2 (Beta-Binomial): Better but imperfect

  • Variance test (T2): Improved
  • Mean test (T1b): p = 0.005 - underpredicts the cohort-weighted mean (observed 0.856 vs replicated mean 0.828)
  • Lower-tail test (T3): Overshoots

M3 (Hierarchical): Best among the three

  • Mean test (T1b): p = 0.32 — well calibrated
  • Variance test (T2): p = 0.04 — much improved relative to M1 and M2, but the observed variance still falls in the upper tail of the predictive distribution rather than near its center
  • Lower-tail test (T3): p = 1.0 — M3 still tends to predict too many country-years below 0.70 success
  • Within-region tests (T4a/T4b): Some residual issues

Conclusion: M3 is the best model, but not perfect. Remaining lower-tail and within-region variance miscalibration are targets for future model extensions.

Cohort Calibration

The cohort calibration plots check whether each model behaves consistently across different cohort sizes. This is important because the response is modeled as counts, and country-years with larger cohorts should strongly influence the likelihood. These plots provide an additional visual check that complements the PPC test statistics above.

Figure: M1 cohort calibration by cohort size. Each point compares the observed success rate with the mean posterior predictive success rate for one country-year, separately for large and small cohorts. The dashed diagonal line represents perfect calibration: points close to the line mean predicted and observed values agree. Many points are far from the line, showing that the binomial model does not reproduce the observed variability well, especially across cohort-size groups.

Figure: M2 cohort calibration by cohort size. The beta-binomial model allows extra variability, so the points move closer to the diagonal line compared with M1. This reflects improved calibration: the extra dispersion parameter makes M2 less rigid than the simple binomial model. However, several points still deviate from the line, meaning that overdispersion alone does not fully explain the observed country-year differences.

Figure: M3 cohort calibration by cohort size. The hierarchical beta-binomial model gives the closest agreement between observed and predicted success rates, with many points lying near the diagonal line. This indicates that adding country random effects improves calibration for both large and small cohorts. Some dispersion remains, especially among smaller cohorts, but M3 provides the best cohort-level calibration among the three models.

M3 Posterior Predictive Density Overlay

This density overlay compares the observed distribution of treatment success with replicated datasets generated from the posterior predictive distribution of M3. It provides an intuitive final visual check of whether the preferred model reproduces the overall outcome distribution.

Figure: M3 posterior predictive density overlay. The dark red curve shows the observed distribution of treatment success rates, while the light blue curves show 100 datasets simulated from the M3 posterior predictive distribution. The model reproduces the main high-success peak reasonably well, but the simulated curves are slightly more spread out than the observed curve. This supports M3 as a good overall fit, while still showing some remaining mismatch in the lower-success part of the distribution.

PPC Visual: Variance Recovery Across Models

Figure: Variance recovery across models. The red vertical line shows the observed variance of treatment success rates, and each colored density shows the variance generated by posterior predictive simulations from one model. M1 is far to the left, meaning it strongly underestimates the observed variability. M2 and M3 are much closer to the observed variance, with M3 closest overall, although the observed variance is still near the upper edge of the M3 predictive distribution.


M3 Posterior Inference

Under M3, mortality is negatively associated with success, incidence is positively associated (after controlling for country effects), and there is substantial country-level heterogeneity.

M3 posterior summaries. Mean is the posterior mean. CrI = credible interval (Bayesian). Read signs and intervals together to interpret effects.
parameter mean median sd 95% CrI lower 95% CrI upper hpd_lower hpd_upper model label
beta0 1.768 1.767 0.034 1.702 1.835 1.702 1.835 M3 (Hierarchical) Intercept
beta[1] -0.012 -0.012 0.010 -0.032 0.009 -0.032 0.008 M3 (Hierarchical) Year (standardized)
beta[2] 0.283 0.284 0.053 0.180 0.385 0.179 0.385 M3 (Hierarchical) Incidence (standardized)
beta[3] -0.387 -0.387 0.044 -0.473 -0.300 -0.474 -0.301 M3 (Hierarchical) Mortality (standardized)
beta[4] 0.024 0.024 0.024 -0.022 0.070 -0.021 0.071 M3 (Hierarchical) Case Detection (standardized)
gamma[2] -0.652 -0.652 0.054 -0.756 -0.546 -0.757 -0.548 M3 (Hierarchical) AMR
gamma[3] -0.152 -0.151 0.048 -0.245 -0.058 -0.245 -0.058 M3 (Hierarchical) EMR
gamma[4] -0.741 -0.741 0.057 -0.853 -0.629 -0.854 -0.630 M3 (Hierarchical) EUR
gamma[5] 0.062 0.061 0.054 -0.043 0.168 -0.043 0.167 M3 (Hierarchical) SEA
gamma[6] -0.303 -0.303 0.048 -0.397 -0.209 -0.395 -0.208 M3 (Hierarchical) WPR
phi 42.861 42.842 1.690 39.621 46.258 39.555 46.169 M3 (Hierarchical) Overdispersion (phi)
sigma_u 0.717 0.715 0.041 0.639 0.804 0.634 0.798 M3 (Hierarchical) Country RE SD (sigma_u)

Key Findings

Intercept (\(\beta_0\) = 1.77): The baseline log-odds corresponds to roughly 85% success probability at reference predictor values.
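The conversion from log-odds to probability is the inverse-logit; a short Python sketch (illustrative) confirms the roughly 85% figure:

```python
import math

def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

p0 = inv_logit(1.768)  # baseline success probability, approximately 0.854
```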

Mortality (\(\beta\) = −0.39, 95% CrI: −0.47 to −0.30): Higher mortality is associated with lower treatment success. The entire posterior is below zero (P < 0 = 1.000).

Incidence (\(\beta\) = +0.28, 95% CrI: 0.18 to 0.39): Higher incidence is associated with higher success after controlling for country effects. This sign reversal suggests strong between-country confounding. Persistent low-performing countries tend to have high incidence; after country-level differences are absorbed, the conditional incidence association changes direction. P > 0 = 1.000.

Year and case detection: Both posteriors are concentrated near zero, and both 95% credible intervals include zero; neither effect is clearly distinguishable from zero.

Overdispersion (\(\phi\) = 42.9, 95% CrI: 39.6-46.3): As \(\phi \to \infty\) the beta-binomial reduces to the binomial, so a finite, precisely estimated \(\phi\) confirms overdispersion is present: the data have more variance than binomial sampling alone would predict.

Country heterogeneity (\(\sigma_u\) = 0.72, 95% CrI: 0.64-0.80): Substantial between-country variation on the logit scale. Countries differ persistently in treatment success beyond what covariates explain.
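To see how much extra variance a \(\phi\) of 42.9 implies, a sketch assuming the common mean-precision parameterization of the beta-binomial (\(\alpha = \phi p\), \(\beta = \phi(1-p)\), so \(\phi = \alpha + \beta\)) can be useful; the exact parameterization used here lives in the JAGS model files, so treat this as illustrative:

```python
# Under the mean-precision beta-binomial, Var(Y) = n p (1-p) (phi + n)/(phi + 1).
# The last factor is the inflation over plain binomial sampling.
def variance_inflation(n, phi):
    return (phi + n) / (phi + 1)

# For a cohort of n = 500 with phi = 42.9 the variance is inflated ~12x
# relative to a binomial with the same mean (n and p are hypothetical here):
f = variance_inflation(500, 42.9)
```

This is why M1's predictive variance is far too small: for large cohorts, binomial sampling noise is an order of magnitude smaller than what the data show.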


Country Random Effects

M3 estimates a random effect for each country. Countries with positive effects have persistently higher success than expected; those with negative effects have persistently lower success.

Figure: Country random effects from the M3 hierarchical model. Each point is the posterior mean country effect, and each horizontal line is its 95% credible interval on the logit scale. Countries to the right of zero have higher treatment success than expected after adjusting for region and predictors, while countries to the left of zero have lower success than expected. This plot shows persistent country-level differences, but it should be read as descriptive benchmarking, not as a causal ranking.

Highest and Lowest Country Effects

Countries with the highest positive random effects. These countries have persistently higher treatment success than expected after adjusting for region and covariates. CrI = credible interval.
iso3 mean 95% CrI lower 95% CrI upper g_whoregion
MLT 1.753 1.044 2.550 EUR
SLB 1.069 0.776 1.386 WPR
MNE 1.028 0.701 1.384 EUR
CHN 1.023 0.727 1.351 WPR
TJK 0.997 0.741 1.279 EUR
SVK 0.979 0.718 1.266 EUR
COD 0.962 0.633 1.326 AFR
SLV 0.914 0.663 1.183 AMR
TZA 0.902 0.634 1.199 AFR
KHM 0.899 0.598 1.233 WPR
Countries with the lowest (most negative) random effects. These countries have persistently lower treatment success than expected after adjusting for region and covariates. CrI = credible interval.
iso3 mean 95% CrI lower 95% CrI upper g_whoregion
HRV -1.383 -1.569 -1.197 EUR
JAM -1.461 -1.693 -1.231 AMR
BHR -1.478 -1.747 -1.208 EMR
AGO -1.488 -1.679 -1.297 AFR
DNK -1.509 -1.701 -1.319 EUR
IRL -1.741 -1.944 -1.545 EUR
NCL -1.860 -2.558 -1.161 WPR
FIN -1.926 -2.138 -1.719 EUR
FRA -2.003 -2.238 -1.781 EUR
GRC -3.633 -4.396 -2.958 EUR

Interpretation: These rankings are adjusted for region and covariates. They describe which countries over- or under-perform relative to their predicted baseline. They do not identify causes of good or poor performance.


Robustness Checks

Completed robustness checks provide supplementary support for the main findings. Two prior-sensitivity arms were specified but not executed, so they are not used as evidence for the main conclusion.

Summary

Check Status Main Conclusion
Cohort Threshold Completed (frequentist) Coefficient signs stable
TB-HIV Predictor Completed (frequentist) TB-HIV negative; main effects stable
Post-2021 Definitions Completed (frequentist) Main signs preserved
Phi Prior Specification only Not executed
Sigma Prior Specification only Not executed
Frequentist M1/M2 Completed Coefficients agree with Bayesian
M3 GLMM Failed Matrix/lme4 binary incompatibility

Three sensitivity checks were completed using frequentist refits. The main Bayesian conclusion remains based on the locked dataset and standard priors.

The M3 GLMM failure is due to a known software issue (function 'cholmod_factor_ldetA' not provided by package 'Matrix'), not a model specification problem. Since frequentist checks are supplementary, this is a non-fatal limitation.


Discussion

M3 best answers our research question. Binomial variability alone is insufficient; we need both overdispersion and persistent country effects.

Key Interpretations

Overdispersion (\(\phi = 42.9\)): Treatment success varies more than binomial sampling alone can explain. This extra variation is real and must be modeled.

Country heterogeneity (\(\sigma_u = 0.72\)): Countries have persistent performance differences that covariates and region cannot explain. This is substantively important: some countries consistently outperform or underperform their predicted baseline.

Incidence sign reversal: The incidence coefficient flips from negative in M1 to positive in M3, a classic signature of between-country confounding. Persistently low-performing countries tend to have high incidence; once country-level differences are absorbed by the random effects, the conditional incidence association changes direction.
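This mechanism can be reproduced with a toy simulation (illustrative Python, not the analysis code): two synthetic "countries" where the low-performing one has high incidence, but within each country higher incidence goes with higher success.

```python
import numpy as np

rng = np.random.default_rng(2026)

# Two synthetic "countries": country 1 performs worse AND has higher incidence
# (confounding), while the within-country incidence effect is +0.3.
n = 200
country = np.repeat([0, 1], n)
base    = np.where(country == 0, 2.0, 0.0)                 # country baseline
incid   = np.where(country == 0, 0.0, 2.0) + rng.normal(0, 0.5, 2 * n)
success = base + 0.3 * incid + rng.normal(0, 0.2, 2 * n)   # outcome on a latent scale

# Pooled slope ignores country structure: the confounding dominates
pooled_slope = np.polyfit(incid, success, 1)[0]            # negative

# Average within-country slope recovers the conditional association
within_slope = np.mean([np.polyfit(incid[country == c],
                                   success[country == c], 1)[0]
                        for c in (0, 1)])                  # near +0.3
```

The pooled fit (analogous to M1, which has no country effects) picks up the negative between-country gradient; conditioning on country (as M3's random effects do) recovers the positive within-country association.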

Limitations

  • Ecological fallacy: Country-level data cannot prove individual-level causal claims
  • Association, not causation: All results are associative
  • Residual miscalibration: M3 still misfits the lower tail (T3 p = 1.0) and within-region variance
  • Reporting heterogeneity: Countries differ in data quality
  • Missingness: Filtered observations may not be missing at random
  • Outcome-definition changes: The shift in WHO treatment-outcome definitions over time may introduce discontinuities. The rel_with_new_flg == 1 filter improves comparability, and the post-2021 definition check is used as a sensitivity analysis, but definition changes cannot be ruled out completely.

Future Work

Extensions could include region-varying overdispersion, temporal random effects, mixture models for the lower tail, and out-of-sample prediction.


Conclusion

The hierarchical beta-binomial model (M3) is preferred among the three fitted models for modeling country-year TB treatment success. It captures both overdispersion (\(\phi = 42.9\)) and persistent country heterogeneity (\(\sigma_u = 0.72\)), with the lowest DIC (24,940) and the best posterior predictive performance among the three models.

The evidence is decisive: M1 (binomial) severely underestimates variability (DIC = 2.6 million), and M2 (beta-binomial) improves fit but still lacks country structure (DIC = 27,160). M3’s additional complexity is justified by substantially better fit.

Limitations and future work: M3 is the best among these three models, not a perfect model. Some lower-tail and within-region variance miscalibration remains. Future extensions could include region-varying overdispersion, temporal random effects, or mixture models to better capture the lower tail.


Reproducibility

This analysis can be reproduced by running src/main.R.

Data Provenance

Item Value
Source World Health Organization Global TB Programme
Download date 2026-04-04
Year window 2012-2023
Final sample 1,862 country-years, 180 countries
Cohort threshold cohort >= 50
Global seed 2026

File Structure

Location Contents
src/main.R Main analysis script
src/models/ JAGS model files
src/outputs/tables/ CSV tables
src/outputs/figures/ Figures
data/data_processed/ Locked analysis table

How to Reproduce

Regenerate the HTML report:

Rscript -e "rmarkdown::render('src/report/report.Rmd', output_format = 'html_document', output_file = 'Arman_Feili_FSL2_Final_Report.html', output_dir = 'src/report')"

Run the full analysis:

Rscript src/main.R

Software

  • R: R version 4.3.1 (2023-06-16)
  • Platform: x86_64-apple-darwin20 (64-bit)
  • JAGS: 4.3.2
  • rjags: 4.17
  • coda: 0.19.4.1
  • ggplot2: 3.5.2
  • dplyr: 1.1.4
  • knitr: 1.50
  • rmarkdown: 2.29

References

  1. World Health Organization. Global Tuberculosis Programme data. Downloaded 2026-04-04. Available at: https://www.who.int/teams/global-tuberculosis-programme/data

  2. Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 2003.

  3. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 2002; 64(4):583-639.

  4. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis, 3rd ed. Chapman and Hall/CRC, 2013.