Posterior Predictive Checks
Posterior predictive checks (PPCs) assess whether a model can generate
data that look like the data we observed. M3 calibrates the
cohort-weighted mean well and approximately captures the overall
variance, but miscalibration remains in the lower tail and in
within-region variance. M1 fails every check except the cohort-weighted
mean.
How to Read This Table
What are Bayesian p-values? For each test quantity,
the Bayesian p-value reports where the observed statistic falls relative
to the posterior predictive distribution. Values near 0.5 indicate good
calibration; values near 0 or 1 indicate systematic mismatch.
- p near 0.5: The model reproduces this feature well
- p near 0 or 1: The model systematically over- or under-predicts this
feature
Important: These are not frequentist p-values. We
are not testing a null hypothesis. We are checking calibration.
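As a minimal sketch of how such a value could be computed: given one replicated value of a test quantity per posterior draw, the Bayesian p-value is the share of replicated values at least as large as the observed one. The direction of the inequality is an assumption inferred from the results table (a model that over-predicts a quantity shows p near 1); the function name and the simulated inputs are hypothetical.

```python
import numpy as np

def bayesian_p_value(t_rep, t_obs):
    """Share of replicated statistics that are >= the observed statistic.

    ASSUMPTION: p = P(T(y_rep) >= T(y_obs)); the source does not state
    the direction explicitly.
    """
    t_rep = np.asarray(t_rep, dtype=float)
    return float(np.mean(t_rep >= t_obs))

# Hypothetical replicated values of one test quantity (one per posterior draw).
rng = np.random.default_rng(0)
t_rep_centered = rng.normal(loc=0.85, scale=0.01, size=4000)
t_rep_too_high = rng.normal(loc=0.90, scale=0.01, size=4000)

# Observed value sits in the middle of the replicated distribution.
p_centered = bayesian_p_value(t_rep_centered, t_obs=0.85)
# Replicated values are almost all above the observed value.
p_too_high = bayesian_p_value(t_rep_too_high, t_obs=0.85)

print(f"centered model:        p = {p_centered:.3f}")
print(f"over-predicting model: p = {p_too_high:.3f}")
```

A well-calibrated quantity gives a value near 0.5, while a model whose replicated values sit almost entirely above (or below) the observed value gives a value near 1 (or 0).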
Posterior predictive check results. T1a/T1b test central tendency. T2
tests variance. T3 tests the lower tail. T4a/T4b test within-region
variance. Values near 0.5 indicate good calibration.
| model | test_quantity | observed | rep_mean | rep_sd | p_value |
|---|---|---|---|---|---|
| M1 | T1a: Unweighted mean success | 0.791 | 0.821 | 0.000 | 1.000 |
| M1 | T1b: Cohort-weighted aggregate | 0.856 | 0.856 | 0.000 | 0.498 |
| M1 | T2: Variance of success rates | 0.020 | 0.006 | 0.000 | 0.000 |
| M1 | T3: Count below 0.70 | 309.000 | 50.370 | 4.965 | 0.000 |
| M1 | T4a: Within-region var (equal wt) | 0.016 | 0.001 | 0.000 | 0.000 |
| M1 | T4b: Within-region var (size wt) | 0.018 | 0.001 | 0.000 | 0.000 |
| M2 | T1a: Unweighted mean success | 0.791 | 0.783 | 0.004 | 0.022 |
| M2 | T1b: Cohort-weighted aggregate | 0.856 | 0.828 | 0.012 | 0.005 |
| M2 | T2: Variance of success rates | 0.020 | 0.017 | 0.001 | 0.000 |
| M2 | T3: Count below 0.70 | 309.000 | 455.888 | 22.794 | 1.000 |
| M2 | T4a: Within-region var (equal wt) | 0.016 | 0.015 | 0.001 | 0.060 |
| M2 | T4b: Within-region var (size wt) | 0.018 | 0.015 | 0.001 | 0.000 |
| M3 | T1a: Unweighted mean success | 0.791 | 0.788 | 0.002 | 0.061 |
| M3 | T1b: Cohort-weighted aggregate | 0.856 | 0.852 | 0.008 | 0.321 |
| M3 | T2: Variance of success rates | 0.020 | 0.019 | 0.001 | 0.042 |
| M3 | T3: Count below 0.70 | 309.000 | 360.075 | 13.410 | 1.000 |
| M3 | T4a: Within-region var (equal wt) | 0.016 | 0.015 | 0.001 | 0.004 |
| M3 | T4b: Within-region var (size wt) | 0.018 | 0.017 | 0.001 | 0.019 |
PPC Test Statistics
The following posterior predictive plots compare the observed test
statistics with their posterior predictive distributions for each model.
They make the model comparison visually clear: M1 shows severe misfit,
M2 improves the fit but remains imperfect, and M3 provides the best
overall calibration among the three models.
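The test quantities compared in these checks could be computed along the following lines. This is a sketch under stated assumptions, not the report's actual code: the function and argument names are hypothetical, the sample variance (ddof=1) is assumed, and T4a/T4b are read as the average of per-region variances weighted equally or by the number of country-years in each region, which the table labels suggest but do not confirm.

```python
import numpy as np

def compute_test_quantities(successes, cohort, region, threshold=0.70):
    """Compute T1a-T4b for one dataset (observed or replicated)."""
    successes = np.asarray(successes, dtype=float)
    cohort = np.asarray(cohort, dtype=float)
    region = np.asarray(region)
    rate = successes / cohort  # success rate per country-year

    # Variance of rates inside each region (needs >= 2 rows per region).
    regions = np.unique(region)
    region_var = np.array([rate[region == r].var(ddof=1) for r in regions])
    region_size = np.array([np.sum(region == r) for r in regions])

    return {
        "T1a": rate.mean(),                      # unweighted mean success
        "T1b": successes.sum() / cohort.sum(),   # cohort-weighted aggregate
        "T2": rate.var(ddof=1),                  # variance of success rates
        "T3": int(np.sum(rate < threshold)),     # count below threshold
        "T4a": region_var.mean(),                # ASSUMED: equal region weights
        "T4b": np.average(region_var, weights=region_size),  # ASSUMED: size weights
    }

# Hypothetical toy data: five country-years in two regions.
q = compute_test_quantities(
    successes=[80, 45, 90, 30, 70],
    cohort=[100, 50, 100, 60, 100],
    region=["A", "A", "B", "B", "B"],
)
for name, value in q.items():
    print(name, round(float(value), 4))
```

Applying the same function to the observed data and to each replicated dataset yields the observed value and the replicated distribution that each panel compares.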

Figure: Posterior predictive checks for M1. In each panel, the
blue histogram and curve show the model’s replicated values, while the
red vertical line shows the observed value from the real data. M1
matches the cohort-weighted mean reasonably well, but it clearly misses
the variance, the number of low-success observations, and the
within-region variability. This shows that the simple binomial model is
too rigid for these data.

Figure: Posterior predictive checks for M2. Compared with M1, the
beta-binomial model produces replicated data that are closer to the
observed data for several test statistics, especially the overall
spread. However, some mismatch remains, particularly for the lower-tail
count and, to a lesser extent, the mean and within-region variability.
This indicates that adding overdispersion improves fit, but does not
fully capture the structure in the data.

Figure: Posterior predictive checks for M3. The replicated
distributions are generally closer to the observed values than in M1 and
M2, especially for the cohort-weighted mean and the overall variance.
This makes M3 the best-fitting model among the three. Even so, the
figure still shows some remaining mismatch for the lower-tail count and
within-region variance, so the fit is improved but not perfect.
PPC Results by Model
M1 (Binomial): Severe misfit
- Variance test (T2): p ≈ 0 - the binomial predicts far too little
variance (replicated mean 0.006 vs. observed 0.020)
- Lower-tail test (T3): p ≈ 0 - it predicts far fewer low-success
country-years than observed (about 50 vs. 309)
- M1 reproduces only the cohort-weighted mean (T1b, p = 0.50)
M2 (Beta-Binomial): Better but imperfect
- Variance test (T2): The replicated variance moves much closer to the
observed value (0.017 vs. 0.020), but p is still ≈ 0
- Mean test (T1b): p = 0.005 - underpredicts the cohort-weighted mean
(0.828 vs. 0.856)
- Lower-tail test (T3): p = 1.0 - overshoots, predicting about 456
country-years below 0.70 against 309 observed
M3 (Hierarchical): Best among the three
- Mean test (T1b): p = 0.32 - well calibrated
- Variance test (T2): p = 0.04 - much improved relative to M1 and M2,
but the observed variance still sits in the upper tail of the predictive
distribution rather than near its center
- Lower-tail test (T3): p = 1.0 - M3 still predicts too many
country-years below 0.70 success (about 360 vs. 309 observed)
- Within-region tests (T4a/T4b): p = 0.004 and p = 0.019 - within-region
variance is still slightly underpredicted
Conclusion: M3 is the best of the three models, but it is not perfect.
The remaining lower-tail and within-region variance miscalibration is a
target for future model extensions.
Cohort Calibration
The cohort calibration plots check whether each model behaves
consistently across different cohort sizes. This matters because the
response is modeled as counts, so country-years with larger cohorts
influence the likelihood more strongly than those with small cohorts.
These plots provide an additional visual check that complements the PPC
test statistics above.
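The comparison behind a cohort calibration plot can be sketched as follows: average the replicated success rate over posterior draws for each country-year, then compare it with the observed rate separately for small and large cohorts. Splitting at the median cohort size and summarizing each group by mean absolute error are assumptions made for illustration; the report does not say how the groups were defined, and all names and data below are hypothetical.

```python
import numpy as np

def calibration_by_cohort_size(successes, cohort, y_rep):
    """Observed vs. mean replicated rate, split at the median cohort size."""
    successes = np.asarray(successes, dtype=float)
    cohort = np.asarray(cohort, dtype=float)
    y_rep = np.asarray(y_rep, dtype=float)      # shape: (n_draws, n_obs)

    observed = successes / cohort
    predicted = (y_rep / cohort).mean(axis=0)   # posterior predictive mean rate
    is_large = cohort > np.median(cohort)       # ASSUMED grouping rule

    # Mean absolute gap from the diagonal (observed == predicted) per group.
    summary = {}
    for label, mask in (("small", ~is_large), ("large", is_large)):
        summary[label] = float(np.mean(np.abs(observed[mask] - predicted[mask])))
    return observed, predicted, summary

# Hypothetical data: four country-years and binomial replicates at p = 0.85.
rng = np.random.default_rng(1)
cohort = np.array([20, 40, 500, 1000])
successes = np.array([10, 36, 430, 850])
y_rep = rng.binomial(n=cohort, p=0.85, size=(2000, 4))

observed, predicted, summary = calibration_by_cohort_size(successes, cohort, y_rep)
print(summary)
```

Plotting `observed` against `predicted` with a dashed diagonal reproduces the layout described in the figure captions; the per-group summary is only a numeric companion to that picture.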

Figure: M1 cohort calibration by cohort size. Each point compares
the observed success rate with the mean posterior predictive success
rate for one country-year, separately for large and small cohorts. The
dashed diagonal line represents perfect calibration: points close to the
line mean predicted and observed values agree. Many points are far from
the line, showing that the binomial model does not reproduce the
observed variability well, especially across cohort-size
groups.

Figure: M2 cohort calibration by cohort size. The beta-binomial
model allows extra variability, so the points move closer to the
diagonal line than they do under M1. Calibration improves because M2 is
less rigid than the simple binomial model. However, several points still
deviate from the line, meaning that overdispersion alone does not fully
explain the observed country-year differences.

Figure: M3 cohort calibration by cohort size. The hierarchical
beta-binomial model gives the closest agreement between observed and
predicted success rates, with many points lying near the diagonal line.
This indicates that adding country random effects improves calibration
for both large and small cohorts. Some dispersion remains, especially
among smaller cohorts, but M3 provides the best cohort-level calibration
among the three models.
M3 Posterior Predictive Density Overlay
This density overlay compares the observed distribution of treatment
success with replicated datasets generated from the posterior predictive
distribution of M3. It provides an intuitive final visual check of
whether the preferred model reproduces the overall outcome
distribution.

Figure: M3 posterior predictive density overlay. The dark red
curve shows the observed distribution of treatment success rates, while
the light blue curves show 100 datasets simulated from the M3 posterior
predictive distribution. The model reproduces the main high-success peak
reasonably well, but the simulated curves are slightly more spread out
than the observed curve. This supports M3 as a good overall fit, while
still showing some remaining mismatch in the lower-success part of the
distribution.
PPC Visual: Variance Recovery Across Models

Figure: Variance recovery across models. The red vertical line
shows the observed variance of treatment success rates, and each colored
density shows the variance generated by posterior predictive simulations
from one model. M1 is far to the left, meaning it strongly
underestimates the observed variability. M2 and M3 are much closer to
the observed variance, with M3 closest overall, although the observed
variance is still near the upper edge of the M3 predictive
distribution.
M3 Posterior Inference
Under M3, mortality is negatively associated with success, incidence
is positively associated (after controlling for country effects), and
there is substantial country-level heterogeneity.
M3 posterior summaries. Mean is the posterior mean. CrI = credible
interval (Bayesian). Read signs and intervals together to interpret
effects.
| parameter | mean | median | sd | 95% CrI lower | 95% CrI upper | hpd_lower | hpd_upper | model | label |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | 1.768 | 1.767 | 0.034 | 1.702 | 1.835 | 1.702 | 1.835 | M3 (Hierarchical) | Intercept |
| beta[1] | -0.012 | -0.012 | 0.010 | -0.032 | 0.009 | -0.032 | 0.008 | M3 (Hierarchical) | Year (standardized) |
| beta[2] | 0.283 | 0.284 | 0.053 | 0.180 | 0.385 | 0.179 | 0.385 | M3 (Hierarchical) | Incidence (standardized) |
| beta[3] | -0.387 | -0.387 | 0.044 | -0.473 | -0.300 | -0.474 | -0.301 | M3 (Hierarchical) | Mortality (standardized) |
| beta[4] | 0.024 | 0.024 | 0.024 | -0.022 | 0.070 | -0.021 | 0.071 | M3 (Hierarchical) | Case Detection (standardized) |
| gamma[2] | -0.652 | -0.652 | 0.054 | -0.756 | -0.546 | -0.757 | -0.548 | M3 (Hierarchical) | AMR |
| gamma[3] | -0.152 | -0.151 | 0.048 | -0.245 | -0.058 | -0.245 | -0.058 | M3 (Hierarchical) | EMR |
| gamma[4] | -0.741 | -0.741 | 0.057 | -0.853 | -0.629 | -0.854 | -0.630 | M3 (Hierarchical) | EUR |
| gamma[5] | 0.062 | 0.061 | 0.054 | -0.043 | 0.168 | -0.043 | 0.167 | M3 (Hierarchical) | SEA |
| gamma[6] | -0.303 | -0.303 | 0.048 | -0.397 | -0.209 | -0.395 | -0.208 | M3 (Hierarchical) | WPR |
| phi | 42.861 | 42.842 | 1.690 | 39.621 | 46.258 | 39.555 | 46.169 | M3 (Hierarchical) | Overdispersion (phi) |
| sigma_u | 0.717 | 0.715 | 0.041 | 0.639 | 0.804 | 0.634 | 0.798 | M3 (Hierarchical) | Country RE SD (sigma_u) |
Key Findings
Intercept (\(\beta_0\) = 1.77): The baseline log-odds corresponds to
roughly 85% success probability at reference predictor values.
Mortality (\(\beta_3\) = −0.39, 95% CrI: −0.47 to −0.30): Higher
mortality is associated with lower treatment success. Essentially all
posterior mass lies below zero (\(P(\beta_3 < 0) = 1.000\)).
Incidence (\(\beta_2\) = +0.28, 95% CrI: 0.18 to 0.39): Higher incidence
is associated with higher success after controlling for country effects
(\(P(\beta_2 > 0) = 1.000\)). This sign reversal suggests strong
between-country confounding. Persistent low-performing countries tend to
have high incidence; after country-level differences are absorbed, the
conditional incidence association changes direction.
Year and case detection: Both effects are weak, and their 95% credible
intervals include zero (year: −0.03 to 0.01; case detection: −0.02 to
0.07).
Overdispersion (\(\phi\) = 42.9, 95% CrI: 39.6 to 46.3): The finite
value confirms overdispersion is present. The data have more variance
than binomial sampling alone would predict.
Country heterogeneity (\(\sigma_u\) = 0.72, 95% CrI: 0.64 to 0.80):
Substantial between-country variation on the logit scale. Countries
differ persistently in treatment success beyond what covariates explain.
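The logit-scale numbers above can be translated into more familiar quantities with a few lines of arithmetic. The sketch below plugs the posterior means from the summary table into the inverse-logit and exponential transforms; this is only an approximation to transforming each posterior draw, and the variable names are hypothetical. The variance-inflation function additionally assumes a beta-binomial in which the success probability follows Beta(μφ, (1 − μ)φ), a parameterization the report does not state.

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Posterior means copied from the M3 summary table.
beta0 = 1.768            # intercept (log-odds)
beta_incidence = 0.283   # per 1 SD of incidence
beta_mortality = -0.387  # per 1 SD of mortality
sigma_u = 0.717          # country random-effect SD
phi = 42.861             # overdispersion parameter

# Baseline success probability at reference predictor values.
baseline = inv_logit(beta0)

# Baseline for countries one random-effect SD below / above average.
low_country = inv_logit(beta0 - sigma_u)
high_country = inv_logit(beta0 + sigma_u)

# Odds ratios for a one-SD increase in each standardized covariate.
or_mortality = math.exp(beta_mortality)
or_incidence = math.exp(beta_incidence)

def variance_inflation(n, phi):
    """Beta-binomial variance divided by binomial variance for cohort size n.

    ASSUMPTION: success probability ~ Beta(mu * phi, (1 - mu) * phi).
    """
    return 1.0 + (n - 1.0) / (phi + 1.0)

print(f"baseline success probability: {baseline:.3f}")
print(f"country at -1 SD / +1 SD:     {low_country:.3f} / {high_country:.3f}")
print(f"odds ratio, mortality:        {or_mortality:.3f}")
print(f"odds ratio, incidence:        {or_incidence:.3f}")
print(f"variance inflation, n = 100:  {variance_inflation(100, phi):.2f}")
```

The baseline comes out near 0.85, matching the "roughly 85%" stated for the intercept. Under the assumed parameterization the inflation factor grows with cohort size, which is why a finite \(\phi\) implies more variance than binomial sampling alone.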
Country Random Effects
M3 estimates a random effect for each country. Countries with
positive effects have persistently higher success than expected; those
with negative effects have persistently lower success.

Figure: Country random effects from the M3 hierarchical model.
Each point is the posterior mean country effect, and each horizontal
line is its 95% credible interval on the logit scale. Countries to the
right of zero have higher treatment success than expected after
adjusting for region and predictors, while countries to the left of zero
have lower success than expected. This plot shows persistent
country-level differences, but it should be read as descriptive
benchmarking, not as a causal ranking.
Highest and Lowest Country Effects
Countries with the highest positive random effects. These countries have
persistently higher treatment success than expected after adjusting for
region and covariates. CrI = credible interval.
| iso3 | mean | 95% CrI lower | 95% CrI upper | g_whoregion |
|---|---|---|---|---|
| MLT | 1.753 | 1.044 | 2.550 | EUR |
| SLB | 1.069 | 0.776 | 1.386 | WPR |
| MNE | 1.028 | 0.701 | 1.384 | EUR |
| CHN | 1.023 | 0.727 | 1.351 | WPR |
| TJK | 0.997 | 0.741 | 1.279 | EUR |
| SVK | 0.979 | 0.718 | 1.266 | EUR |
| COD | 0.962 | 0.633 | 1.326 | AFR |
| SLV | 0.914 | 0.663 | 1.183 | AMR |
| TZA | 0.902 | 0.634 | 1.199 | AFR |
| KHM | 0.899 | 0.598 | 1.233 | WPR |
Countries with the lowest (most negative) random effects. These
countries have persistently lower treatment success than expected after
adjusting for region and covariates. CrI = credible interval.
| iso3 | mean | 95% CrI lower | 95% CrI upper | g_whoregion |
|---|---|---|---|---|
| HRV | -1.383 | -1.569 | -1.197 | EUR |
| JAM | -1.461 | -1.693 | -1.231 | AMR |
| BHR | -1.478 | -1.747 | -1.208 | EMR |
| AGO | -1.488 | -1.679 | -1.297 | AFR |
| DNK | -1.509 | -1.701 | -1.319 | EUR |
| IRL | -1.741 | -1.944 | -1.545 | EUR |
| NCL | -1.860 | -2.558 | -1.161 | WPR |
| FIN | -1.926 | -2.138 | -1.719 | EUR |
| FRA | -2.003 | -2.238 | -1.781 | EUR |
| GRC | -3.633 | -4.396 | -2.958 | EUR |
Interpretation: These rankings are adjusted for
region and covariates. They describe which countries over- or
under-perform relative to their predicted baseline. They do not identify
causes of good or poor performance.
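Tables of this kind could be derived from a matrix of posterior draws of the country effects, as in the sketch below. It assumes draws are stored with one row per posterior draw and one column per country, and that the credible interval is the equal-tailed 2.5% to 97.5% range; the function name, the country codes, and the simulated draws are hypothetical.

```python
import numpy as np

def summarize_country_effects(draws, iso3, k=10):
    """Return the k highest and k lowest countries by posterior mean effect."""
    draws = np.asarray(draws, dtype=float)       # shape: (n_draws, n_countries)
    mean = draws.mean(axis=0)
    lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)

    order = np.argsort(mean)[::-1]               # highest posterior mean first
    rows = [(iso3[i], float(mean[i]), float(lower[i]), float(upper[i]))
            for i in order]
    return rows[:k], rows[-k:]

# Hypothetical draws for four made-up country codes.
rng = np.random.default_rng(2)
codes = ["AAA", "BBB", "CCC", "DDD"]
draws = rng.normal(loc=[1.0, 0.0, -1.0, -2.0], scale=0.1, size=(2000, 4))

top, bottom = summarize_country_effects(draws, codes, k=2)
for code, m, lo, hi in top + bottom:
    print(f"{code}: mean={m:.3f}, 95% CrI=({lo:.3f}, {hi:.3f})")
```

Because wide intervals overlap between neighboring countries, the ordering from such a summary is descriptive rather than a definitive ranking.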