A simple Bayesian analysis of the Infection Fatality Rate in Gangelt, and an uncertainty aware extrapolation to infection-counts in Germany¶

Matthias Kümmerer, Philipp Berens, Jakob Macke

view without code | view with code | source code for this post

Abstract¶

We present a simple Bayesian approach to infer the infection fatality rate of Covid-19 from the estimates of the infection rate in the small town of Gangelt, Germany, presented in a recent study. With our modelling assumptions, we find a posterior mean estimate of 0.37%, and a credible interval of 0.12% to 0.67%. We discuss the dependence of these results on the prior assumptions and quantities such as the number of fatalities observed. Finally, we show how the uncertainty in the estimated infection fatality rate would affect an extrapolation to probable infection counts in Germany. Even if the infection fatality rate in Gangelt was the same as in Germany, any such extrapolation would carry a high degree of variability, yielding infection counts which could vary from 0.73 million to 3.94 million. These observations highlight the importance of uncertainty quantification in studies which are based on a small number of observed fatalities.

Introduction¶

With the coronavirus pandemic rapidly developing across the world, estimates of key epidemiological quantities are in critical demand. Maybe like never before, their estimates inform policy decisions, possibly affecting the lives of many and impacting economies world wide. For that reason, providing accurate estimates of these quantities, as well as reliable uncertainty intervals around those estimates is paramount.

Among those key quantities are the infection rate (IR), i.e. the true fraction of infected people in a population, and
the infection fatality rate (IFR), i.e. the fraction of people who die after infection (including asymptomatic and mild cases which remain untested). The latter needs to be distinguished from the case fatality rate, which is defined as the fraction of people dying after being diagnosed with Sars-Cov-19.

A recent study from Germany, the so-called Heinsberg study, aimed at estimating the IR and IFR in a small town called Gangelt in the municipality of Heinsberg, close to Aachen in the West of Germany. This town was hit hard early on in the epidemic due to a super-spreading event caused by Carnival festivities, leading to a (likely) comparably high rate of infected people. While therefore its IR will not be representative for other regions in Germany, it is precisely its high IR which makes it suitable for providing an estimate of the IFR.

The authors combined PCR testing for acute infections and antibody testing, as well as statistical corrections for non-independence among household members and other factors. Among N=919 participants, they thus estimated an IR of 15.53% (12.31%; 18.96%) (95% CI). Noting that 7 people with Covid-19-Diagnosis had died in the time frame of the study, they estimated the IFR to be 0.358% (0.293%; 0.451%) (95% CI). This calculation incorporates uncertainty about the total number of infected people in Gangelt (as the infection count is extrapolated from the tested cases), but not any uncertainty regarding the finite sample of Gangelt as a sample from a larger population.

The authors argue that the observation of 7 deaths in Gangelt is directly observed without error, and therefore does not induce any additional uncertainty. This reasoning could be appropriate if the IFR calculation is exclusively used as a summary statistic describing the data in Gangelt. However, estimates of IFRs are commonly used both in policy-discussions which explicitly or implicitly generalize to regions beyond Gangelt, as well as a means of constraining epidemiological models. Indeed, in the discussion, the authors discuss how their estimated IFR could be used to yield an estimate of the expected number of infected people in Germany, which they state is at least 1.8 Million people, based on the number of 6757 deaths on May 2nd, 2020. The authors do not provide error bars on this quantity, but rather regard it as a `purely theoretical model' (and later explained that their error bars on the IFR would not be valid). Nevertheless, the stated figure of 1.8 Millon infections in Germany has been widely quoted in the press.

Our central questions are:

How could one provide uncertainty estimates (i.e. error bars) on IFRs in a way that includes both the uncertainty about the IR, as well as the uncertainty arising from observing a small number of fatalities?
How would the uncertainty in the IFR translate to an uncertainty in the total number of infected people in Germany?

We show how a Bayesian approach can be used to provide error bars on both the IFR and the number of infected people in Germany, propagating all necessary uncertainties through the inference chain. For a frequentist approach, see this recent preprint.

Our Model:¶

In a first step, we concentrate on estimating the infection fatality rate (IFR) in Gangelt. In a second step, we will show how one can use the IFR (along with the properly quantified uncertainty) to extrapolate to infection-counts in Germany, and how big the resulting error bars would be. We emphasize that both of the estimation procedures will depend on the modelling assumptions, which we will explore below.

We do not attempt to provide an estimate of the infection rate (IR) in Gangelt, but rather directly take the IR-estimate of Streeck et al 2020. See this post for a Bayesian analysis of an infection rate of the Santa-Clara study. In Streeck et al, the infection rate is estimated as 0.1553 with 95%-confidence intervals (0.1231, 0.1896). To preserve this uncertainty, we assume the infection rate to follow a Beta distribution with parameters chosen such that the distribution has the same mean and variance as the IR estimate from Streeck et al.

We then infer a posterior over the infection fatality rate as follows: given a sample from the IR distribution specified above, the number of infected individuals in Gangelt is modelled as binomially distributed depending on the infection rate and the (fixed) number of individuals in Gangelt. Given the number of infected individuals, the number of fatalities is modelled as binomially distributed depending on the infection fatality rate and the number of infections with a weak beta prior on the infection fatality rate. With this model, we compute the posterior over the infection fatality rate using Bayesian inference. The posterior distribution will then capture our beliefs about which IFR are plausible under both the observed data and our prior assumptions.

Inspecting the priors¶

We first visualize the priors: As intended, the prior over the IR has roughly the same confidence interval as the one stated the Gangelt-study (0.1553, [0.1231, 0.1896], mean and 95% CI). By sampling the number of infected people in Gangelt (using a binomial distribution, i.e. assuming that infections are independent from each other), we obtain a distribution over infection counts which is consistent with the statement in the Gangelt study.

For the IFR, we will first use a prior which has a mean of 1%, but still assigns probability to higher IFRs, reflecting our expectations that the IFR will likely not be very high (say, not over 10%), but that -- prior to looking at the data -- we have substantial uncertainty about what the true IFR will be. We will explore the effects of different priors below.

Sampling the posterior¶

Next, we run inference: We try to infer the range of IFRs that is consistent both with our prior assumptions, the estimate of the IR, and the observed number of fatalities in Gangelt. We use Markov Chain Monte Carlo (MCMC) Sampling implemented in PyMC3.

Having obtained samples from the posterior via MCMC, we can look at the marginal posterior over the IFR, i.e. the range of IFR values that is consistent with our model, our priors, and the data.

We find that the mean of the posterior is at 0.37%, very similar to the value reported in the Heinsberg-Study (which reported 0.36%), but that the confidence (or credible) interval is substantially wider (0.12% to 0.67%) vs (0.29% to 0.45%). Did our choice of prior cause this bigger confidence interval?

Estimates of the infection fatality rate for different prior assumptions¶

To explore the influence of different prior assumptions on the estimate and uncertainty of the infection fatality rate, we repeat the inference for different priors for the infection fatality rate: First, a prior mean a mean of 1% and a variance of 1.8%, which we used in the illustration above. Second, the wider prior with mean and standard deviation of 10% prio, and third, a more wide prior with mean 25% and variance 19.4%. Finally, for illustration of an extreme case, we also use a uniform prior assigning equal probability to each value between 0% and 100% (and which thus has a prior mean of 50%).

As expected, we find that the estimates of the IFR depend on the choice of prior, but the effect is only mild: The mean IFR varies between 0.37% and 0.42%, so only ever so slightly. Even on the credible region, the influence is not dramatic-- the lower end varies from 0.12% to 0.14%, the upper one from 0.67% to 0.74%.

Extrapolating to infection counts in Germany¶

We next use the knowledge of the range of values of the IFR in Gangelt to try to estimate the infection count in Germany. Doing this, we assume that the IFR in Germany is the same as in Gangelt, and furthermore we assume that the number of fatalities with Covid-19 is accurately recorded. In this case, one can infer the infection count that would be consistent with these two quantities. Indeed, for a known IFR and fatality-count (and assuming that fatalities are statistically independent), the total infection count is given by a Negative Binomial distribution (see below)-- we can thus estimate the histogram of infecton counts in Germany by averaging samples from a Negative Binomial distribution over the uncertainty in the IFR:

Using the posterior distribution for the IFR estimated above (using the beta prior with mean 1%), this yields the above histogram -- as we can see, there is fairly wide range of infection counts consistent with the data and the IF and IFR estimates of the Gangelt study, ranging from 0.7 million people to almost 4 million people. The posterior mean (2.08 Million) is also slightly different from the one stated in the Gangelt study (1.8 million), but both are well within the high-probability region.

If we use the current number of fatalities in Germany (7824, taken from the Robert Koch Institut as of May 15th, 2020), we can get an updated distribution over plausible infection counts for Germany (mean: 2.47 million, 0.87-4.68 million):

Discussion¶

We showed how one could use a Bayesian approach to obtain estimates of the IFR with error bars, and how one could propagate this uncertainty to extrapolations based on this IFR. We emphasize that the analysis above rests on several distributional assumptions. These include:

We assumed that the estimation of the infection rate, and its uncertainty by Streeck et al is accurate-- we here did not attempt an Bayesian approach to estimating it, but rather used their estimate in the subsequent analysis.
Furthermore, we assumed that a Beta distribution is an appropriate distribution to capture the uncertainty in the IR.
We assumed that infections in Gangelt are independent from each other, conditional on the infection rate.
We assumed that fatalities in Gangelt are independent from each other, conditional on the fatality rate.

In addition, for the extrapolation to infection counts in Germany, we additionally assumed that:

both the recorded number of fatalities in Gangelt (Formula) and in Germany (Formula) are accurate, and that
the IFR in Gangelt and Germany are the same (Formula).

Violation of any of these assumptions, or any combination of them, could have a substantial impact on the results. There are reasonable grounds to challenge several of them, e.g. the assumption that infections occur independenlty from each other, as do fatalities.

We do not take a position on whether these assumptions are appropriate or accurate. However, we would expect that violations of any of these assumptions will generally lead to an increase in the resulting uncertainty. Thus, we would expect that a more conservative analysis (which e.g. models fluctuations in infection rates or statistical dependence in fatalities, e.g. by marginalizing over the associated uncertainty) would likely lead to substantially bigger error bars.

In addition, we emphasize that the results of any Bayesian analysis depend on the prior distributions. In our case, this is most prominently the case for the prior on the fatality rate. Empirically, we showed above that our inference on the posterior distribution over the IFR --- i.e. the error-bars on the IFR--- depended only weakly on the priors we investigated. In general, one could interpret this dependence on the prior as a weakness of the Bayesian analysis. However, one could also argue that it is precisely an advantage of the Bayesian approach that the dependence of the inferred parameters on prior assumptions is made explicit, and that this dependence can be explored quantitatively.

In general, we found that, for all priors investigated, the error bars were substantial, and IFR from beween 0.12% and 0.74% were consistent with the observed data. These ranges are not substantially narrower than prior reports of the IFR-- this is consistent with the naive intuition that any estimate of a fatality rate from a local population would depend on a small number of events (here: 7 fatalities), and therefore must necessarily be of limited reliability. Importantly, however, even weak constraints on the IFR from individual studies (with random sampling) can be highly valuable, in particularly if they are pooled across multiple studies as are currently under way -- by combining the results of multiple such studies, one can both aim to more strongly constrain IFRs, as well as to estimate regional variations in IFRs. For example, see here for a Bayesian meta analysis of the seroprevalence. Such meta-analyses benefit strongly from openly shared data statistical and analysis-protocols.

It has been stated that the goal of the Gangelt study was to provide "constraints for models". As pointed out above, given limited data, no single such study would be able to fully constrain parameters-- multiple parameter-settings might be plausible given the available data. This is particularly evident given the very example provided by the example of how these numbers could be used in a "theoretical model" to extrapolate to an infection count in Germany. The results of any such extrapolation are strongly affected by the uncertainty in the underlying parameters.

These observations highlight a more general point -- given that it is (rarely) possible to fully constrain any parameter of a model or simulation, it is advisable to not over-interpret simulation-results from any particular model. Rather, any simulation-results should ideally be interpreted by considering an ensemble of models which are generated from different parameter-sets, each of which is consistent with the observed data. Recent methods for simulation-based inference, (including our own work), can be used to directly perform inference over the parameters of numerical simulations, and could be useful for constraining large-scale simulations of epidemiological models, as e.g. suggested in a recent preprint.

Appendix¶

Detailed definition of model¶

We assume that the infection rate in Gangelt can be modelled as a Beta distribution with parameters $\alpha_\theta$ and $\beta_\theta$ which are chosen such that this prior has the desired mean and variance, using the fact that

$$ \mbox{E}(\theta) = \frac{\alpha_\theta}{\alpha_\theta+\beta_\theta}$$

and $$\mbox{Var}(\theta) = \frac{\alpha_\theta \beta_\theta}{(\alpha_\theta + \beta_\theta)^2(\alpha_\theta + \beta_\theta+1)}.$$

We assume that infections in Gangelt occur independently from each other at an infection rate $\theta$, i.e. that the total infection count $I_G$ (for fixed $\theta$) is given by a binomial distribution,

$$ I_G | \theta \sim \mbox{Binomial}(n_G,\theta),$$

where $N_G$ is the total number of inhabitants in Gangelt. Averaged over the uncertainty in the infection rate, this results in the infection count in Gangelt given by a beta-binomial-distribution,

$$ I_G \sim \mbox{BetaBinomial}(n_G, \alpha_\theta,\beta_\theta )$$

.

We assume that, amongst infected individuals, fatalities occur independently at infection fatality rate $\lambda$, i.e.

$$ F_G | I_G \sim \mbox{Binomial}(I_G, \lambda),$$

which (marginalized over $I_G$) yields a likelihood $P(F_G=f_G| \lambda$), with observed fatality-count $f_G= 7$.

Thus, the likelihood corresponds to a "thinned" beta-binomial distribution (see Wiuff & Stumpf 2006 for a discussion of which distributinos are closed under binomial subsampling).

Finally, we need a prior distribution over $\lambda$. We choose a Beta-distribution with parameters $\alpha_\lambda$ and $\beta_\lambda$,

$$ \lambda \sim \mbox{Beta}(\alpha_\lambda,\beta_\lambda).$$

Different numbers of fatalities¶

So far we used a number of seven fatalities in Gangelt, as in the results of the study. However, the paper reports that at least one additional death occured between the end of the reporting period and the publicaton of the preprint. Here we also run the model with different numbers of fatalities in Gangelt and see how much this influences results.

Changes in fatality-count would substantially change extrapolated infection count for Germany: it increases from 2.07 million to 2.47 million (6 instead of 7 fatalities) or decreases to 1.79 million (8 instead of 7 fatalities).

Comparing beta and normal priors to model study estimate of the infection rate in Gangelt¶

The Heinsberg study estimates the infection rate in gangelt to be 0.1553 [0.1231, 0.1896]. So far we modeled this as a Beta distribution, which is a natural choice for a distribution over probabilities. However, people from outside Bayesian statistics might have used a normal prior with same mean and variance. Here we check whether this influences results.