Thursday, April 23, 2020

Corona 4: infection rates and cool graphs

Covid-19: the fun never ends. So you've heard about the lockdowns beginning to "ease" in Europe, the redneck / right-wing nut protestors in the US, and definitely -- for a while now -- about flattening the curve. But apart from going stir crazy inside, and politicians and business owners getting worried about re-elections and profits, respectively, how can we know when it might be safe(-ish) to start lifting restrictions and going back to "normal" (whatever that will be) -- and, most importantly, how and when the virus might actually start to go away? The answer (or more accurately, part of the answer) is: the infection rate (aka "Rt: the effective reproduction number", even better aka "The Metric We Need to Manage COVID-19").

Rt is, as the name of this article suggests, the "effective reproduction number" or "infection rate" and in this case we of course mean for Covid-19 (what else would anyone be talking about these days?). Similar in some ways to the replacement fertility rate, which is the average number of children that would be born to a woman over her lifetime required to sustain a given population, Rt can tell us what we can expect to happen in a population -- but in this case, how the virus might keep spreading or start to slow down. For replacement fertility, the important number is right around 2.0 (it's a little higher due to some other factors like child mortality rates but for our purposes and in general for the developed world you can think of it as 2.0). If the number is < 2.0 in a country, eventually (without immigration), the population will disappear. If it's > 2.0 then the population will grow. Easy, right?

It's similar with the effective reproduction rate. But in this case, it tells us the average number of people that an infected person will infect. If it's > 1.0 then the virus will keep spreading. If it's < 1.0, then the virus will eventually disappear. As Tomas Pueyo put it in his excellent article The Hammer and the Dance: "If R is above 1, infections grow exponentially into an epidemic. If it’s below 1, they die down." Let's look at a simple example:

1. Imagine 10 people have the virus and we are experiencing Rt = 2.0. Then those 10 people will each infect (on average) 2 other people. So then we have 20 new cases (eventually the original 10 people will get better, we hope, or, sadly, will stop being counted as "active cases" in another way). Those 20 people, with the same infection rate, will each infect another 2 people each, and now we have 40 new cases. Et cetera...

2. Now imagine that 10 other people have the virus, but because of social distancing, self-isolation, and other "restricting" measures, the Rt has been pushed down to, say, 0.85. Those 10 people will then infect a further ~8.5 people (we can round up to 9 to be conservative), who will then infect 9*0.85 = 7.65 (or around 8), who will then infect 8*0.85 = 6.8 (or around 7), until, eventually, there's no one being infected.
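If you like seeing things in code, here's a tiny sketch (my own, just naive arithmetic -- it skips the rounding-up done above) of how those two scenarios play out over a few "generations" of infections:

```python
# A naive sketch: repeatedly multiply the number of new cases by Rt.
def expected_new_cases(initial_cases, rt, generations):
    cases = float(initial_cases)
    history = [cases]
    for _ in range(generations):
        cases *= rt                # each case infects rt others on average
        history.append(round(cases, 1))
    return history

print(expected_new_cases(10, 2.0, 5))   # [10.0, 20.0, 40.0, 80.0, 160.0, 320.0] -> keeps growing
print(expected_new_cases(10, 0.85, 5))  # [10.0, 8.5, 7.2, 6.1, 5.2, 4.4]        -> dies out
```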

So, obviously, the point of the lockdown measures (among other things) is to push Rt down towards 0. And we can only consider opening things up if we are confident that we can keep Rt below 1. From The Hammer and the Dance again:
In Wuhan, it is calculated that Rt was initially 3.9, and after the lockdown and centralized quarantine, it went down to 0.32.
But once you move into the Dance [easing lockdown measures], you don’t need to do that anymore. You just need your Rt to stay below 1: a lot of the social distancing measures have true, hard costs on people. They might lose their job, their business, their healthy habits…
As long as the infection rate stays under 1 until there's a vaccine, we can prevent another outbreak. With that background in place, we can now finally get to the big question and (hopefully) some answers: what is Rt currently in different places, what has it been in the past, and where is it going?

What is Rt right now?
Unfortunately, as you can probably imagine, we can't compute an exact value for Rt. It is a latent (hidden) variable. We can, however, observe the effect of the infection rate, which is (of course) the number of people being infected. The way one estimates it is (duck for cover): statistics! Probabilistic inference! And most exciting of all, Bayesian inference! What a coincidence that the author of this blog's PhD thesis was on this very topic! Now, there are very few things in the world that tend to be as boring and/or annoying as someone explaining their PhD thesis work, so I will try to keep the explaining part to a minimum and the cool results part to a maximum. Sadly, there is some background that I just can't help but lay out here. It will be super basic though.

Bayesian statistics uses Bayes' theorem to understand degrees of belief in events happening. The main thing that distinguishes it from frequentist statistics is that we infer probabilities of events based not only on running many trials and building up statistics, but also on our prior belief (encoded as a probability) that something will (or will not) happen. We're mainly interested in computing posterior probabilities, which (thanks to Bayes' theorem) are proportional to the likelihood of our data given some proposition being true, times our prior belief that the proposition in question really is true. Tiny obligatory simplified coin-tossing example time. In pure frequentist statistics, if we flip a coin 5 times and each time it comes up heads, our model will predict that this coin will come up heads 100% of the time. In Bayesian statistics, if we encode a prior belief that, like most coins, it will come up heads 50% of the time, then our model will make a more sensible prediction that (depending on the likelihood and prior functions we chose) will be closer to heads coming up maybe a tiny tiny bit more often than tails. If we keep running experiments and after 1000 trials (for example) we still see heads coming up something like 950 times, then the evidence (aka the likelihood) will overwhelm our prior belief and we'll again come to the conclusion that this coin is not very fair and is in fact weighted to come up heads a lot more often.
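To make that concrete, here's a quick sketch (my own toy example, using a Beta prior -- not something mentioned above, but the standard choice for coin flips) of how a prior belief in a fair coin gets updated by evidence:

```python
# Toy Bayesian coin example: a Beta(a, b) prior over "probability of heads",
# updated with observed flips. With a = b = 50 we encode a fairly strong prior
# belief that the coin is roughly fair; the Beta posterior mean has a closed form.
def posterior_mean_heads(heads, tails, prior_a=50, prior_b=50):
    return (prior_a + heads) / (prior_a + prior_b + heads + tails)

print(posterior_mean_heads(5, 0))      # 5 heads, 0 tails -> ~0.52: barely moved off 0.5
print(posterior_mean_heads(950, 50))   # 950 heads in 1000 flips -> ~0.91: evidence wins
```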

It's well known that we are not able to count every single person who gets infected (for example, recent research suggests that Covid-19 may be between 50 and 85 times (!!) as prevalent as official figures suggest). Thankfully, however, as long as we use the same method to count the number of people infected, it doesn't really matter for our purposes if it over- or under-estimates the raw numbers -- as long as it's more or less consistent.

Our method
This is great because I love Instagram so much: the guys behind Instagram have created a website, rt.live, which not only calculates values of Rt for all of the US states (and graphs them really nicely) but also explains the method and provides some code that we will use to extend the analysis to places that readers of this particular blog may care more about: namely, Germany and Canada.

As Kevin Systrom does at rt.live, we will use a variation of the Bettencourt and Ribeiro method (probably no relation to the Milwaukee Admirals sniper Mike Ribeiro). This method uses Bayes' Theorem with the following setup:

P(Rt | k) = P(k | Rt) · P(Rt) / P(k)

This gives us an expression for the probability of Rt given k new cases, P(Rt | k). This is the posterior probability (see above) that we're interested in, and it is equal to:

- The likelihood of seeing k new cases given an infection rate of Rt, P(k | Rt), multiplied by
- The prior probability (or prior belief) of what Rt might be before we've seen any data, P(Rt), divided by
- The probability of seeing this many new cases in general, P(k).

Oh dear, this is already getting a bit complex. But please bear with me for just a little bit longer and then we'll get into the nice graphs, I promise. I will spare you the actual inference calculation, but we need to at least understand what we will use to model each of the expressions mentioned above.

First, the likelihood function. Since we're modeling counts of something, we will use the Poisson distribution. From our friend Wikipedia, this models the "probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event". It has a single parameter λ (lambda), which is the expected number of occurrences. After some derivation, we can connect λ with k (the number of new cases) and our famous rate Rt. It's like this:

λ = k_{t-1} · e^(γ (Rt − 1))

which can then be plugged into the Poisson distribution to get a nice formula for our likelihood:

P(k | Rt) = λ^k · e^(−λ) / k!

Remember that we know k_{t-1} (it's the number of new cases on the previous day). Those who have studied biology will probably recognize our expression for λ as a standard exponential growth model. Note that γ is the reciprocal of the serial interval (γ ≈ 1/7 for Covid-19).

So what does this likelihood function actually look like? Consider an example where we have 4 days of data: on day 1 there were 20 new cases, day 2 had 40, day 3 had 55, and day 4 had 90. Then, the likelihood of Rt for each of these days looks like this:

[Figure: the likelihood of Rt for each of the four example days]

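Here's a rough sketch of how those per-day likelihood curves can be computed (not the rt.live code itself, just the same idea with NumPy/SciPy, assuming γ = 1/7 and the made-up case counts above):

```python
import numpy as np
from scipy.stats import poisson

GAMMA = 1 / 7                               # reciprocal of the serial interval
rt_grid = np.linspace(0, 10, 501)           # candidate values of Rt
new_cases = [20, 40, 55, 90]                # the toy daily counts from the example

for k_prev, k in zip(new_cases[:-1], new_cases[1:]):
    lam = k_prev * np.exp(GAMMA * (rt_grid - 1))   # expected new cases under each Rt
    likelihood = poisson.pmf(k, lam)               # P(k | Rt) for each candidate Rt
    print(f"k={k}: likelihood peaks at Rt ≈ {rt_grid[np.argmax(likelihood)]:.2f}")
```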
Next let's look at our prior. Remember, this is our prior belief about how likely different values of Rt are, before we've observed any data. Because the value of Rt on one day will probably be very close to its value on the previous day, we use a normal (Gaussian) distribution with a mean equal to the value of Rt on the previous day. Therefore our prior looks like this:

P(Rt | Rt-1) = Normal(Rt ; Rt-1, σ)

where σ is a hyperparameter that we can estimate (or just use some "sensible" value) and is the Gaussian's variance. An example prior distribution P(Rt) with Rt-1 = 3.9 (an estimate of what the initial value was in Wuhan) and σ = 0.3 (we can choose a good value by maximizing the likelihood of our observed data) looks like this:

[Figure: example prior P(Rt), a Gaussian centred at Rt-1 = 3.9 with σ = 0.3]

What this means is there's a very high probability of the new value being around 3.9 again (in fact it is the most likely value), but there's quite a spread of possible values (and the chance of getting exactly 3.9 is vanishingly small). For example, if we integrate the probability density function above from 3.0 to 3.5, this tells us the probability of Rt falling between 3.0 and 3.5, and that probability is 18.24%.
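You can sanity-check that number yourself; a couple of lines of SciPy (treating σ = 0.3 as the variance, so a standard deviation of √0.3) give the same ~18.24%:

```python
from scipy.stats import norm

# Probability mass of the example prior (mean 3.9, variance 0.3) between 3.0 and 3.5
prior = norm(loc=3.9, scale=0.3 ** 0.5)
print(prior.cdf(3.5) - prior.cdf(3.0))   # ≈ 0.1824, i.e. about 18.24%
```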

Finally, we're ready to combine the prior probability with the likelihood function and perform inference with our observed data. All we have to do for that is multiply the prior by the likelihood and normalize.
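Putting it all together, here's a simplified sketch of the whole update loop (my own simplification using the toy numbers from before, not the actual rt.live notebook code, which does a fair bit more):

```python
import numpy as np
from scipy.stats import poisson, norm

GAMMA = 1 / 7
SIGMA = 0.3                                  # prior variance hyperparameter (assumed value)
rt_grid = np.linspace(0, 10, 501)
new_cases = [20, 40, 55, 90]                 # toy daily counts

posterior = np.ones_like(rt_grid) / len(rt_grid)     # flat initial belief over Rt
for k_prev, k in zip(new_cases[:-1], new_cases[1:]):
    # Today's prior: yesterday's posterior, blurred by the Gaussian drift in Rt
    drift = norm.pdf(rt_grid[:, None], loc=rt_grid[None, :], scale=np.sqrt(SIGMA))
    prior = drift @ posterior
    # Likelihood of today's count given each candidate Rt (exponential growth + Poisson)
    lam = k_prev * np.exp(GAMMA * (rt_grid - 1))
    likelihood = poisson.pmf(k, lam)
    # Posterior is proportional to likelihood times prior, then normalize
    posterior = likelihood * prior
    posterior /= posterior.sum()
    print(f"k={k}: posterior mode of Rt ≈ {rt_grid[np.argmax(posterior)]:.2f}")
```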

Applying the model to real-world data

Worldometer provides up-to-date numbers for new confirmed cases of Covid-19 across the world. I gathered the data for Germany and Canada on Thursday, April 23rd from https://www.worldometers.info/coronavirus/country/germany/ and https://www.worldometers.info/coronavirus/country/canada/ respectively. Here are the total numbers of cases for Germany and for Canada starting on March 2nd:

[Figure: total confirmed Covid-19 cases over time for Germany and Canada, starting March 2nd]

Just looking at the graphs above: while Germany has a lot more total cases than Canada (almost 4x as many -- Germany has ~150,000 while Canada only has ~40,000), it looks like Germany's cases are currently growing more slowly, and so Germany likely has a lower Rt. Let's find out!

We can see more clearly in the following graphs that indeed Canada's new daily case numbers are currently going up, while Germany's are going down (we also show a "smoothed" curve -- and use that in our model calculations -- because it helps control for noise and unseen errors in the observed numbers, like people who didn't get counted, etc.).
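For the curious, the smoothing is nothing fancy; a sketch of the sort of thing involved (a centred Gaussian-weighted rolling average with pandas, on made-up numbers -- the exact window and weights here are my own assumptions) looks like this:

```python
import pandas as pd

# Made-up daily new-case counts, smoothed with a centred 7-day Gaussian-weighted
# rolling mean (similar in spirit to what the rt.live notebook does).
daily_new_cases = pd.Series([2450, 1870, 2310, 2580, 2120, 1490, 1260, 2230])
smoothed = (daily_new_cases
            .rolling(7, win_type="gaussian", min_periods=1, center=True)
            .mean(std=2)
            .round())
print(smoothed)
```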

[Figure: daily new cases for Germany and Canada, raw and smoothed]

Again, the raw numbers are bigger in Germany, but the fact that those numbers are going down is what's important. Let's see those infection rates!!!

[Figure: estimated Rt over time for Germany and Canada, with 90% probability density intervals]

The data points show the maximum likelihood estimates of the infection rate for each day, and the grey shaded area shows the 90% probability density interval of our posterior. You can see that we get more confident in the estimated range as we gather more data. In both cases, the good news is that the trend for the infection rate is clearly on the way down. For Germany, as the initial data suggested, the infection rate is slightly lower than it seems to be in Canada, and perhaps it is low enough for a very cautious, gradual re-opening of the economy (but I personally doubt it).
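In case you're wondering how a "90% probability density interval" gets pulled out of a posterior like ours: here's a rough helper (my own sketch, not the rt.live implementation) that finds the narrowest interval on the Rt grid containing 90% of the probability mass:

```python
import numpy as np

def highest_density_interval(rt_grid, posterior, mass=0.90):
    """Narrowest interval on rt_grid containing at least `mass` of the (normalized) posterior."""
    cumulative = np.cumsum(posterior)
    best_lo, best_hi = rt_grid[0], rt_grid[-1]
    for i, lo in enumerate(rt_grid):
        # first index j such that the probability in rt_grid[i..j] reaches `mass`
        j = np.searchsorted(cumulative, (cumulative[i] - posterior[i]) + mass)
        if j < len(rt_grid) and rt_grid[j] - lo < best_hi - best_lo:
            best_lo, best_hi = lo, rt_grid[j]
    return best_lo, best_hi
```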

The graph looks nice, but here are some more specific numbers for the last few days, first in Germany (left), and then in Canada (right):

[Table: recent daily Rt estimates for Germany and Canada -- ML estimate, low, and high ends of the 90% interval]

The tables above show the maximum likelihood (ML) estimate -- i.e. the most likely value for Rt -- followed by the low end of the 90% probability density interval and then the high end. For Germany, our best guess for the current infection rate is 0.77, and it's almost for sure not lower than 0.49 and not higher than 0.98. For Canada, our best guess is 1.08, and it's almost for sure somewhere between 0.81 and 1.30.

So does this mean that things can go back to normal any time soon? I'd say... definitely not.