In May 2021, The Economist became the first and only organisation in the world to publish an estimate of the pandemic’s true death toll, as well as estimates of excess deaths for every country in the world. These are, as far as we are aware, the only estimates of excess deaths—the best metric of pandemic mortality, with no dependency on testing—available at the global level.
The wider project included a cover “briefing” and a leader calling for more equitable distribution of vaccines, as well as multiple follow-up pieces, an online methodology, and a daily-updated interactive tracker.
As SARS-CoV-2 spread across the world, it quickly became apparent that some places were better at detecting it than others. Whether measured in infections or deaths, the biggest difference has always been this: victims in richer countries are more likely to be tested, and therefore more likely to be counted.
This matters, because for important questions, such as where to allocate aid or vaccines, or even where to accept travellers from, one needs an accurate picture of the pandemic globally. Such a picture is also important to identify which regions remain vulnerable, and to identify and acknowledge the pandemic’s devastation beyond the rich world. If one relies on official covid-19 death counts, the result is predictable: most victims in poor countries will not be counted, and these countries will not receive help.
Our project aims to estimate and show a more accurate representation of the pandemic’s true death toll. The effort has been widely praised by international organisations: the WHO called it “heroic”, and the UN called it “exemplary”. It has been used as the basis for analysis by the World Bank, and has been acknowledged by the Global Fund, one of the largest distributors of pandemic aid. Researchers at the University of Oxford call it “the most comprehensive and rigorous attempt to understand how mortality has changed during the pandemic at the global level.”
The project’s estimates have been widely used, and have even become part of Our World in Data’s covid-19 “data dashboard”. We know that at least one of the world’s largest aid organisations has used our findings in discussions about how to allocate pandemic aid. The estimates are also used by the WHO, and our journalists regularly advise the organisation on related questions upon request.
For the modelling powering the effort, we relied on cutting-edge statistical procedures and implementations. Specifically, the modelling begins with a massive script collecting data on more than 100 statistical indicators, ranging from cell-phone data, to researchers’ categorisations of countries as democracies or non-democracies, to the share of covid-19 tests reported as positive.
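The indicator-assembly step can be illustrated as merging many per-country sources into a single feature table. This is a minimal sketch: the source names, column names, and values below are invented for the example and are not the project’s actual data.

```python
import pandas as pd

# Hypothetical per-country sources; the real project uses 100+ indicators
mobility = pd.DataFrame({"iso3": ["SWE", "PER"], "mobility_change": [-0.12, -0.45]})
regime = pd.DataFrame({"iso3": ["SWE", "PER"], "is_democracy": [1, 1]})
testing = pd.DataFrame({"iso3": ["SWE", "PER"], "test_positivity": [0.03, 0.22]})

# Merge everything into one feature table keyed by country code
features = mobility.merge(regime, on="iso3").merge(testing, on="iso3")
print(features.shape)  # (2, 4)
```

Keying every source on a shared country identifier is what lets a single script combine sources as different as mobility data and political-regime classifications.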
This information is then fed into a relatively new implementation of gradient-boosted trees, a flexible machine-learning algorithm, not once or twice but 201 times. This produces 201 models: one provides the central estimate, while the other 200 are used to produce confidence intervals, showing where the models are uncertain and giving ranges rather than implausibly precise numbers.
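The ensemble idea above can be sketched as one model fitted to the full data plus many models fitted to bootstrap resamples, whose spread of predictions yields an interval. This is an assumption-laden illustration: the data are synthetic, scikit-learn stands in for the project’s actual gradient-boosting implementation, and 20 bootstrap models stand in for the 200 used in production.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, k = 500, 5
X = rng.normal(size=(n, k))                      # stand-in for the 100+ indicators
y = X @ rng.normal(size=k) + rng.normal(scale=0.5, size=n)

# Central model, trained on all the data
central = GradientBoostingRegressor(random_state=0).fit(X, y)

# Bootstrap models: refit on resampled rows to capture uncertainty
n_boot = 20                                      # the project uses 200
boot_preds = []
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)             # sample rows with replacement
    m = GradientBoostingRegressor(random_state=b).fit(X[idx], y[idx])
    boot_preds.append(m.predict(X))
boot_preds = np.vstack(boot_preds)

point = central.predict(X)                       # central estimate
lo, hi = np.percentile(boot_preds, [2.5, 97.5], axis=0)  # plausible range
```

Where the bootstrap models disagree, the percentile band is wide, which is exactly the "showing where the models are uncertain" behaviour described above.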
Every morning, the data are updated automatically on Github, predictions are generated from all 201 models, and these are then passed through a series of tests. If all tests pass, every interactive element on our pages is updated with the new data, and updated spreadsheets are posted to Github. To learn more, please see the separate methodology article, or take a look at the code, models, and data, all of which are 100% open-source. (Feel free to suggest improvements too!)
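The publish-gating step can be sketched as: generate predictions, run sanity checks, and only write output if every check passes. The check names, thresholds, and array shapes below are invented for illustration and are not the project’s real tests.

```python
import numpy as np

def sanity_checks(preds: np.ndarray) -> list:
    """Return a list of failed-check descriptions (empty list = all passed)."""
    failures = []
    if preds.shape[0] == 0:
        failures.append("no predictions generated")
    if np.isnan(preds).any():
        failures.append("predictions contain NaN")
    if (np.abs(preds) > 1e8).any():
        failures.append("implausibly large values")
    return failures

# Stand-in predictions: 201 models by 365 days of output
preds = np.random.default_rng(1).normal(size=(201, 365))
failures = sanity_checks(preds)
if failures:
    raise SystemExit("Not publishing: " + "; ".join(failures))
# ...otherwise write the spreadsheets and update the interactive pages...
```

Gating publication on automated checks means a bad upstream data release halts the daily update rather than silently propagating to readers.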
What was the hardest part of this project?
The hardest part of this project was the modelling. Estimating how many excess deaths there have been in every country, for every day since January 1st 2020, is, to be frank, very difficult. The second-hardest part has been communicating the uncertainty of the estimates clearly.
To take a few examples: the modelling goes beyond simply collecting data and feeding them to a model. For instance, data on seroprevalence (the share of people with covid-19 antibodies) in different countries were collected from academic papers. While collections of these exist, whether such surveys were representative had to be decided manually, which involved skimming hundreds of academic articles and government web pages. Many countries also had missing data: vaccination counts, for instance, are often reported only at irregular intervals. That means that for the main models to use vaccination data well, the underlying vaccination data need to be modelled too: a model within a model.
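The “model within a model” for irregular vaccination reporting can be illustrated by filling gaps between reports before the series reaches the main model. In this sketch a simple time-based interpolation stands in for the project’s actual sub-model, and the dates and dose counts are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative dose counts, observed only on some days
reported = pd.Series(
    [0.0, np.nan, np.nan, 1.2e6, np.nan, np.nan, 2.0e6],
    index=pd.date_range("2021-03-01", periods=7),
)

# Estimate the missing days between reports
filled = reported.interpolate(method="time")
# Cumulative counts should never decrease
filled = filled.cummax()
```

However the gaps are filled, the key point stands: the main models need a value for every country-day, so irregular series must first be turned into regular ones.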
Communicating uncertainty well has been hard too. For all estimates, we have an upper and a lower value (forming a range of plausible values), and a central estimate within that range. For many countries, good data mean a narrow range. That is easy enough to communicate, and if people just use the central estimate, it does not make much difference. For other countries, by contrast, limited data mean we can only give very broad ranges. In such cases, using a single number would give a misleading sense of precision. This meant many careful choices about how we write and present our work visually. For instance, our summary tables provide only the range, not the central estimate.
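The range-only presentation choice amounts to a formatting rule: render the interval and deliberately omit the point estimate. The function name, rounding, and figures below are invented for illustration.

```python
def format_estimate(low: float, high: float) -> str:
    """Render an excess-death range in thousands, with no single point value."""
    return f"{low / 1e3:,.0f}k to {high / 1e3:,.0f}k"

# A deliberately wide hypothetical range for a country with limited data
print(format_estimate(120_000, 480_000))  # → 120k to 480k
```

Omitting the central estimate at the formatting layer means a wide interval can never be quietly collapsed into one falsely precise number downstream.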
What can others learn from this project?
We think that other journalists may get good ideas for how they can use new statistical tools (such as machine learning) in their work, and for why providing such estimates matters. We have tried to make learning from this project as easy as possible, keeping all our data, code and methods open-source to enable that. We have also tried to share what we have learned around the world: in university lectures in America; in the closing keynote on the use of machine learning in data journalism at Code.br, the largest data-journalism conference in Latin America; and in talks to the WHO, the Global Fund, and the Bill and Melinda Gates Foundation.