In early April 2020, we published the world’s first international comparison of covid-19 excess mortality, after tip-offs from people in Italy and France that official death tolls were undercounting the actual number of fatalities. Later that month, we published the first interactive tracking page for excess mortality, showing data for several countries. In May, we were the first organisation to publish all of this data on GitHub, along with our sources and code. We have updated the database regularly since then, while expanding our selection of countries.
Excess mortality is the best way to compare the impact of covid-19 across countries. Varied levels of testing, especially in developing nations, can make official death tolls unreliable. In April, no organisation was collecting this data internationally.
So we decided to do it ourselves. We put the tracker in front of our paywall, to act as a public service. We also published all of our data, sources and code on GitHub. We wanted academics to trust our work, and to give them the tools they needed for their own research. According to Google Scholar, our work on excess mortality has been cited in more than 120 academic articles.
This has shaped global awareness of excess mortality. After we published our tracking pages, other organisations (such as the New York Times and Financial Times) produced similar work. Governments also took notice. When we started out, only a handful of governments published regular mortality data. Now, dozens do.
One notable example of our journalism encouraging greater transparency was in Mexico. The health ministry cited our tracker in official briefings (see this link). It then started to publish data in a similar format, replicating our heatmaps (see this link). Mexico has one of the world’s highest rates of excess mortality: excess deaths there are 150% greater than the official covid-19 death toll. It would not have been possible to establish this without the health ministry releasing this data.
To gather the data, we used a combination of R scripts and lots of manual labour. For some countries, we could automatically download data and clean it into the right format. For others, we had to retrieve spreadsheets by hand, or even transcribe charts from official websites into a machine-readable format.
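The actual pipeline was written in R; as a rough Python sketch of the "clean into a common format" step, this is the kind of normalisation each country's raw export needed (the column names, delimiter and figures here are invented for illustration):

```python
import csv
import io

# A hypothetical raw export: in practice, column names, delimiters and
# date formats varied from country to country.
RAW = """week;year;deaths_total
1;2020;11500
2;2020;11720
"""

def clean_rows(text, country, delimiter=";"):
    """Normalise one country's raw mortality export into a common schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for r in reader:
        rows.append({
            "country": country,
            "year": int(r["year"]),
            "week": int(r["week"]),
            "total_deaths": int(r["deaths_total"]),
        })
    return rows

rows = clean_rows(RAW, "Exampleland")
```

Once every source is mapped onto the same schema, the downstream modelling and charting code only ever has to deal with one format.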
Once we had data in clean CSVs, we trained regression models in R that could predict a baseline of expected deaths in each week or month of 2020, based on national trends from recent years.
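The models themselves were fit in R; a minimal Python sketch of the underlying idea — extrapolating a per-week trend from recent years by least squares — might look like this (the historical figures are invented):

```python
def expected_deaths(history, target_year=2020):
    """Fit a simple linear trend (deaths ~ year) by least squares over
    past years and return the predicted baseline for the target year.
    `history` maps year -> deaths for one week (or month) of the year."""
    xs = list(history)
    ys = [history[y] for y in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_year

# Invented figures for one week of the year, 2015-2019:
hist = {2015: 10000, 2016: 10100, 2017: 10200, 2018: 10300, 2019: 10400}
baseline = expected_deaths(hist)      # trend continues: 10500
excess = 12000 - baseline             # observed minus expected
```

Excess mortality for a period is then simply the observed total deaths minus this predicted baseline.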
The graphics were written with React and D3.js. One thing that sets our tracker apart from our competitors’ is that every chart is interactive. In a topic as complicated as this one, we believe transparency matters even more, so we added tooltips to every chart, showing expected and total deaths for each point. We also added a toggle to switch between deaths per 100,000 people and absolute figures.
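The charts themselves are React and D3.js; the per-100,000 toggle is just a unit conversion applied before rendering, which in Python terms amounts to:

```python
def per_100k(deaths, population):
    """Convert an absolute death count into a rate per 100,000 people,
    which makes countries of very different sizes comparable."""
    return deaths * 100_000 / population

per_100k(1500, 10_000_000)  # 15.0 deaths per 100,000
```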
The tracker has evolved too: at the beginning of the pandemic it was mostly a grid of line charts, but after we doubled the number of countries we redesigned the page. We added a heatmap with national and regional data, and a short explainer that walks the reader through the concept of expected and total deaths.
What was the hardest part of this project?
There were several challenging aspects of this project. Gathering the data in the first place required a lot of digging through government websites, many of which are in languages other than English. We emailed several possible sources, and asked The Economist’s correspondents around the world to help track down information. Wrangling the numbers from different countries into a consistent format took a lot of effort.
Refreshing the data and adding new countries has also proved tricky and time-consuming. Throughout the pandemic, we have tried to keep the page as up-to-date as possible. We have also answered several queries from readers.
Even after spending dozens of hours making the data consistent, we spent a considerable amount of time getting the little details right on the visualisations. Most countries publish weekly data but some only publish monthly files. Some countries don’t have nationwide data available, only major cities. We ended up adding multiple footnotes and different code paths to account for these.
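The different code paths mentioned above were written in R; as a hedged Python sketch, handling the weekly-versus-monthly split might look something like this (putting monthly figures on a rough weekly scale is a simplification — the real tracker keeps each country's native frequency and branches in the charting layer instead):

```python
AVG_WEEKS_PER_YEAR = 52.18  # 365.25 days / 7

def weekly_equivalent(deaths, frequency):
    """Rescale a monthly death count to a rough weekly equivalent so that
    weekly and monthly reporters can share one axis. Raises on any
    frequency we have no code path for."""
    if frequency == "weekly":
        return deaths
    if frequency == "monthly":
        # 12 months spread over ~52.18 weeks
        return deaths * 12 / AVG_WEEKS_PER_YEAR
    raise ValueError(f"unsupported frequency: {frequency}")
```

Countries with only major-city data would similarly take their own branch, with a footnote flagging the narrower coverage.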
What can others learn from this project?
Hopefully this project has demonstrated the importance of publishing data and code on GitHub, since it has allowed academics to interrogate our work, and also to use it in their own research. This project has also shown the benefits of automating work with R scripts. Updating every single source each week by hand would have been very cumbersome; increasingly, we can gather most of the data by directly downloading CSVs from government websites.