The COVID Tracking Project at The Atlantic is a painstaking effort to compile more than 800 data points on the coronavirus pandemic from all 50 states, the District of Columbia, and U.S. territories every day. It is housed within The Atlantic, and we worked with the Boston University Center for Antiracist Research on race data collection. Data scientist Jeff Hammerbacher and Atlantic journalists Alexis Madrigal and Robinson Meyer began tracking the data separately before merging their efforts; they later brought on Erin Kissane to co-found the official organization in the first week of March 2020.
We became the definitive, trustworthy source for U.S. coronavirus data, filling the gap left by the federal government. Our data has been used in thousands of news articles and broadcasts across the political spectrum, and in hundreds of academic and medical papers in the most prestigious journals. It’s also used by the largest aggregation efforts and has been cited in multiple pandemic-related lawsuits.
The White House used our testing data rather than the CDC’s, and the Biden transition team relied on CTP data. The CDC published a report stating that our race data might be more accurate than the federal government’s. The CDC’s vaccine advisory council used our data in presenting evidence on who should be included in phase 1A of the vaccine roll-out. Federal lawmakers repeatedly used our data in demanding answers from the executive branch, and cited us in a bill called the Improving COVID-19 Data Transparency Act. We’ve been cited by numerous federal agencies and praised by Dr. Deborah Birx as “superb” and by former CDC director Tom Frieden as “invaluable.” The Biden administration used CTP data, rather than the CDC’s, in its day-one COVID response plan.
We required states to make data available publicly if they wanted it included in our datasets. We wanted states to report tests in units capturing repeat testing; initially, only 19 states did so, and now all but three jurisdictions report repeat testing. When we began tracking race and ethnicity data, fewer than half of the jurisdictions reported those numbers. Now, nearly every single one does. Our work led to more than two dozen changes to data points on state websites, from correcting errors to clarifying definitions to releasing new data points. And we were instrumental in securing the release—and demonstrating the utility—of new data from the federal government.
The COVID Tracking Project used a unique approach to gather data, relying on a network of hundreds of trained volunteers. The radically transparent project made all of its data and analyses publicly available through an API, downloadable data, dashboards, and highly detailed explanations of the data, so that anyone could use them. We also developed an entire suite of charts for public use, which have been used and replicated by broadcast, digital, and traditional print outlets.
Since its inception, the CTP has built out unprecedented processes to collect the most complete and accurate data possible. Many other groups tried to “scrape” COVID-19 data automatically, but that method proved unreliable. We relied on human labor, in the form of hundreds of trained volunteers, to fact-check every single data point, and we built automated systems that ran in the background of the data collection process to help those people. We also created a data quality team that engaged in deep research to provide the metadata necessary to understand how states were reporting. We relied on a complex workflow combining Slack, Google Sheets, Airtable, and our own databases.
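The following is a minimal sketch of the kind of background check that can support people entering data by hand. It is an illustration only, not the project's actual tooling: the `DailyEntry` record, the `flag_decreases` function, and the rule it applies (a cumulative count should not drop from one day to the next) are all hypothetical assumptions. The check does not correct anything; it only produces a list for a person to review.

```python
from dataclasses import dataclass


@dataclass
class DailyEntry:
    """One hand-entered daily value for a cumulative metric (hypothetical shape)."""
    state: str
    date: str    # ISO date string, so plain string sorting is chronological
    metric: str  # illustrative metric name, e.g. "positive"
    value: int


def flag_decreases(entries):
    """Return (previous, current) pairs where a cumulative value went down.

    Cumulative counts normally only rise, so a drop suggests either a typo
    or a change in how the number is reported. Both cases need a human.
    """
    flagged = []
    last_seen = {}  # (state, metric) -> most recent earlier entry
    for entry in sorted(entries, key=lambda e: (e.state, e.metric, e.date)):
        key = (entry.state, entry.metric)
        previous = last_seen.get(key)
        if previous is not None and entry.value < previous.value:
            flagged.append((previous, entry))
        last_seen[key] = entry
    return flagged
```

A check like this is deliberately narrow. It leaves the decision about whether a drop is an error or a legitimate revision to the volunteer, which matches the division of labor described above: automation assists, people verify.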
Because the data was so complex, we built out a reporting operation that made hundreds of contacts with state and federal officials to understand their numbers. We were instrumental in shaking more and better data out of states. Working with local reporters, we were also able to apply pressure to states to release more race and ethnicity data, improve the quality of their reporting, and provide details on long-term-care facilities, where a significant portion of COVID-19 deaths have occurred.
What was the hardest part of this project?
The data that the CTP deals with is extremely messy and heterogeneous. It is produced by 56 jurisdictions, each of which has its own data pipelines and reporting quirks. Our team must stitch together this data to create valuable statistics. This requires in-depth knowledge not only of the data that a state provides, but also of its dashboards, data definitions, and caveats.
State data is imperfect and sometimes erroneous, so our teams throw themselves at the walls of government opacity day after day, trying to shine a light on each of the metrics we track. States themselves don’t always understand what’s happening, and we’re often the very first to point out problems. Officials have told us that the federal government used our data to help the Coronavirus Task Force understand its own data.
As a brand-new endeavor, we had to establish relationships with every state, the federal government, public health officials, and others. We built a star-studded advisory board composed of public health experts, epidemiologists, technologists, and experts on racial inequities. We raised $1.5 million from top foundations (Rockefeller, Robert Wood Johnson, Chan Zuckerberg, Emerson Collective) to support the managers of our volunteers and build out our technical infrastructure. We did it with infrastructural support from The Atlantic, but raised the funds ourselves.
The biggest obstacle is the sheer amount of work required to do what the CTP does. It takes thousands of hours of work a week, all of which requires precision, knowledge, and dedication. We had to build a culture that would bring people in and keep them coming back, despite the ghastly nature of our work. Our data entry team leads have made the CTP a welcoming place that supports those who do the work.
What can others learn from this project?
CTP is a testament to the power of collaborative journalism. We were able to spin up the project quickly and do urgent work by building a coalition of not just journalists, but academics, data scientists, epidemiologists, technologists, and public health experts. The model took shape in the crucible of the U.S. pandemic, but it could be applied to other beats that require cross-disciplinary expertise.
We’ve also shown the ability to work with the public to do journalism. We recruited and trained hundreds of volunteers from all over the country and all walks of life—some of whom had no previous experience with data. We used simple, distributed tools to do this: a Google form, Zoom training sessions, Slack, and Google Sheets. We proved that it’s possible to build an all-remote journalism endeavor from scratch in the midst of a pandemic, creating a culture that kept volunteers engaged.
We’ve built a series of workflows to find, calculate, check, and double-check the data. We plan to share these processes publicly, so that newsrooms that work with messy datasets can learn from our experience.
We also showed just how important it is to analyze how governments are collecting and defining data. Our data quality team and journalists not only worked to come up with definitions and analyses to help the public understand what data is available, but also held officials accountable for the data they published. The COVID Tracking Project makes all of its data public, along with data definitions, detailed usage guides, and a help desk that’s answered thousands of questions.
Finally, this project is proof that journalists can do a better job compiling and explaining data than the federal government.