2021 Citation

Tracking the Coronavirus

Country/area: United States

Organisation: The New York Times

Organisation size: Big

Publication date: 3 Mar 2020

Credit: The New York Times

Jury’s comments:

A huge investment of people and resources, even for the New York Times. The judges were especially impressed with the way the Times made their considerable data efforts an open resource, and became part of a broader data-sharing community. Good data was as critical as it was in short supply, and the Times led the industry in sharing it. The Times’s Covid GitHub repository has, at this writing, some 2,800 forks and more than 1,800 commits.

Project description:

In the vacuum left by the lack of a coordinated federal effort to disseminate data on the pandemic, we launched a project that would become the newsroom’s most ambitious data-tracking effort ever. We were one of the first organizations to provide county-level data on cases and deaths. We also compiled authoritative databases on case clusters at nursing homes, food-processing facilities, prisons and colleges. Our efforts focused on topics of public importance where timely, reliable data wasn’t available from the government. The data were presented on over 80 tracking pages, used in stories and made available to the public on GitHub.

Impact reached:

The project had a tremendous impact on many levels. Readers tell us they check the tracking pages every day, and those pages are among the most-viewed we have ever published.

We decided early to make the county-level database freely available, extending the reach of the data far beyond the pages of The Times. It has been cited in more than 60 peer-reviewed scientific papers. It has been cited by federal agencies, like the Council of Economic Advisers, CDC and Department of Veterans Affairs, which said it would be used to provide better acute care in rural areas. Local and state officials cited the data in formulating policies. Health care companies said they were using the data to manage patients. Other companies said they were using the data to help inform when to return to the office.

The data has been cited in many news publications, and Google uses the data to power its U.S. and state dashboards whenever anyone searches for “covid cases.”

Our cluster data revealed how colleges drove Covid into areas that had been relatively virus free. It has revealed that about a third of Covid deaths are linked to nursing homes. And it has shown the wide racial gap in who gets sick and who dies.

When we began tracking nursing home cases, most states were not identifying affected facilities. We filed dozens of public records requests to surface this information. When we began tracking college cases, very few were proactively disclosing numbers. We filed more than 200 public records requests, and by mid-fall most colleges were sharing data. To analyze racial disparities, we filed a Freedom of Information Act request and successfully sued the CDC for access to its internal database. In total, we filed more than 400 public records requests for virus data.

Techniques/technologies used:

The county dataset of cases and deaths is stored in a PostgreSQL database. Node.js was used to build a data admin for doing QA and managing multiple sources of data for every geography, as well as a suite of more than 300 scrapers to collect the data. The nursing home system extends Google Sheets for use as a relational database, with both automated scrapers and news assistants entering data for individual states and facilities. After new data is collected, normalized, and fact checked, a custom Node.js script integrates the new data into the database.

The data are frequently pushed to thousands of pages on nytimes.com, which use JavaScript (D3 and Svelte) to dynamically generate text and visualizations that explain the latest state of the outbreak in every state and county, in nursing homes, colleges, prisons and more. Google Drive, Google Documents, Microsoft Excel, Microsoft Access and R were also used to manage data and do analysis.

Some of the data collected through The Times’s survey process has also been analyzed by Times journalists and joined with other large datasets to produce investigative stories. Journalists joined The Times’s long-term care database to the federal government’s database, as well as to data on the racial make-up of nursing homes. They used regression analysis to determine whether there were patterns in the homes that had cases or deaths. Journalists joined the county-level database with Census data to identify counties where college students comprised at least 10 percent of the population. The journalists then took that list of about 200 counties and cross-referenced it with The Times’s college cases database. The resulting stories showed that college campuses were driving the spike in cases in the early fall, and that college outbreaks likely led to deaths in the wider community.
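
The county join described above can be illustrated with a minimal Python sketch. It is not The Times’s code (the analysis tools named in this entry are R, Excel and Access); the `County` record, the function names, the county identifiers and all figures below are hypothetical and exist only to show the shape of the join: select counties where college students are at least 10 percent of the population, then cross-reference that list with college case counts.

```python
from dataclasses import dataclass


@dataclass
class County:
    county_id: str          # hypothetical identifier, standing in for a county code
    population: int         # total residents, e.g. from Census data
    college_students: int   # enrolled college students living in the county


def college_counties(counties, threshold=0.10):
    """Return the ids of counties where students are at least `threshold` of residents."""
    return {
        c.county_id
        for c in counties
        if c.population > 0 and c.college_students / c.population >= threshold
    }


def cases_in_college_counties(counties, college_cases, threshold=0.10):
    """Sum college-linked cases per selected county.

    `college_cases` is an iterable of (county_id, cases) pairs, one per college.
    Colleges in counties below the threshold are ignored.
    """
    totals = {county_id: 0 for county_id in college_counties(counties, threshold)}
    for county_id, cases in college_cases:
        if county_id in totals:
            totals[county_id] += cases
    return totals


# Invented sample data, for illustration only.
sample_counties = [
    County("A", 100_000, 15_000),   # 15 percent students: selected
    County("B", 200_000, 10_000),   # 5 percent students: not selected
    County("C", 50_000, 6_000),     # 12 percent students: selected
]
sample_college_cases = [("A", 120), ("A", 30), ("B", 500), ("C", 45)]

sample_totals = cases_in_college_counties(sample_counties, sample_college_cases)
```

With the sample data, counties A and C pass the 10 percent threshold, and the two colleges in county A are summed into a single county total.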

What was the hardest part of this project?

The Times has several internal databases dedicated to tracking U.S. coronavirus cases at the county level and clusters at facilities. Each database has strict methodology for entry and verification, developed by Times journalists.

The county-level case database, vaccine database and nursing home database were initially manual operations, with a team of journalists checking public websites and Twitter feeds or contacting state and county governments and individual facilities. All three are now powered by both manual collection and computerized processes created by Times developers. The college, prison and cluster databases use manual collection only, and often involve extensive email or phone correspondence with government officials and business representatives. The collection effort includes dozens of journalists, who survey government entities and private facilities, collect and fact-check data, build and run automated collection systems and present the work in text and visually.

With little federal guidance for reporting, nursing homes, prisons, colleges and even local health departments often define and report cases and deaths in different ways. The Times has developed strict methodology to make sure cases and deaths are counted in a consistent way. For example, some colleges report positive tests rather than unique positive cases. Because it is possible for individuals to test positive multiple times, journalists have asked every college that reports positive tests (there are hundreds) to confirm the number of unique cases. If the college is unable to confirm a number or refuses to respond, those cases are removed from the total number of cases tied to colleges overall, and those colleges have been marked clearly in the online tracker as having potential duplicate test results. The Times also has strict fact-checking rules in place to make sure cases and deaths are sourced properly and checked by multiple people before publication.
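
The college counting rule above can be written out as a short Python sketch. This is an illustration of the rule as described, not the newsroom’s implementation; the `CollegeReport` fields, the `tally_college_cases` function and the sample figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollegeReport:
    name: str
    reported: int                            # figure published by the college
    reports_positive_tests: bool             # True if the figure counts tests, not people
    confirmed_unique: Optional[int] = None   # unique cases confirmed on request, if any


def tally_college_cases(reports):
    """Apply the counting rule: return (total unique cases, names of flagged colleges).

    - Colleges that report unique cases count as reported.
    - Colleges that report positive tests count only if they confirmed a
      unique-case number; that confirmed number is used.
    - Otherwise the college is left out of the total and flagged as having
      potential duplicate test results.
    """
    total = 0
    flagged = []
    for report in reports:
        if not report.reports_positive_tests:
            total += report.reported
        elif report.confirmed_unique is not None:
            total += report.confirmed_unique
        else:
            flagged.append(report.name)
    return total, flagged


# Invented sample data, for illustration only.
sample_reports = [
    CollegeReport("College X", 200, reports_positive_tests=False),
    CollegeReport("College Y", 350, reports_positive_tests=True, confirmed_unique=310),
    CollegeReport("College Z", 90, reports_positive_tests=True),
]

sample_total, sample_flagged = tally_college_cases(sample_reports)
```

In the sample, College Z reports positive tests and has not confirmed a unique count, so its 90 results stay out of the total and the college is flagged.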

What can others learn from this project?

Problems in the data frequently led to story ideas. The county-level data did not contain enough information to analyze racial disparities on a national level. So, The Times filed a Freedom of Information Act request and eventually sued the C.D.C. for an anonymized database of individual confirmed cases along with characteristics of each infected person. The C.D.C. provided data on 1.45 million cases reported to the agency by states through the end of May. Many of the records were missing critical information The Times requested, like the race and home county of an infected person, so the analysis was based on the nearly 640,000 cases for which the race, ethnicity and home county of a patient were known. The data allowed journalists to measure racial disparities across 974 counties, accounting for about 55 percent of the nation’s population, a far wider look than had been possible previously.
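
The filtering step described above, which kept only records with known race, ethnicity and home county, might look like the following Python sketch. The field names, the placeholder values treated as missing and the sample records are assumptions made for illustration; the layout of the C.D.C. file is not described in this entry.

```python
from collections import Counter

# Hypothetical field names; the real column names are not given in the source.
REQUIRED_FIELDS = ("race", "ethnicity", "county")

# Hypothetical markers for missing values.
MISSING_VALUES = {None, "", "Unknown", "Missing"}


def is_complete(record):
    """True if every required field is present and not a missing-value marker."""
    return all(record.get(field) not in MISSING_VALUES for field in REQUIRED_FIELDS)


def usable_cases(records):
    """Keep only the records that can be used in the disparity analysis."""
    return [record for record in records if is_complete(record)]


def cases_by_county_and_group(records):
    """Count usable cases per (county, race, ethnicity) combination."""
    counts = Counter()
    for record in usable_cases(records):
        counts[(record["county"], record["race"], record["ethnicity"])] += 1
    return counts


# Invented sample records, for illustration only.
sample_records = [
    {"race": "Black", "ethnicity": "Non-Hispanic", "county": "A"},
    {"race": "White", "ethnicity": "Hispanic", "county": "A"},
    {"race": "Unknown", "ethnicity": "Hispanic", "county": "A"},   # dropped: race unknown
    {"race": "White", "ethnicity": "Non-Hispanic", "county": ""},  # dropped: no county
    {"race": "Black", "ethnicity": "Non-Hispanic", "county": "A"},
]

sample_counts = cases_by_county_and_group(sample_records)
```

Turning these counts into rates would additionally require population figures for each group in each county, which this sketch does not cover.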

Like many newsrooms across the country, The Times undertook this sprawling data collection and reporting effort remotely, with dozens of journalists working from laptops in their homes on shared databases. As cases and deaths have swelled, the databases housing this information have buckled under their weight multiple times, requiring creative solutions. In mid-March, the database, a shared Google spreadsheet, suddenly stopped functioning because it had too many cases and too many people working in it. Journalists quickly had to redesign the database, splitting it in two parts, in order to keep the tracking effort going. The nursing home database alone has undergone multiple redesigns because there have been so many cases and deaths at so many facilities.

Project links:

www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

Nursing Home Covid Tracker – first published May 11, 2020, and updated regularly since then: www.nytimes.com/interactive/2020/us/coronavirus-nursing-homes.html

College Covid Tracker – first published July 28, 2020, updated regularly since then: www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html

The Fullest Look Yet at the Racial Inequity of Coronavirus – published July 5, 2020: www.nytimes.com/interactive/2020/07/05/us/coronavirus-latinos-african-americans-cdc-data.html

Track Coronavirus Cases in Places Important to You – first published Nov. 24, 2020: www.nytimes.com/interactive/2020/us/covid-cases-deaths-tracker.html

How Full Are Hospital I.C.U.s Near You? – first published Dec. 16, 2020: www.nytimes.com/interactive/2020/us/covid-hospitals-near-you.html

See How the Vaccine Rollout is Going in Your State – first published Dec. 11, 2020, updated regularly since then: www.nytimes.com/interactive/2020/us/covid-19-vaccine-doses.html