The Opioid Files
Category: Best data-driven reporting (small and large newsrooms)
Country/area: United States
Organisation: The Washington Post
Organisation size: Big
Publication date: 16/07/2019
The Opioid Files for the first time identified not only the counties flooded with the largest number of prescription opioid pills at the height of the prescription drug crisis, but also the specific manufacturers, distributors and pharmacies that were responsible for bringing those pills into communities. The Post found that over the seven years from 2006 through 2012, more than 76 billion pills of hydrocodone and oxycodone were shipped to pharmacies across the country, more than enough for one pill per person per day in some communities.
The database was the largest that The Post has ever published, containing 380 million records. We made it searchable for the public and other journalists, generating at least 150 stories in 35 states by other media outlets (133 local and 17 national) and more than 50,000 downloads of the data by individuals interested in doing their own digging.
The outlets included the Philadelphia Inquirer, the Detroit Free Press, Minneapolis Star Tribune, the Boston Globe, the Chicago Sun Times, the Arizona Republic, the Columbus Dispatch, the Tampa Bay Times, the Fort Lauderdale Sun-Sentinel and the Portland Oregonian.
Many smaller outlets wrote as well, from the Daily Mountain Eagle in Alabama to the Paintsville Herald in Kentucky to Wenatchee World in Washington state.
The documents The Post obtained shed light on the industry strategy to expand the market and fight the DEA’s attempts to hold companies accountable.
The documents provide answers to the enduring mystery of how the drug companies were able to weaken the DEA’s most powerful enforcement weapon at the height of the crisis, by enlisting members of Congress and developing “tactics” and a “Crisis Playbook” aimed at undermining the DEA.
Upon receiving the ARCOS data, our first challenge was finding a way to parse the large dataset. We wrote an ETL (extract, transform, load) pipeline with Unix, Python and R scripts that broke the massive CSV file into smaller chunks, converted those chunks into Apache Parquet files (a columnar data format used for large-scale data analysis), and loaded the data into memory as needed.
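As a rough illustration of the chunking step only, the sketch below streams a large CSV into smaller files using Python's standard library. It is not The Post's pipeline: the function name `split_csv`, the chunk file naming and the chunk size are all hypothetical, and the later Parquet conversion (which would need a library such as pyarrow) is not shown.

```python
import csv
from pathlib import Path


def split_csv(source: Path, out_dir: Path, rows_per_chunk: int) -> list[Path]:
    """Stream a large CSV into numbered chunk files, repeating the header in each.

    Hypothetical helper for illustration; rows are read one at a time,
    so the full file never has to fit in memory.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths: list[Path] = []
    out_file = None
    writer = None
    with source.open(newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # first line is assumed to be the header
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                # Close the previous chunk and start a new one.
                if out_file is not None:
                    out_file.close()
                path = out_dir / f"chunk_{len(chunk_paths):05d}.csv"
                out_file = path.open("w", newline="")
                writer = csv.writer(out_file)
                writer.writerow(header)
                chunk_paths.append(path)
            writer.writerow(row)
    if out_file is not None:
        out_file.close()
    return chunk_paths
```

Because every chunk carries its own header, each file can be converted or analyzed independently of the others.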
We fine-tuned our scripts to run everything in parallel (using the pandas and dask Python libraries), which reduced the time of our analysis from hours to minutes. This process allowed us to quickly iterate on new ideas as the story unfolded without waiting for all the data to load. And by doing the analysis in both Python and R, we were able to audit each other’s analyses and make sure our methodologies were sound. We then published the county-level data on Amazon S3 to allow other reporters and researchers to download the data.
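The general pattern behind this kind of speedup is to aggregate each chunk separately and then merge the partial results. The sketch below shows that pattern with the standard library rather than pandas or dask, which is a deliberate substitution to keep the example self-contained. The column names `county` and `pills` and both function names are hypothetical, and threads are used only for simplicity; a CPU-bound job of this size would more likely use processes or a framework such as dask.

```python
import csv
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pills_by_county(chunk_path: Path) -> Counter:
    """Sum the hypothetical 'pills' column per 'county' for one chunk file."""
    totals: Counter = Counter()
    with chunk_path.open(newline="") as f:
        for row in csv.DictReader(f):
            totals[row["county"]] += int(row["pills"])
    return totals


def total_pills_by_county(chunk_paths: list[Path], workers: int = 4) -> Counter:
    """Aggregate every chunk concurrently, then merge the partial totals."""
    combined: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(pills_by_county, chunk_paths):
            combined.update(partial)  # Counter.update adds counts together
    return combined
```

Since each chunk is summarized on its own, a new question can be tested against a handful of chunks before running it over the full dataset.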
Next, we geocoded (generated coordinates based on an address) every single pharmacy address in the database. (We had to do this manually for at least 3,000 pharmacies.) Then, with Node.js, we pulled U.S. census data from IPUMS and analyzed the number of pills distributed within a 5- to 10-mile radius using buffers generated with turf.js.
In sum, the project drew on several technology stacks and analysis approaches to dig into the data.
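To make the radius analysis concrete, here is a minimal sketch of counting pills shipped to pharmacies within a given distance of a point. It swaps the turf.js polygon buffers described above for a simpler great-circle (haversine) distance check, and it is written in Python rather than Node.js. The function names, the tuple layout of the pharmacy records and the coordinates in the usage note are all made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def pills_within_radius(center: tuple[float, float],
                        pharmacies: list[tuple[float, float, int]],
                        radius_miles: float) -> int:
    """Sum pills for pharmacies (lat, lon, pills) inside the radius around center."""
    lat0, lon0 = center
    return sum(
        pills
        for lat, lon, pills in pharmacies
        if haversine_miles(lat0, lon0, lat, lon) <= radius_miles
    )
```

For example, `pills_within_radius((38.0, -82.0), records, 5)` would total the pills for every hypothetical record within 5 miles of that point. A distance check like this approximates a circular buffer; a polygon buffer becomes more useful when the area must be intersected with other shapes such as census tracts.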
What was the hardest part of this project?
We originally filed a Freedom of Information request to the DEA, but the agency did not provide the data. The Post then intervened in a civil lawsuit against two dozen drug companies and pharmacies in Cleveland to gain access to the data.
It took more than three years and the intervention in the opioid lawsuit to obtain the data. The Post could not find an attorney at any of the large D.C. law firms because they were already representing drug companies or pharmacies in the case.
We were able to hire a sole practitioner from Akron, Ohio, who successfully argued to the 6th U.S. Circuit Court of Appeals that the DEA data, along with internal company documents in the case, should be unsealed and released to the public.
On July 15, when we eventually received the data, it contained 380 million transactions. The Post made an emergency purchase of a custom-adapted Dell Precision 5820 Workstation. Working around the clock, we produced our first story two days after the data was released.
But the stories could not have been written without old-fashioned shoe-leather reporting and the careful cultivation of sensitive sources. We needed expert guides who could explain the numbers and point us to the most compelling documents, out of the tens of thousands that were released from the lawsuit, including depositions and internal emails.
More than that, we needed the resources of the entire Post newsroom, eight departments working together, to create a public-facing interactive database and an online repository for the most important documents. We did this under enormous deadline and competitive pressure.
What can others learn from this project?
This project would not have happened without sources and expert legal assistance. Our advice would be to cultivate sources deep inside the agency or company you are investigating to understand how the place works and what kinds of documents are available. Our sources were able to tell us about the existence of the database at the DEA, the kind of information it contained and what that information might reveal about the opioid epidemic. We also relied heavily on legal counsel both inside and outside of The Post. Our outside lawyer filed to intervene in the federal litigation and convinced a U.S. appeals court in Ohio to release the database and unseal tens of thousands of corporate emails, memos and other documents.