MIT Technology Review conducted a months-long data investigation into the Justice Department’s China Initiative, a controversial effort to counter Chinese economic espionage, creating the first searchable database of cases. The initiative had been criticized for racial profiling, its chilling effects on science, and general ineffectiveness in catching economic spies. However, prior to our investigation, it had not been subjected to a data-driven analysis—partly because DOJ does not make the data easily available. We found that many cases have little connection to trade secret theft; an increasing number target academics; and nearly 90% of defendants charged are of Chinese heritage.
In the month since publication, our story has become the definitive source for data on China Initiative cases, filling a gap left by the Justice Department. Our investigation, or responses to it, has been cited in nearly every subsequent story on the initiative. On Jan. 10, for example, the Boston Globe editorial board called for the DOJ to end its targeting of scientists, citing extensively from our data and findings.
Even before we published, our investigation prompted the Justice Department to update its public webpage of cases (the only repository for information related to the initiative) for the first time in five months. Two days after we approached the DOJ with questions, the department removed 17 cases and 39 defendants from its China Initiative page, added two cases, and updated others with sentencing and trial information, where available.
According to lawmaker Judy Chu, a Democrat from California, our findings have answered questions that she has posed to law enforcement and intelligence officials—without luck. “Whenever we ask for data,” she told us when we briefed her on our initial findings, neither the FBI nor Justice Department officials “give it back to us. What you have are numbers, and it is startling to see what [they] are,” she said.
Our database has also changed the minds of some of the officials involved in the China Initiative’s design. Andrew Lelling, the former U.S. Attorney for Massachusetts who served on the steering committee that shaped the Initiative, shared our story in a LinkedIn post and called for the DOJ to “revamp, and shut down, parts of the program, to avoid needlessly chilling scientific and business collaborations with Chinese partners.”
“MIT Technology Review did a great job breaking down the Initiative’s record, three years in,” he wrote.
We scraped the DOJ webpage on the China Initiative announcing arrests and indictments. We then added details based on a review of thousands of pages of federal court documents, DOJ public statements and congressional testimony, and interviews with defense attorneys, family members, and others linked to the defendants, to build the first searchable database of China Initiative cases. Using Airtable, we initially built two linked datasets: one organized by case, and one organized by defendant, which allowed us to easily analyze how the Justice Department was carrying out these prosecutions while also examining the people impacted by the initiative. We then combined our two datasets to create a public, searchable database. We also used Google Sheets for data analysis. We cross-checked our dataset with other case data that has been collected privately by civil rights groups. We used Datawrapper to visualize some of our main findings.
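The two-dataset structure described above can be illustrated with a minimal Python sketch. It is not the Airtable setup itself: the field names (`case_id`, `charge_type`, `is_academic`) and every record below are hypothetical placeholders, chosen only to show how a case table and a defendant table can be linked and then combined into one row per defendant.

```python
from collections import Counter

# Hypothetical placeholder records, NOT real entries from the database.
cases = [
    {"case_id": "C1", "charge_type": "trade secret theft"},
    {"case_id": "C2", "charge_type": "research integrity"},
]
defendants = [
    {"defendant_id": "D1", "case_id": "C1", "is_academic": False},
    {"defendant_id": "D2", "case_id": "C1", "is_academic": False},
    {"defendant_id": "D3", "case_id": "C2", "is_academic": True},
]


def combine(cases, defendants):
    """Join each defendant to its case, producing one row per defendant."""
    cases_by_id = {case["case_id"]: case for case in cases}
    return [{**cases_by_id[d["case_id"]], **d} for d in defendants]


def share(rows, key, value):
    """Return the fraction of rows whose `key` field equals `value`."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[key] == value) / len(rows)


combined = combine(cases, defendants)

# Case-level question: how many cases fall under each charge type?
charge_counts = Counter(case["charge_type"] for case in cases)

# Defendant-level question: what share of defendants are academics?
academic_share = share(combined, "is_academic", True)
```

Keeping the two tables separate lets case-level counts and defendant-level counts be computed without double counting, while the combined table supports searching across both.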
To track the changes that the Justice Department made on its website, we used the Python library “wayback” to download all historical copies of the Department of Justice China Initiative press release index page from the Internet Archive. With snapshots of the page from February through June 2021 in hand, we compared them and logged changes to the press release index page over time. We used the Wayback Machine’s new changes tool to visualize the number of changes that the Justice Department made in response to our reporting.
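The comparison step can be sketched in Python as follows. This is an illustration under assumptions, not the code we ran: it skips the download (which would go through the “wayback” library) and starts from snapshots already saved as HTML strings, and it assumes that comparing the set of links on each snapshot is a reasonable proxy for press releases being added or removed. The timestamps and paths in the example are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def extract_links(html):
    """Return the set of link targets found in one snapshot."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def log_changes(snapshots):
    """Compare consecutive snapshots and log links added or removed.

    `snapshots` is a list of (timestamp, html) pairs; timestamps must
    sort chronologically (for example ISO dates).
    """
    ordered = sorted(snapshots, key=lambda snap: snap[0])
    changes = []
    for (_, old_html), (new_ts, new_html) in zip(ordered, ordered[1:]):
        old_links = extract_links(old_html)
        new_links = extract_links(new_html)
        added = sorted(new_links - old_links)
        removed = sorted(old_links - new_links)
        if added or removed:
            changes.append(
                {"timestamp": new_ts, "added": added, "removed": removed}
            )
    return changes


# Hypothetical snapshots for illustration only.
example_snapshots = [
    ("2021-02-01", '<a href="/example/release-a">A</a>'
                   '<a href="/example/release-b">B</a>'),
    ("2021-06-01", '<a href="/example/release-b">B</a>'
                   '<a href="/example/release-c">C</a>'),
]
example_changes = log_changes(example_snapshots)
```

Logging only the snapshots where something changed keeps the record short enough to review by hand, which matters when each removal has to be checked against court documents.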
What was the hardest part of this project?
The biggest challenge was deciding what to do with the limited, and changing, data available. Our main data source was the Department of Justice’s China Initiative webpage, which we knew to be an incomplete record. The Justice Department deleted and added press releases without explanation, sometimes after a case previously touted as a success story for the program had fallen apart.
We also had to settle on criteria for what data to include. Because the cases were quite different, we put a lot of thought into how to come up with the most accurate categories and labels. To explain how we decided what to include, and why, we wrote an accompanying essay on our methodology, the changes that the Justice Department made after we published, and our own disclosures of where there could be perceived conflicts.
What can others learn from this project?
Our biggest lesson is that even limited datasets can be useful, as long as their limitations are clearly explained. In fact, data gaps can themselves become a key part of the story. In our case, the fact that the Justice Department made such big changes in response to our questions further illustrated our point about the program’s troubling lack of transparency. Another lesson is that data investigations don’t always require very sophisticated tools or techniques. Additionally, whenever possible, make the dataset itself available and searchable for increased utility and impact, and invite collaboration. For us, that collaboration took the form of sharing and comparing notes with outside groups conducting their own research, as well as including a way for readers to get in touch, which has led to submissions of new data (additional court cases) that we may include in a future version of the database.