Category: Best data-driven reporting (small and large newsrooms)
Country/area: United States
Organisation: BuzzFeed News
Organisation size: Big
Publication date: 10 Mar 2019
Credit: Jeremy Singer-Vine, Kevin Collier
In 2017, the Federal Communications Commission was bombarded with millions of fake public comments about its controversial plan to repeal “net neutrality.” It remains the most prolific known instance of political impersonation in US history. Our investigation (based on an unprecedented data analysis, court records, business filings, and interviews with dozens of people) revealed, for the first time, where 1.9 million of the fake comments originated, how they were generated, and the political operatives behind them. We also revealed that the operatives had worked on state-level campaigns that raised similar impersonation allegations.
By revealing the precise way in which the comment process had been compromised, our reporting has provided a level of specificity that may help spur reform. In his written opening statement to a Senate hearing on widespread problems in federal public commenting, Sen. Tom Carper — the ranking member of the Senate’s Permanent Subcommittee on Investigations — called BuzzFeed News’ findings “extremely troubling.”
In Texas, we unearthed new evidence regarding the role of LCX Digital in a controversial letter-writing campaign accused of impersonating local constituents. Stacy Hock, the chair of the school-choice group that commissioned the campaign, told BuzzFeed News that these findings were “alarming.” Hock told BuzzFeed News that the consulting firm that had hired LCX for the campaign — without her knowledge at the time, she said — had “launched an internal review” and was “demanding answers from LCX.” And depending on what those answers were, she said, “we will determine our future course of action, up to and including legal action.”
Our reporting also brought answers to people who knew they had been impersonated, but didn’t know by whom or how. One such person was Sarah Reeves; the political operatives had impersonated her mother — using her name, address, and email address — more than a year after she had died. They had also impersonated Sarah herself, attributing to her an opinion exactly the opposite of what she truly believed.
We used the federal Freedom of Information Act to obtain all comments submitted via the FCC’s bulk-uploading mechanism. (We also used FOIA to seek server logs pertaining to other comments; we were denied, we appealed, and then had that appeal denied, 3-2 along party lines, in a vote by the agency’s five commissioners.)
Once we had obtained the data, we ran large samples of the email addresses associated with those comments through Have I Been Pwned’s API (Application Programming Interface), to determine the data breaches in which they had been exposed. We believe this is the first-ever journalism investigation to make such extensive use of that service, and to use it to identify widespread public impersonation.
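As an illustration of the lookup step described above, here is a minimal sketch in Python. It is not the code used in the investigation. It assumes the current v3 `breachedaccount` endpoint of Have I Been Pwned, which requires an API key; the helper names (`sample_emails`, `http_fetch`, `breaches_for`) and the user-agent string are hypothetical.

```python
import json
import random
import urllib.error
import urllib.parse
import urllib.request

# Assumption: the v3 endpoint; the service's API has changed over time.
HIBP_URL = "https://haveibeenpwned.com/api/v3/breachedaccount/"


def sample_emails(emails, k, seed=0):
    """Draw a reproducible random sample of distinct, lowercased addresses."""
    unique = sorted(set(e.strip().lower() for e in emails if e.strip()))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


def http_fetch(email, api_key, user_agent="newsroom-breach-check"):
    """Return the raw response body for one address ('' if not found)."""
    request = urllib.request.Request(
        HIBP_URL + urllib.parse.quote(email),
        headers={"hibp-api-key": api_key, "user-agent": user_agent},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:  # the service answers 404 for unbreached addresses
            return ""
        raise


def breaches_for(email, fetch):
    """Return the breach names for one address, using any fetch callable."""
    body = fetch(email)
    if not body:
        return []
    return [entry["Name"] for entry in json.loads(body)]
```

A real run would pass something like `lambda e: http_fetch(e, api_key)` as `fetch` and would need to pause between requests to respect the service's rate limit. Keeping `fetch` injectable also makes it possible to cache responses, so the sample can be re-analyzed without repeating the queries.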
To analyze the Have I Been Pwned results, as well as to analyze patterns across millions of comments uploaded to the FCC over the past five years, we used a combination of Python, Pandas, Jupyter, Matplotlib, xsv, and VisiData. To obtain comments from prior FCC proceedings, we wrote web scrapers using Python and Requests.
We used a similar set of tools to analyze the Texas letter-writing data, with the addition of WHOIS records to obtain information about the owners of the IP addresses in that dataset.
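One way to put WHOIS results to work is sketched below. It is an illustration only and assumes the WHOIS records have already been reduced to a table of network ranges and owner names; the function names, the example owners, and the table layout are hypothetical. When ranges nest, the most specific one wins.

```python
import ipaddress
from collections import Counter


def owner_of(ip, networks):
    """Return the owner of the most specific network containing `ip`."""
    address = ipaddress.ip_address(ip)
    matches = [(net, owner) for net, owner in networks if address in net]
    if not matches:
        return None
    # The longest prefix is the narrowest (most specific) allocation.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def count_by_owner(ips, whois_rows):
    """Count IP addresses per owner; `whois_rows` holds (CIDR, owner) pairs."""
    networks = [(ipaddress.ip_network(cidr), owner) for cidr, owner in whois_rows]
    return Counter(owner_of(ip, networks) or "unknown" for ip in ips)
```

For example, `count_by_owner(["192.0.2.5"], [("192.0.2.0/24", "Hosting Co A")])` attributes the single address to "Hosting Co A". A tally like this shows at a glance whether submissions cluster on a few hosting providers rather than on residential networks.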
We also used GitHub and the Internet Archive to publish the final code, methodology, and anonymized data behind the FCC analyses.
What was the hardest part of this project?
The presence of fake comments in the 2017 FCC proceeding attracted attention at the time. Virtually every major publication in the country covered it. None, however, managed to trace the fake comments back to their data source. BuzzFeed News was able to do so through a unique combination of data analysis, online sleuthing, and traditional shoe-leather reporting.
One set of challenges involved obtaining, structuring, and analyzing more than 40 gigabytes of data directly relevant to the findings. The scale of those records required us to carefully design data-processing pipelines so that we could quickly run new analyses in response to new discoveries in our reporting.
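A pipeline of that kind can be sketched as a chain of small streaming steps, so that a multi-gigabyte file never has to fit in memory and a new question only requires a new aggregation function. The sketch below is illustrative; the column names `uploader` and `submitted_at` are assumptions, not the actual FCC field names.

```python
import csv
from collections import Counter


def read_rows(lines):
    """Yield one dict per CSV row without loading the whole file."""
    yield from csv.DictReader(lines)


def comments_per_day(rows):
    """Count comments per (uploader, day) from ISO-style timestamps."""
    counts = Counter()
    for row in rows:
        day = row["submitted_at"][:10]  # keep only the YYYY-MM-DD part
        counts[(row["uploader"], day)] += 1
    return counts
```

With a real file, this would be called as `comments_per_day(read_rows(open(path, newline="")))`, where `path` is a hypothetical CSV export.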
Another challenge: Piercing LCX Digital’s veil of secrecy. Before we began our investigation, the company had received virtually zero scrutiny and had left only a very light trace online. It presented itself online as an innocuous digital advertising outfit with cutting-edge technology — a self-portrayal that went unchallenged until we started digging deeper. To unravel LCX and its main owner’s many lies, BuzzFeed News interviewed former employees and business partners, scoured old versions of LCX’s website on the Internet Archive’s Wayback Machine, and combed through scores of business filings in a half-dozen states.
One major breakthrough came when BuzzFeed News discovered, in a previously unreported court case in San Diego County, an extraordinary deposition by one of the company’s cofounders. That deposition not only accused LCX’s main owner of extensive deception, but also claimed that the company was a “completely fraudulent” enterprise. The details of that alleged fraud bore remarkable similarities to patterns BuzzFeed News was beginning to uncover in the FCC proceeding, Texas, and South Carolina.
What can others learn from this project?
In addition to overcoming the challenges described above, the investigation also demonstrated two novel techniques, which have applications beyond this specific story.
The first novel technique was to use Have I Been Pwned’s API to identify which email addresses have appeared in database breaches. BuzzFeed News ran a large sample of all email addresses in the FCC comments against the HIBP API, and then calculated the breach rate for all bulk-uploaders. One uploader — Media Bridge’s Shane Cory, who had worked on the comments with LCX — stood out as a massive outlier, and thus became a major focus of our reporting.
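The breach-rate comparison can be sketched as follows. This is an illustration under stated assumptions: the input is a list of (uploader, was_breached) pairs from the sampled lookups, and the outlier rule (a rate more than `gap` above the median) is a hypothetical threshold, since the text does not say how the outlier was measured.

```python
from collections import defaultdict
from statistics import median


def breach_rates(records):
    """Return the share of breached addresses for each uploader."""
    totals = defaultdict(int)
    breached = defaultdict(int)
    for uploader, was_breached in records:
        totals[uploader] += 1
        breached[uploader] += bool(was_breached)
    return {uploader: breached[uploader] / totals[uploader] for uploader in totals}


def outliers(rates, gap=0.25):
    """Uploaders whose rate exceeds the median rate by more than `gap`."""
    typical = median(rates.values())
    return sorted(u for u, rate in rates.items() if rate - typical > gap)
```

Comparing each uploader against the median rather than against zero matters here: many legitimate email addresses have appeared in some breach, so the signal is an unusually high rate, not a nonzero one.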
Digging deeper, the HIBP data allowed us to identify the specific database breach that contained the vast majority of the email addresses. Later, we were able to confirm that the personal data Media Bridge submitted to the FCC was exactly the same as it appeared in that breach, down to idiosyncratic spelling and punctuation.
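The "down to idiosyncratic spelling and punctuation" check can be illustrated by comparing two match rates: one on the raw fields and one after normalizing case and whitespace. This is a sketch, not the investigation's code; records are assumed to be (name, address, email) tuples, and the function names are hypothetical.

```python
def normalize(record):
    """Lowercase each field and collapse runs of whitespace."""
    return tuple(" ".join(field.lower().split()) for field in record)


def match_shares(submitted, breach_records):
    """Return (exact_share, loose_share) of submitted records found in the breach."""
    if not submitted:
        return 0.0, 0.0
    exact_keys = set(breach_records)
    loose_keys = {normalize(record) for record in breach_records}
    exact = sum(record in exact_keys for record in submitted)
    loose = sum(normalize(record) in loose_keys for record in submitted)
    return exact / len(submitted), loose / len(submitted)
```

If the exact share is nearly as high as the loose share, the quirks of the breach data were carried over verbatim, which is much harder to explain by coincidence than a match on cleaned-up names and addresses.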
The second technique was to reverse-engineer the Mad Libs–style algorithm Media Bridge used to generate the comments, and then present that algorithm as an interactive graphic. As far as BuzzFeed News is aware, this was the first time a news organization had done something of that nature. To do so, BuzzFeed News wrote computer code to test hypotheses for how the algorithm worked, refine them, and prove the completeness of the reverse-engineered model. We ultimately calculated that the algorithm was capable of generating more than 9 sextillion “unique” comments.
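A toy version of such a model is sketched below, with invented placeholder phrases rather than the real comment text. It assumes the simplest possible structure: a fixed sequence of slots, each filled with one of several interchangeable phrases. The number of possible comments is then the product of the slot sizes, and "completeness" can be tested by checking that every observed comment is reproducible from the template.

```python
import math
import random
import re

# Hypothetical template: each inner list is one slot of interchangeable phrases.
TEMPLATE = [
    ["I urge", "I ask", "I encourage"],
    ["the commission", "the agency"],
    ["to reconsider", "to revisit", "to review"],
    ["this rule.", "this policy."],
]


def count_variants(template):
    """Number of slot combinations (an upper bound on distinct comments)."""
    return math.prod(len(slot) for slot in template)


def generate(template, rng):
    """Build one comment by picking one phrase per slot."""
    return " ".join(rng.choice(slot) for slot in template)


def template_regex(template):
    """Compile a pattern matching exactly the comments the template can produce."""
    parts = ["(?:" + "|".join(re.escape(p) for p in slot) + ")" for slot in template]
    return re.compile(" ".join(parts))


def unexplained(comments, template):
    """Return the observed comments the template cannot reproduce."""
    pattern = template_regex(template)
    return [c for c in comments if not pattern.fullmatch(c)]
```

Here `count_variants(TEMPLATE)` is 3 × 2 × 3 × 2 = 36. An empty result from `unexplained` over the full set of observed comments is the evidence that the hypothesized template is complete; any leftovers point to a missing slot or phrase, which is the refine-and-retest loop described above.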