The Citizen Browser project is a flashlight inside the black box of Facebook’s algorithms, allowing The Markup to monitor what content Facebook decides to amplify in people’s news feeds. In a year when Facebook is under unyielding scrutiny as whistleblowers come forward to expose the gap between the company’s public promises and its internal practices, Citizen Browser is an essential tool that delivers ground truth and enables empirical research on platform accountability. It is one of the few windows the public has into the impact of Facebook’s algorithms on its users.
Our reporting series highlights Facebook’s failures to live up to its own promises.
Following the publication of “Facebook Said It Would Stop Pushing Users to Join Partisan Political Groups. It Didn’t,” Sen. Ed Markey cited The Markup’s work in a Jan. 26 letter to Facebook CEO Mark Zuckerberg questioning the company’s broken promises regarding the promotion of political groups to its users. Leaked documents reveal that Facebook scrambled to address the issues raised by The Markup’s article the day it ran. Internal teams investigated the “leakage” of political groups into recommendations, and the issue was escalated to Zuckerberg himself. Employees identified several technical issues that may have contributed to political groups being recommended to users, and an employee declared the problem had been “mitigated” by Jan. 25, six days after the story ran. As recently as June, our data suggests that Facebook’s algorithms have continued to recommend political groups to its users.
Our report “Credit Card Ads Were Targeted by Age, Violating Facebook’s Anti-Discrimination Policy” caught congressional attention as well. After Facebook pledged to purge its site of the discriminatory financial services ads that The Markup uncovered in its research, Sen. Mazie Hirono cited The Markup’s reporting in her letter to Monika Bickert, Facebook’s vice president of content policy, calling the company’s response “evasive and inadequate.”
One of the companies referenced in this story, Hometap, used our research to prompt an audit of its ads with its ad agency. According to a statement from Hometap’s head of marketing, Rachel Keohan, “Following your outreach, we worked with our third-party digital agency to audit our ad campaigns, and determined that many of our Facebook advertisements were, in fact, still utilizing age ranges for targeting purposes. We’re in the process of updating all of our Facebook advertisements to no longer target audiences based on age.”
The Citizen Browser project is a pioneering investigation combining the infrastructure of national polling with modern data collection techniques. Our panel of users automatically shares data with us from their Facebook feeds, allowing us rare visibility into what content is pushed by Facebook’s algorithms.
Our panelists include more than 3,500 paid participants in the U.S. and 600 in Germany. The resulting dataset contains more than 20 million posts, 57 million recommended groups, and 3 million targeted advertisements.
To ensure panelists’ privacy, we built a data pipeline that automatically removes personally identifiable information (PII) from panelists’ feeds before making their data available for analysis. Maintaining data quality and panelist privacy requires constant upkeep: As Facebook updates its software, we must monitor and update code as well.
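A redaction step of this kind might look like the following minimal sketch. It is purely illustrative: the pattern names, placeholder tokens, and function are hypothetical, not The Markup's actual pipeline code.

```python
import re

# Hypothetical PII patterns -- illustrative only, not The Markup's real rules.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text: str, panelist_name: str) -> str:
    """Replace known PII in captured feed text with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    # Facebook often addresses the panelist by name, so strip that too.
    text = re.sub(re.escape(panelist_name), "[NAME REDACTED]", text,
                  flags=re.IGNORECASE)
    return text
```

Running redaction before any human or analysis code touches the data is what makes the "no one sees unredacted data" guarantee enforceable at the infrastructure level rather than by policy alone.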
The application, data-processing pipeline, and underlying cloud infrastructure were audited by third-party security research firm Trail of Bits.
Analyzing this vast dataset required keyword analysis, linear regressions, correlation analysis, classification, and ranking comparisons. We joined panelists’ data with information they provided about their demographic and political affiliations.
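As a minimal, hypothetical illustration of that join-and-compare step (the column names and toy numbers below are invented, not our panel data), feed records can be merged with panelists' self-reported affiliations and compared across cohorts:

```python
import pandas as pd

# Invented toy data: each row is one group recommendation shown to a panelist.
recs = pd.DataFrame({
    "panelist_id": [1, 1, 2, 2, 3],
    "is_political_group": [True, False, True, True, False],
})

# Invented demographics collected when panelists joined the panel.
panelists = pd.DataFrame({
    "panelist_id": [1, 2, 3],
    "party": ["Democrat", "Republican", "Independent"],
})

# Join feed data to demographics, then compare rates across cohorts.
joined = recs.merge(panelists, on="panelist_id")
rates = joined.groupby("party")["is_political_group"].mean()
print(rates)
```

The same merged table can then feed the keyword, regression, and ranking analyses described above.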
We also built interactive tools for the public. In Split Screen, readers can see differences in news sources, hashtags, and group recommendations between groups of Citizen Browser panelists. In our Trending on Facebook Twitter bot, we provide daily updates of the content that appeared most often in the past 24 hours in our panel. This bot accompanies our report on how sensationalist, partisan posts are more popular in feeds than Facebook claims.
What was the hardest part of this project?
Attempting to independently monitor Facebook is a massive challenge. Many research scientists have tried and failed to overcome the technical and legal hurdles to providing oversight of the world’s largest social network.
Facebook has a history of shutting down or dismissing attempts to monitor its platform. In 2019, the company made changes to obfuscate its code in a way that blocked ad collection efforts by ProPublica, Mozilla, and ad transparency group WhoTargetsMe. This summer, Facebook shut down the accounts of researchers working with the NYU Ad Observatory and then implemented new code that foiled automated data collection of posts—a technique researchers and journalists use to audit what’s happening on the platform on a large scale.
We tackled these challenges in two ways. First, we built our system in a privacy-preserving manner that we hoped would blunt any legal argument from Facebook about our compromising its users’ privacy. We used an isolated browser profile on the panelists’ computers to store sensitive Facebook session information, and we built a data pipeline for redacting PII. We built our cloud infrastructure so that no unredacted data could be seen by a person. We had all our software audited by a security research firm to verify these measures.
Second, we spend a lot of time adapting our software to Facebook’s frequent changes. Sometimes those changes happen when Facebook introduces a new feature, like the “flags” related to COVID-19 released this summer. The messages in these flags addressed the panelist by name, so we needed to update our redactors to strip that out. Other times, as when the platform modified accessibility attributes in its HTML to make it harder to rely on them to parse data from the page, it seems as if Facebook is updating its software to intentionally hinder our work.
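To illustrate why those attribute changes are disruptive, here is a simplified, hypothetical parser; the markup and labels are invented, not Facebook's actual HTML. Code that keys only on an accessibility attribute fails silently when the attribute disappears, so a text-based fallback plus monitoring becomes necessary:

```python
from html.parser import HTMLParser

# Hypothetical sketch: detect a "Sponsored" label first via an accessibility
# attribute, then fall back to visible text when that attribute is removed.
class SponsoredFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found_by_aria = False
        self.found_by_text = False
        self.in_span = False

    def handle_starttag(self, tag, attrs):
        # Preferred path: the attribute the parser was originally built on.
        if dict(attrs).get("aria-label") == "Sponsored":
            self.found_by_aria = True
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Fallback path: match on visible text when the attribute is gone.
        if self.in_span and data.strip() == "Sponsored":
            self.found_by_text = True

def is_sponsored(html: str) -> bool:
    finder = SponsoredFinder()
    finder.feed(html)
    return finder.found_by_aria or finder.found_by_text
```

In practice, a drop in the attribute-based match rate is itself a useful alarm that the platform's markup has changed and the parser needs updating.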
What can others learn from this project?
The most significant takeaway from this project is the importance of integrating engineering and editorial in the newsroom. Unlike in most newsrooms, our engineers and reporters work together on the same team and report to the same investigative editors. That allows them to work hand in hand to peek behind the curtain of Facebook’s algorithms. No one was willing to accept that those algorithms could forever remain opaque. We couldn’t have told these stories without the technology, and we couldn’t have built the tech without that shared mission.
Relatedly, holding tech platforms accountable requires speaking tech fluently yourself. Not every reporter needs to be able to tell their C++ from their C#, but being willing to question assumptions, shake off a fear of numbers, and dive headfirst into data can go a long way toward walking the walk.
We also hope other journalists recognize and emulate our prioritization of privacy. We were able to glean a tremendous amount of information from our panelists’ Facebook feeds all while honoring our promise to respect their privacy and never compromise their personal information. Yes, this required a bit of legwork to make possible, but we were willing to pay the “privacy tax” for a principle we hold so sacred.
Finally: show your work. We publish everything on GitHub for two reasons. First, it enables other newsrooms to repurpose our data, slicing and dicing it for their own investigations and takeaways, and we hope they do. Second, we believe in transparency: We hope other newsrooms follow suit by publishing their methodologies, helping fight the “fake news” narrative that journalists face.