Blacklight is a work of experiential journalism that allows people to investigate the state of web privacy in real time and on their own terms. Blacklight instantly reveals the potential privacy violations on any website—and names the companies tracking you.
The tool was used for a companion investigative story, which examined the role of free website building tools in inserting trackers on small, unsuspecting websites, including those that serve vulnerable populations.
Nearly one million people had used Blacklight to scan websites by early January, just a few months after we’d released it. Some have reached out to us to say they used the results to pressure the organizations they work for or websites they frequent to remove tracking technologies. At its height, when it was featured on the front page of Reddit, Blacklight was conducting more than 300 user-initiated scans every minute.
Many website operators contacted for the investigation “The High Privacy Cost of a ‘Free’ Website” removed user-tracking technology from their sites after we brought it to their attention, including the operators of several government webpages. Other website operators tweeted about doing it on their own after reading our story and scanning their site.
One smaller search engine even incorporated Blacklight, the code of which we published open-source, into its own product, allowing users to scan a featured site on the search results page before visiting. Some people are using Blacklight to scan the sites they rely on—and calling them out on Twitter for their ad-based tracking.
Reporters from Forbes, The Logic, and Vox used Blacklight to scan their employers’ sites and wrote about the results and how user tracking is employed in advertising-funded news operations. And a computational journalism class at Stanford has incorporated Blacklight into the syllabus.
Congresswoman Anna Eshoo, who represents Silicon Valley, wrote a letter to The Markup saying, “While companies that profit from surveillance capitalism may be upset by your decision, I stand in full support of this tool.”
Blacklight was written in Node.js and relies on AWS’s Lambda, S3, and CloudFront services. This combination of tools allowed us to build a real-time website privacy inspector that is both precise (the Node.js Puppeteer module gives us full control over a browser to run our tests) and easy to scale to user demand. Blacklight has never gone down, even when it was receiving more than 300 requests a minute.
We designed the privacy tests the tool runs by first using existing research to study the techniques used by tracking scripts, then programmatically scraping thousands of websites to find instances of those techniques. This approach ensured that we were testing for things seen in the real world, not just in academic papers.
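To give a sense of what a check in this family can look like, here is a minimal sketch of flagging third-party requests made by a page. It is an illustration under stated assumptions, not Blacklight's actual test code: the function names `naiveSite` and `thirdPartyHosts` are hypothetical, the `.example` hostnames are placeholders, and the "last two labels" rule is a deliberate simplification that a real tool would likely replace with a Public Suffix List lookup.

```javascript
// Naive "site" of a hostname: its last two labels.
// Assumption for illustration only; this is wrong for suffixes such as co.uk.
function naiveSite(hostname) {
  return hostname.split(".").slice(-2).join(".");
}

// Given the page URL and the URLs it requested while loading, return the
// sorted, de-duplicated hostnames that belong to a different site.
function thirdPartyHosts(pageUrl, requestUrls) {
  const pageSite = naiveSite(new URL(pageUrl).hostname);
  const hosts = new Set();
  for (const requestUrl of requestUrls) {
    const host = new URL(requestUrl).hostname;
    if (naiveSite(host) !== pageSite) {
      hosts.add(host);
    }
  }
  return [...hosts].sort();
}

// Usage with placeholder data (no real sites or vendors implied).
const found = thirdPartyHosts("https://www.news.example/story", [
  "https://cdn.news.example/app.js",
  "https://pixel.tracker.example/p.gif",
  "https://pixel.tracker.example/q.gif",
  "https://ads.adnet.example/ad.js",
]);
console.log(found);
```

In a browser-driven scan, the list of request URLs would come from the instrumented browser session rather than a hard-coded array.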
In addition, we tried to account for how Blacklight might be abused by malicious actors. For this reason we cache results for 24-48 hours on S3 so we’re not hitting the same website more than once a day.
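The caching rule described above can be sketched roughly as follows. This is a simplified illustration, not the production code: the in-memory `Map` stands in for S3, the 24-hour window is just the lower bound of the range mentioned, and the names `cacheKey`, `isFresh`, and `shouldScan` are hypothetical.

```javascript
// Assumed cache window: 24 hours, in milliseconds.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000;

// Normalize a submitted URL into a cache key so that trivially different
// inputs (letter case, trailing slash, fragment) share one cached scan.
// Query strings are ignored here, which is a simplification.
function cacheKey(inputUrl) {
  const url = new URL(inputUrl);
  const path = url.pathname.replace(/\/+$/, "");
  return `${url.hostname}${path}`;
}

// A cached result is fresh if it is younger than the TTL.
// Both arguments are millisecond timestamps.
function isFresh(cachedAt, now, ttlMs = CACHE_TTL_MS) {
  return now - cachedAt < ttlMs;
}

// Decide whether a new scan is needed, or a cached result can be served.
function shouldScan(cache, inputUrl, now) {
  const entry = cache.get(cacheKey(inputUrl));
  return !(entry !== undefined && isFresh(entry.cachedAt, now));
}

// Usage: one site was scanned at time 0.
const cache = new Map([["example.com", { cachedAt: 0 }]]);
console.log(shouldScan(cache, "https://Example.com/", 1000)); // cached copy is reused
console.log(shouldScan(cache, "https://example.com/", CACHE_TTL_MS)); // entry has expired
```

Keying on a normalized URL matters for the abuse case: without it, a caller could force repeated scans of the same site by varying capitalization or adding a trailing slash.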
The analysis and raw data on which the results are based can be downloaded from the tool, so that it can be used by journalists, researchers and others.
What was the hardest part of this project?
We faced several technical challenges. This tool required a lot of development for both the data collection and the analysis.
First, we had to determine how to measure the various potential privacy invasions of a website and also explain them in ways that were both precise and easy for a non-technical audience to understand. Second, we had to do a lot of testing to ensure we were getting accurate results for a large variety of websites and tracking vendors. We ended up collecting data for more than two million websites.
Blacklight carries out sophisticated tests, but its open-source code was written to be accessible to both professional and beginner programmers. We wanted to ensure technically literate folks could use the tool for their own purposes.
We also faced the challenge of explaining the privacy violations—and the role various companies played—in ways that were both accurate and understandable to a general audience.
What can others learn from this project?
Blacklight was a tool made, first and foremost, by journalists for journalists. We developed it to support our investigation “The High Privacy Cost of a ‘Free’ Website.” Since launch, a number of newsrooms have already used it for their own stories, including Forbes and Vox. A computational journalism class at Stanford University is launching a project this semester using Blacklight.
When it comes to online tracking, we talk a lot about the data and not enough about the companies that are deploying invasive practices to collect that data. Our tool shines a light on which companies are trying to get your data from a given website.
Blacklight was also built with accountability in mind; it allows readers to download a copy of the inspection report, along with all the data used to generate it. This data is used to make descriptive claims about the kinds of privacy violations we found on websites. Given the dynamic nature of the internet, it can be hard to make such claims with confidence. The inspection archive makes that possible by saving a snapshot of what was found.