We looked at the UK’s most popular health websites to see how they treat sensitive data from their users. Our investigation revealed that some of the health sites are sharing this information, including symptoms and drug names, with hundreds of third parties including Google and Facebook. In some cases this included an identifier potentially allowing the data to be tied to an individual. This was done without the explicit consent that is legally required in the UK.
The story was the front-page splash on the UK print edition of the Financial Times on 14th November 2019. It was picked up by the tech and health press, including the MIT Technology Review and BoingBoing. The UK’s Information Commissioner’s Office (ICO), which is responsible for data protection, issued guidance later that day on what the law says about handling highly personal ‘special category’ data, such as health data.
The following day Google, which our story showed was the biggest recipient of this data from the sites we looked at, announced plans to limit advertisers’ access to personal data when they bid for adverts.
I started with a list of the top 100 health sites produced by SimilarWeb, based on average UK monthly traffic. I ran this list through WebXray, an open-source tool. This opens each site and records all the HTTP requests made to third parties. It produces data which shows which sites contact third parties, and the domain names of those third parties.
Manual research, primarily using Whois, was used to link those domain names to companies, because many companies use multiple domains, and the domain names often do not include the company name.
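The two steps above (recording third-party requests, then linking domains to companies) can be sketched roughly as follows. This is not WebXray's code or output format: the `DOMAIN_OWNERS` table, the function names and the example URLs are hypothetical, and the domain matching is deliberately naive (it keeps the last two labels of the hostname, which is wrong for suffixes such as `.co.uk`).

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical hand-built lookup, standing in for the manual Whois research.
DOMAIN_OWNERS = {
    "doubleclick.net": "Google",
    "google-analytics.com": "Google",
    "facebook.net": "Facebook",
}


def base_domain(url):
    """Naive registrable domain: the last two labels of the hostname."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_companies(page_url, request_urls):
    """Group the third-party domains contacted by a page under their owners."""
    first_party = base_domain(page_url)
    found = defaultdict(set)
    for url in request_urls:
        domain = base_domain(url)
        if domain and domain != first_party:
            # Domains not yet researched are kept under "unknown".
            found[DOMAIN_OWNERS.get(domain, "unknown")].add(domain)
    return dict(found)


if __name__ == "__main__":
    requests_seen = [
        "https://www.example-health.com/app.js",
        "https://stats.g.doubleclick.net/collect",
        "https://www.google-analytics.com/analytics.js",
        "https://connect.facebook.net/en_US/fbevents.js",
        "https://cdn.tracker.io/t.gif",
    ]
    page = "https://www.example-health.com/symptoms"
    print(third_party_companies(page, requests_seen))
```

Grouping by owner rather than by domain is what makes a company that operates several domains show up as a single recipient.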
However, this didn’t show exactly what information was being sent. To find out, I used a tool called HTTP Toolkit, which intercepts all the requests being sent out by a site, and lets you search and explore the information they contain. We picked a few sites to look at more closely: ones that both asked for specific health information and contacted many third parties. I loaded up each site and filled in some information.
At this point I could check, before giving any consent, whether the site dropped a cookie. I also looked for anything that looked like the information I had given it, and anything that looked like a user identifier.
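The check described above was done by hand in HTTP Toolkit, but the idea can be sketched in code. The example below assumes each captured request has already been exported as a URL plus a form-encoded body; the `ID_PARAM` pattern, the function name and the sample request are illustrative assumptions, not anything taken from the tool or from the sites we examined.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumption: parameter names that often carry a user or device identifier.
# Illustrative only, not an exhaustive list.
ID_PARAM = re.compile(r"(^|_)(uid|cid|user_?id|client_?id|device_?id)$", re.I)


def inspect_request(url, body, entered_values):
    """Report entered values and identifier-like parameters in one request."""
    parts = urlsplit(url)
    # parse_qsl decodes percent-escapes and '+' so values can be compared
    # directly with what was typed into the page.
    params = parse_qsl(parts.query, keep_blank_values=True)
    params += parse_qsl(body, keep_blank_values=True)
    leaked = sorted(
        {
            entered
            for entered in entered_values
            for _, value in params
            if entered.lower() in value.lower()
        }
    )
    id_params = sorted({name for name, _ in params if ID_PARAM.search(name)})
    return {"host": parts.hostname, "leaked": leaked, "id_params": id_params}


if __name__ == "__main__":
    report = inspect_request(
        "https://tracker.example/tr?ev=SymptomSearch&q=chest+pain&cid=123.456",
        "drug=Metformin",
        ["chest pain", "metformin", "insomnia"],
    )
    print(report)
```

A match here is only a lead: any hit still needs checking by eye, since a parameter named like an identifier may not be one, and data can also be sent in encoded forms that a plain text search will miss.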
One of the more complex elements of the story to understand is that the third parties that have their ‘tags’ (code snippets) on the page often connect to other third parties, which in turn can bring in others, and so on. Furthermore, the extent to which this happens varies substantially in different parts of the world, likely because of different regulatory environments. To visualise this we gained access to Trackermap, a tool that displays the structure of these networks from different servers around the world. I scraped the data it produced, and processed it in Gephi using its layout algorithms.
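To give a sense of the data preparation behind a network like this, here is a minimal sketch that turns chains of tags (a site loads one party, which loads another) into a weighted edge list. It assumes the scraped data can be reduced to ordered lists of hostnames, which is a simplification made for illustration, and it writes a CSV with `Source`, `Target` and `Weight` columns, which Gephi's spreadsheet importer recognises for edge tables. The hostnames and function names are hypothetical.

```python
import csv
import io
from collections import Counter


def chains_to_edges(chains):
    """Count how often each party loads the next one along a chain."""
    edges = Counter()
    for chain in chains:
        # Pair every entry with the one that follows it: a -> b, b -> c, ...
        for source, target in zip(chain, chain[1:]):
            edges[(source, target)] += 1
    return edges


def edges_to_csv(edges):
    """Render the edges as CSV text with Source, Target and Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


if __name__ == "__main__":
    chains = [
        ["site.example", "tagmanager.example", "adnetwork.example"],
        ["site.example", "tagmanager.example", "analytics.example"],
    ]
    print(edges_to_csv(chains_to_edges(chains)))
```

Running the same preparation on data captured from servers in different regions would give one edge list per region, which can then be laid out and compared side by side.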
What was the hardest part of this project?
The hardest part of this project was making a story based around arcane technical concepts and legal requirements understandable and meaningful to readers.
We knew this could be an issue from the start. In fact, the original intention was more generally to look at ‘special category’ data, which has extra legal protection — including information about race, political opinions, and genetic data, as well as health. The decision to focus on health was made as it is a topic that affects absolutely everyone, giving us the broadest audience of people who would be able to personally relate to the story.
We also wanted to ensure that readers understood how this type of information was being tracked and shared without their knowledge or even consent in some instances. In particular, we wanted to explain how exactly this happened, and touch on the technical details that underpinned our reporting, but in a way that was understandable to the typical reader.
To do this, we picked one particular site, the WebMD symptom checker. This was a good example because not only is it a very popular, well-known site but, looking at the page, it doesn’t have a lot of adverts: just two straightforward images. It appears benign. However, almost all the incredibly personal data that people are likely to enter is sent to Facebook’s advertising platform. We created a video that walks through someone entering their symptoms, and then shows how this information ends up in HTTP requests to Facebook’s domain. We aimed to avoid technical jargon, but without dumbing down.
What can others learn from this project?
(1) That it is often the simplest ideas that hold most relevance to readers and create the biggest results. I have been involved in many projects where a lot more time has been invested for a lot less impact.
(2) A story such as this has many technical details. These were details that I’d spent a lot of time obsessing over, and that were essential to the data we’d collected, which formed the foundation of the story. But you need to be brave enough to leave many of them out to produce a story that’s comprehensible to the readers you want to reach, whilst also giving enough that readers trust your process. It is a difficult balance, and the temptation is to include all the numbers, interleaved with caveats and technicalities, so any potential criticism can be swatted. But just making a conscious effort to get this right can make a big difference.
(3) One criticism that I was expecting, but didn’t really end up receiving, was: ‘don’t we already know this?’ There have been many stories on privacy and the advertising industry, cookies, etc. I believe the reason we didn’t receive it is that we brought the issue into situations close to readers’ lives. That’s manifest in picking health as a topic, the video walking through a real situation, and the juxtaposition in the piece between the Hippocratic oath of times past, and the graphic laying out the technical and commercial reality of who receives information about your health today.