2020
Here’s what we found in the Collection #1-5 password leaks
Category: Best data-driven reporting (small and large newsrooms)
Country/area: Switzerland
Organisation: Swiss Radio & Television’s SRF Data
Organisation size: Small
Publication date: 3 Jun 2019

Credit: Timo Grossenbacher
Project description:
This was an investigation deep into the heart of the “Collection #1-5” password leaks that appeared on the web in early 2019. I showed that more than 3 million Swiss e-mail addresses and – more disquietingly – over 20’000 e-mail addresses of Swiss authorities and providers of critical infrastructure appear in the leak. Among them is the Swiss army, for which I found over 500 e-mail addresses and passwords in the leak. For this project, I used a so-called “big data technology”, namely Spark, to sift through the enormous amount of data.
Impact reached:
The investigation showed that Swiss authorities still seem to have a problem with employees using their business e-mail addresses for third-party services that were hacked in recent years (think of Yahoo, LinkedIn, Adobe.com et al.). Interestingly, some institutions seem to be more affected than others: for example, the Swiss army, which is grossly overrepresented among the federal e-mail addresses found in the leak. Various Swiss media outlets followed up on this finding. Apart from that, I can only guess that some of the authorities affected by the password leaks might reconsider their security practices. While some might have been alerted before my investigation, our publication might have put even more pressure on them to pay more attention to their password policies in the future.
Techniques/technologies used:
I wrote a full making-of of the project, available on my blog. In summary, I built a fairly complex data processing pipeline consisting of several R scripts. These used sparklyr, the R wrapper package for the Spark big data technology. I chose Spark because it makes it easier to process data that does not fit into memory, abstracting away some of the problems that would arise when dealing with the data in “plain” R. The process was threefold: In the first step, I made the thousands of files in the leak searchable by sanitizing their file names, and then parsed the contents of the files, which were often CSVs – primarily splitting the e-mail addresses into user, domain, subdomain, etc. In the second step, I filtered the data down to Swiss e-mail addresses only (.ch top-level domain). In the last step, I ran the usual aggregations and analyses on the cleaned data (e.g. how many addresses per domain and subdomain, average password length, etc.).
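To illustrate the idea (the actual pipeline was deliberately not published, see below), here is a minimal sparklyr sketch of these steps. The input path, the assumption that each line follows an “email:password” or “email;password” pattern, and all column names are illustrative only; the file-name sanitizing of the first step is omitted.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Step 1: read the raw leak files as text and parse e-mail address and password
raw <- spark_read_text(sc, name = "leak_raw", path = "data/collections/")

parsed <- raw %>%
  mutate(
    email    = tolower(regexp_extract(line, "^([^:;]+)[:;]", 1)),
    password = regexp_extract(line, "[:;](.*)$", 1)
  ) %>%
  filter(email != "") %>%
  mutate(
    user   = regexp_extract(email, "^([^@]+)@", 1),
    domain = regexp_extract(email, "@([^@]+)$", 1),
    tld    = substring_index(email, ".", -1)   # Spark SQL function, passed through
  )

# Step 2: keep only Swiss addresses (.ch top-level domain)
swiss <- parsed %>% filter(tld == "ch")

# Step 3: aggregate on the cleaned data, e.g. addresses per domain and
# average password length, then collect the (small) result into R
swiss %>%
  group_by(domain) %>%
  summarise(
    n_addresses   = n(),
    avg_pw_length = mean(nchar(password), na.rm = TRUE)
  ) %>%
  arrange(desc(n_addresses)) %>%
  collect()

spark_disconnect(sc)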
What was the hardest part of this project?
Certainly the large data volume and its variety of file formats. While it was quite straightforward to find and download the data, the problem boiled down to preprocessing, filtering and searching through a very, very large data set. This challenged me to learn a new technology (Spark) and use it in conjunction with R. After all, it would have been easier to contact a company specialized in cyber security and ask whether they could search the leak for Swiss e-mail addresses. Yet I think that tackling such problems wholly inside newsrooms has one tremendous benefit: you completely control what is done with the data, and you know exactly what can be read into it – and what cannot. At the same time, since we’re dealing with sensitive data and methods that can be used for malicious purposes, the approach cannot be made fully transparent. Thus, instead of publishing the source code, I wrote a blog post explaining the approach so that others can learn from it. This effort, undertaken by a single journalist, should certainly be considered.
What can others learn from this project?
I think one of the biggest takeaways of this project was to reduce the data volume early on, so that the data became progressively easier to work with. Secondly, investing in a robust data processing pipeline early on is key. That means you should come up with an automated workflow that can run for days without crashing – and when it crashes, it should restart itself. Think of cron jobs and shell scripts that check log files for errors (see the sketch below). Lastly, in this project I used sparklyr, the R wrapper for Spark. Sometimes I had a hard time finding good documentation for sparklyr. It might have been easier to work with Python, as the documentation for its Spark wrapper is better. So as a last piece of advice, I would argue that sometimes you should be willing to kill your darlings, e.g. use a scripting language other than your favourite one, even though you might have to dig into it first. For more take-home messages, consider reading my blog post about the project.
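As an illustration of that “restart on failure” idea (not code from the project), here is a small R sketch that wraps one hypothetical processing step in tryCatch, logs errors to a file that a cron job or shell script could watch, and retries with a short back-off. The function process_chunk() is a placeholder standing in for one batch of real pipeline work.

# Sketch only: process_chunk() is a hypothetical placeholder for one
# batch of the real pipeline (e.g. one sparklyr job over a set of files).
process_chunk <- function(chunk) {
  # ... the real work would happen here ...
  paste("processed", chunk)
}

process_with_retry <- function(chunk, max_attempts = 3, logfile = "pipeline.log") {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      process_chunk(chunk),
      error = function(e) {
        # append the error to a log file that a cron job / shell script can check
        cat(format(Sys.time()), "- chunk", chunk, "- attempt", attempt,
            "failed:", conditionMessage(e), "\n",
            file = logfile, append = TRUE)
        NULL
      }
    )
    if (!is.null(result)) return(result)
    Sys.sleep(60)  # wait a bit before retrying
  }
  stop("Giving up on chunk ", chunk, " after ", max_attempts, " attempts")
}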
Project links:
timogrossenbacher.ch/wp-content/uploads/2020/01/translation.txt
timogrossenbacher.ch/2019/03/big-data-journalism-with-spark-and-r/