Open Journalism Data by SRF Data
Category: Open data
Organisation: SRF Data
Organisation size: Small
Publication date: 1 Jan 2019
Credit: Timo Grossenbacher, Angelo Zehr, Felix Michel, Julian Schmidli
Apart from covering news stories with data-driven explainers and uncovering misconduct or corruption through data investigations, SRF Data also publishes the large majority of the code and data behind its stories. This often concerns governmental data that was either previously unpublished or available only in an inaccessible format. All code and data are published on an overview page, which links to the respective stories and GitHub repositories. With this service, SRF Data is a major producer of open data in Switzerland – data that can be re-used and re-published by other journalists and the public.
Unlike at many other news organizations, each published code repository is described in great detail. The consistent use of RMarkdown and the provision of all necessary raw data allow our analysis scripts to be fully reproduced and understood by laypeople. In recent years, a lot of additional effort was put into “full” reproducibility, meaning that a script can be executed several years after publication – as if it were frozen in time. To our knowledge, this practice is unique in data journalism. In 2019 alone, 13 new repositories were published on srfdata.github.io, making it the most active year since the publishing practice started in 2015.

Two examples of reproducible scripts published last year are linked to. First, the data analysis behind a story for the 2019 Swiss Federal Elections: “What worries the Swiss?” (link 2). The visualization behind the project won the Kantar Information is Beautiful Silver Award in “News & Current Affairs”. The published code on srfdata.github.io not only shows how the data was sourced and pre-processed, but also gives insights into the design process for the final product (with different exploratory analyses).

Second, also during the Federal Elections, we tapped into Facebook’s Ad Library API to monitor the ad activities of political parties and candidates (link 3). This was the first time the Ad Library was made available in Switzerland, and we wanted some sort of dashboard that would allow us – and other interested journalists – to continuously track how much money is spent on Facebook for political ads. Therefore we created an R script that would query the API every night and download the newest ad data for Switzerland. After that, another script would preprocess these data, filter them for political parties, automatically generate visualizations and publish them.
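The filtering step of such a nightly pipeline can be sketched in a few lines of base R. This is a minimal illustration, not SRF Data’s actual code: the column name `page_name`, the party abbreviations, and the helper `filter_political_ads()` are all assumptions made for the example.

```r
# Hypothetical sketch of the preprocessing step: keep only ads whose
# page name matches a known political party. Column and party names
# are illustrative, not the actual schema of the SRF Data pipeline.
filter_political_ads <- function(ads, parties) {
  pattern <- paste(parties, collapse = "|")
  ads[grepl(pattern, ads$page_name, ignore.case = TRUE), ]
}

# Toy data standing in for one night's download from the Ad Library API
ads <- data.frame(
  page_name   = c("SP Schweiz", "Local Bakery", "FDP.Die Liberalen"),
  spend_upper = c(5000, 100, 3000),
  stringsAsFactors = FALSE
)

political <- filter_political_ads(ads, c("SP", "FDP", "SVP", "CVP"))
nrow(political)  # 2: the bakery ad is dropped
```

In the real pipeline, the input would of course come from the Ad Library API download of the previous night rather than a hand-built data frame.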
We mostly use R & RMarkdown, Git and GitHub Pages. Most of the R work on srfdata.github.io is based on the reproducible R template by Timo Grossenbacher, which is also freely available (link 4). The template documentation also explains how to easily construct a GitHub Page like ours.
That template allows us to publish code and data as smoothly and efficiently as possible, keeping the resource footprint of reproducibility low (which we think is often a showstopper, as it takes too much time for many news organizations). Second, the template helps in “freezing” the software packages used for the scripts, thus making the scripts reproducible even years after initial publication and in a different software environment (e.g. a different OS). This practice is, to our knowledge, pretty unique in reproducible data journalism; see below for further explanation. Another innovative aspect of the template is that the RMarkdown code can be deployed and published on a GitHub Page by merely running a small bash script.
What was the hardest part of this project?
While we made our reproducibility publication process as efficient as possible, maintaining and updating our template took, and still takes, quite some time. This is also one of the reasons we decided to make the template itself public, as we are a publicly financed broadcaster. The biggest challenge remains ensuring actual, full reproducibility, given that software (R packages, for example) and software environments (OSes, third-party libraries) change rapidly. After a few years, we discovered that this churn could actually render our scripts useless (or worse: they would still run but produce different, even wrong, results). One such case is described in a blog post (link 5): we executed a two-year-old script, and only thanks to Git did we realize that some of the output data were transformed differently. This was due to an updated R package. Therefore, we invested a lot of time in making the software environment itself reproducible (through the so-called “checkpoint” mechanism), which resulted in the aforementioned template system (link 4).
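The “checkpoint” mechanism boils down to a few lines at the top of an analysis script, as in this minimal sketch. It assumes the `checkpoint` R package is installed; the snapshot date is illustrative, and note that this approach historically relied on Microsoft’s MRAN snapshot service, so a script written this way may need a different snapshot backend to run today.

```r
# At the top of an analysis script: pin all CRAN packages to a fixed
# snapshot date, so re-running the script years later resolves the
# same package versions instead of whatever is current on CRAN.
library(checkpoint)
checkpoint("2019-03-01")  # illustrative date, not an actual project date

# Any library() calls below now load packages as they existed on
# the snapshot date, shielding the analysis from breaking updates.
library(dplyr)
```

The design choice here is to freeze the package environment per project and per date, rather than vendoring package sources into each repository.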
What can others learn from this project?