Over several years, we exclusively gathered, analysed and finally reported on more than 400 billion euros in farm subsidies across all 27 EU member states, the biggest part of the European budget. We found nepotism, waste, environmental harm and animal abuse funded by the European Commission. In cooperation with WDR, NDR, Süddeutsche Zeitung, Correctiv, Der Standard, IrpiMedia, Reporter.lu, Reporters United, Expresso, Follow The Money and Gazeta Wyborcza, FragDenStaat analysed the data and published stories.
We initiated a discussion among the general public and policy professionals and published several articles in half a dozen European countries. The database is now open for everyone to use; we are holding seminars and presenting at conferences. Since it launched only about a month ago, we are confident the publication will lead to further inquiries, investigations and discussions in the future.
Handling such a big dataset, with ~22 million recipients and ~136 million payments, requires technology choices that are not yet common in data journalism. We built a complete ETL pipeline: extracting the source data, transforming it into a common format and loading it into a database. Such pipelines are standard in the data science industry but barely used within journalism. The production database of farmsubsidy.org is built on ClickHouse, a state-of-the-art column store that allows very fast lookups and aggregations over huge datasets like this one.

The biggest challenge during the investigation was to cross-match recipients against company registries (and other registries such as the EU lobby register or sanctions lists) and to determine the “type” of a recipient, i.e. whether it is a company or a natural person. For both tasks we used open-source technology involving machine learning models: the Python tool “nomenklatura” for deduplication and cross-matching, and more generally the “followthemoney” toolchain from the Aleph team at the Organized Crime and Corruption Reporting Project (OCCRP). We trained our own machine learning model with Python’s fastText library to predict the types of recipients.

The website farmsubsidy.org is built with Next.js to allow static pre-rendering. To geocode the recipients’ addresses, we developed a batch geocoding tool called “ftm-geocode” that works against a self-hosted instance of OpenStreetMap’s Nominatim, allowing fast geocoding of ~500k addresses without hitting the rate limits of the official Nominatim instance.
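The cross-matching step depends on reducing messy recipient names to comparable keys. The real pipeline uses nomenklatura’s far more sophisticated matchers; the following is only a minimal, self-contained sketch of the underlying fingerprint idea, with made-up example names:

```python
import string

# Legal-form tokens to drop before comparing names (a tiny illustrative set;
# real-world normalisation data is far more comprehensive).
LEGAL_FORMS = {"gmbh", "ag", "sa", "srl", "bv", "ltd", "kg", "ohg", "spa"}

def fingerprint(name: str) -> str:
    """Reduce a name to a comparable key: lowercase, strip punctuation,
    drop legal-form tokens, sort the remaining tokens."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in cleaned.split() if t not in LEGAL_FORMS]
    return " ".join(sorted(tokens))

def cross_match(recipients, registry):
    """Yield (recipient, registry entry) pairs whose fingerprints collide."""
    index = {}
    for company in registry:
        index.setdefault(fingerprint(company), []).append(company)
    for rec in recipients:
        for company in index.get(fingerprint(rec), []):
            yield rec, company

# Hypothetical example data:
subsidy_recipients = ["Acme Agrar G.m.b.H.", "Jane Doe"]
company_registry = ["ACME AGRAR GmbH", "Other Corp Ltd"]
matches = list(cross_match(subsidy_recipients, company_registry))
print(matches)  # [('Acme Agrar G.m.b.H.', 'ACME AGRAR GmbH')]
```

Exact fingerprint collisions like this only catch the easy cases; nomenklatura additionally scores fuzzy candidates, which matters for the long tail of spelling variants in 27 national datasets.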
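For the recipient-type model, fastText’s supervised mode expects one training example per line, prefixed with `__label__<class>`. A small sketch of preparing such a training file (the label names and example records are hypothetical; the actual training call, which needs the fasttext package, is shown as a comment):

```python
import os
import tempfile

# Hypothetical labelled examples: (recipient name, type).
LABELLED = [
    ("Acme Agrar GmbH", "company"),
    ("Jane Doe", "person"),
    ("Nordmilch AG", "company"),
]

def to_fasttext_line(name: str, label: str) -> str:
    # fastText supervised input format: "__label__<class> <text>"
    return f"__label__{label} {name}"

def write_training_file(examples, path: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        for name, label in examples:
            fh.write(to_fasttext_line(name, label) + "\n")

train_path = os.path.join(tempfile.gettempdir(), "recipients_train.txt")
write_training_file(LABELLED, train_path)

# Training and prediction (requires `pip install fasttext`, omitted here):
# import fasttext
# model = fasttext.train_supervised(input=train_path)
# model.predict("Jane Doe")  # returns the predicted __label__ with a score
```

In practice, a model like this needs many thousands of labelled names and careful handling of country-specific naming conventions before its predictions are trustworthy.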
Context about the project:
It is important to note that access to this huge amount of data was only possible because the data team involved has been on this project for almost a decade. The data had to be archived every year, because the EU and its member states – if they display it at all – only show the last two years of subsidies at any given time. Only through years-long dedication was it possible to gather the data for eight consecutive years. In addition, in a couple of countries this data was only available through individual FOIA requests. The project came together about ten years ago through a cross-border collaboration initiated at Dataharvest, and it is a project that no single, traditional publishing house could build or maintain on its own.
What can other journalists learn from this project?
If you know your way around specific EU regulations, bring data journalists and traditional reporters together, use data, FOIA and reporting tools wisely, and follow through over a couple of years, you can build investigations of great public value.