TodosLosContratos.mx is a massive open data endeavor. After cleaning and standardizing 4 million Mexican government contracts, the team built a website that provided top-line numbers and easy ways into this large database. But they didn’t stop there. They published all the data in a well-designed search engine and a well-documented API. This project not only informed the general public but also empowered other journalists and researchers.
TodosLosContratos.mx (All the contracts) is a data journalism project that has compiled almost 4 million public contracts made between 2001 and 2019 by the Mexican Federal Government. The project combines journalistic reports that explain cases of corruption and bad practices in the Mexican procurement system with rankings based on algorithms that the team designed specifically for the Mexican context. The objective of the project is to promote accountability in the contracting process in Mexico, so we published all the data on the QuiénEsQuién.wiki platform and API, opened the methodology of the analysis algorithms and published a guide on how to investigate with this tool.
The publication of TodosLosContratos.mx, together with uploading the data to QuiénEsQuién.Wiki, has had three main impacts:
– Simplify the journalistic investigation of public contracts. Publishing the vast majority of the Mexican federal administration's contracts in a usable and reliable search engine has increased journalists' productivity. We have heard this from journalists at Mexican outlets such as Animal Político, Aristegui Noticias, El Universal, Cuestione and Proceso, among others; at local Mexican online newspapers such as Zona Docs, BI Noticias and Lado B; and at international outlets such as AJ+ in Spanish and El Faro (El Salvador).
– Promote the opening of public contracting data. Following our publication, three government agencies have approached us to ask how they can improve their data or upload new data to our platform. We have advised them on how to improve their open data strategies, and once they publish we will update QuiénEsQuién.Wiki and our algorithmic analysis in the TodosLosContratos 2020 edition.
– Increase citizens' knowledge of and interest in public procurement. As a result of the project, more people know how public contracting works and can easily consult it. Visits to the QuiénEsQuién.Wiki platform are increasing exponentially, and every week we receive messages from people with questions or clarifications about contracts or their participants.
A project of this complexity has several processes and key technologies:
– Data import: We have developed an importer and web scraper orchestrator based on the free software Apache NiFi. This modular software gives us a simple setup for reusable components such as the data cleaning module or the data update module.
– Platform and API: QuiénEsQuién.Wiki is built on MongoDB and Node.js. All the data is hosted in a Kubernetes cluster of MongoDB databases and then exposed through a public API, which is documented in both Spanish and English. A sample Node.js client is also available through the npm package registry. The website consumes the API and works on desktop, tablet and mobile devices.
– Algorithmic analysis: Our “groucho” engine analyzes open contracting data in the OCDS data standard. The engine is published under a GPL license, which makes it reusable and transparent. It is written in Node.js.
– Data analysis: To fine-tune the parameters of the algorithmic analysis engine, we combed through the data with the help of Kibana, an open source data visualization dashboard based on the Elasticsearch database engine, which helped us quickly recognize patterns and detect deviations.
– Data visualization: Our data is presented through custom-designed, web-based interactive graphs and maps, built primarily with the D3.js library.
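To make the idea of algorithmic analysis over OCDS data more concrete, here is a minimal sketch of one possible red-flag rule in Node.js. It is not the groucho engine's actual logic: the rule (share of direct awards per buyer), the function names `directAwardShare` and `flagBuyers`, the sample records and the 0.5 threshold are all assumptions made for illustration. Only the field names `ocid`, `buyer.name` and `tender.procurementMethod` (with the `direct` and `open` codes) come from the OCDS standard.

```javascript
// Illustrative red-flag rule over OCDS-shaped records.
// NOT the groucho engine's actual logic; data and threshold are made up.
const releases = [
  { ocid: "ocds-demo-001", buyer: { name: "Agency A" }, tender: { procurementMethod: "direct" } },
  { ocid: "ocds-demo-002", buyer: { name: "Agency A" }, tender: { procurementMethod: "open" } },
  { ocid: "ocds-demo-003", buyer: { name: "Agency A" }, tender: { procurementMethod: "direct" } },
  { ocid: "ocds-demo-004", buyer: { name: "Agency B" }, tender: { procurementMethod: "open" } },
];

// Compute, for each buyer, the share of contracting processes
// that used the "direct" procurement method.
function directAwardShare(records) {
  const totals = new Map();
  for (const r of records) {
    const buyer = r.buyer?.name ?? "(unknown buyer)";
    const entry = totals.get(buyer) ?? { direct: 0, all: 0 };
    entry.all += 1;
    if (r.tender?.procurementMethod === "direct") entry.direct += 1;
    totals.set(buyer, entry);
  }
  const shares = {};
  for (const [buyer, { direct, all }] of totals) shares[buyer] = direct / all;
  return shares;
}

// Return the buyers whose direct-award share exceeds the threshold
// (0.5 is an arbitrary example value, not a calibrated parameter).
function flagBuyers(records, threshold = 0.5) {
  return Object.entries(directAwardShare(records))
    .filter(([, share]) => share > threshold)
    .map(([buyer]) => buyer);
}

console.log(directAwardShare(releases));
console.log(flagBuyers(releases));
```

In this toy data set, Agency A awards two of its three processes directly and is flagged, while Agency B is not. A real ranking would combine many such indicators and tune their parameters against the data, which is the role the Kibana exploration described above played for the team.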
What was the hardest part of this project?
For this project, our interdisciplinary team took on the enormous task of automating the cleaning, compilation, transformation and analysis of 4 million contracts from 64 different tables of government-published data. The hardest parts were:
– Data cleaning: The Mexican government does not unify supplier names, nor does it provide a unique identifier for them. Our “lavadora empresarial” software (also GPL) detects duplicates with different spellings and other common errors, while avoiding merging different but similar companies. For example, the page for Televisa in QuienEsQuien.wiki shows all 23 different spellings of its name across 535 contracts.
– Data transformation and compilation: Contracts from all sources are converted to the OCDS standard using specific mappings for each source, which can be very intricate, with complex dependencies between field values. The 64 datasets are published in 5 different data structures, each requiring a different pipeline in our Apache NiFi setup. These databases contain repeated contracts and several entries for the same contracting process, which can only be compiled after they are transformed to the OCDS standard.
– Data analysis in an interdisciplinary team: Creating work tools that journalists, programmers and analysts can all use took several months and several long meetings, until we reached agreement on the best way to capture specific malpractices in contracts and on why we could or couldn’t perform specific evaluations with the available data.
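The supplier-name problem described above can be illustrated with a small Node.js sketch. It is not the “lavadora empresarial” algorithm: the function names `normalizeName` and `groupSpellings`, the short list of legal-form suffixes and the fictional company names are all assumptions made for the example. The sketch reduces each spelling to a normalized key (no accents, no punctuation, no legal suffix) and groups the spellings that share a key.

```javascript
// Illustrative only: NOT the "lavadora empresarial" algorithm.
// Legal-form suffixes to strip; this short list is an assumption.
const LEGAL_SUFFIXES = ["sab de cv", "sa de cv", "s de rl de cv", "sc", "ac"];

// Reduce a supplier name to a comparison key.
function normalizeName(name) {
  let key = name
    .normalize("NFD")                // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, "")      // drop punctuation such as "." and ","
    .replace(/\s+/g, " ")            // collapse repeated spaces
    .trim();
  for (const suffix of LEGAL_SUFFIXES) {
    if (key.endsWith(" " + suffix)) {
      key = key.slice(0, -suffix.length).trim();
      break;
    }
  }
  return key;
}

// Group the raw spellings that normalize to the same key.
function groupSpellings(names) {
  const groups = new Map();
  for (const name of names) {
    const key = normalizeName(name);
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(name);
  }
  return groups;
}

// Fictional supplier names, for illustration only.
const rawNames = [
  "Papelería Ejemplo, S.A. de C.V.",
  "PAPELERIA EJEMPLO SA DE CV",
  "Papeleria  Ejemplo S.A. de C.V.",
  "Papelería Ejemplar, S.A. de C.V.",
];

for (const [key, spellings] of groupSpellings(rawNames)) {
  console.log(key, "<-", [...spellings]);
}
```

Here the first three spellings collapse into one group, while the similar but distinct "Papelería Ejemplar" stays separate. Exact key matching like this misses typos and abbreviations, so a production tool would presumably need fuzzy matching plus safeguards against merging different companies, which is the trade-off the team describes.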
What can others learn from this project?
Sharing the lessons we have learned, and encouraging others to emulate this kind of project, is one of the main goals of the project. As we have said, all of our projects are based on free software solutions, our own code is published under GPL licenses, and all of our data and methodologies are published under CC-BY licenses. All our reports properly cite their sources. We have also documented the usage of our tools in Spanish and English, making everything we’ve done entirely reusable. We think the main takeaway is that it is possible to measure corruption based on public contracting data, and we are starting to see the possibility of one day no longer relying on corruption perception surveys. Having a team that is committed to making bold assumptions and running deep, data-based journalistic analysis was a key asset in accomplishing our impact goals and in highlighting our organization as one of the most advanced in the Latin American region.