Salud, Dinero y Corrupción is a data journalism project led by PODER showing that the Mexican Social Security Institute (IMSS) was buying drugs at inflated prices while hospitals faced shortages of vital supplies.
It covers the whole data journalism process: web scraping, cleaning and analysis, data science, collaborative journalistic research among six media outlets, and mass publication of the data for reuse by the audience.
Salud, Dinero y Corrupción has had, and continues to have, a significant impact on multiple actors involved in the purchase of medicines in Mexico. By group:
– Mexican Social Security Institute: on the day of publication it issued a statement acknowledging the project’s information and announcing internal investigations. The statement can be found on the project’s web page.
– Government and opposition: Because the project spans several administrations, the information was taken up by several political parties and came up both in the President’s morning press conference and in questions during the Senate’s oversight session.
– UNOPS: Following the publication, the UN procurement agency has twice updated its open procurement transparency portal to facilitate access to drug data.
– Patients and civil society: The project was key in demonstrating that the consolidated purchasing system did not work, and it changed both the discourse and the advocacy strategies of organizations.
Considering the complexity of the project, we used different techniques and technologies in each step. Reconstructing the process:
Data collection: We built a web scraper that first collected the more than two million valid addresses on the portal compras.imss.gob.mx and then extracted all the data from them. The code was written in Node.js and, to increase speed, was run in parallel on several servers.
Data cleaning and analysis: This was the most complex step and the one where we invested the most time. The main tool was Elastic’s Kibana, complemented with R, Python, LibreOffice Calc, Google Sheets, OCR processes and plain text editors, according to needs and skills.
Homepage: The page was built with Webflow, with graphics in SVG and animations in GIF.
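The two-phase data collection described above (first gather the valid addresses, then fetch and parse each one in parallel) can be sketched roughly as follows. This is only an illustration, not the project's code: the original scraper was written in Node.js and spread across several servers, while this sketch uses Python threads on a single machine. The names `fetch_html` and `scrape_in_parallel`, the record fields, and the caller-supplied `parse` function are all assumptions made for the example.

```python
from __future__ import annotations

import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def fetch_html(url: str) -> str:
    """Download one page as text (hypothetical default fetcher)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_in_parallel(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], dict],
    workers: int = 8,
) -> list[dict]:
    """Phase 2: fetch and parse every address collected in phase 1."""

    def fetch_and_parse(url: str) -> dict:
        try:
            record = parse(fetch(url))
            record["url"] = url
            return record
        except Exception as exc:
            # Keep going on failure and record the error so the URL can be retried.
            return {"url": url, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch_and_parse, urls))
```

A call would look like `scrape_in_parallel(valid_urls, fetch_html, parse_purchase_page)`, where `valid_urls` comes from the first phase and `parse_purchase_page` is a hypothetical function that turns one page into a dictionary. Passing `fetch` and `parse` as arguments keeps the parallel loop independent of the portal's page layout, which the source does not describe.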
What was the hardest part of this project?
The main challenge of this project was dealing with more than two million medicine purchases and converting this data into usable information. We would highlight three key processes:
Data cleaning: While the medication code was constant, the name contained small variations and the description was inconsistent. To systematize this information we had to obtain Mexico’s Basic List of Medicines and Medical Supplies (CBMEI). Since this was medical information, we had to be very careful.
Categorization: IMSS purchased 26,194 different products. Finding a way to group them so that the analysis would make sense was a huge challenge. In the end, thanks to the CBMEI and a detailed understanding of how the items are coded, we classified them into medical supplies, consumables, clothing and fabrics, medicines and vaccines, and furniture.
Cost overrun methodology: The cost overrun was at the heart of this research, so the calculation had to be precise and indisputable. We went back and forth over more complex options, but the simplicity of averaging and subtraction proved effective and understandable.
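To make the "averaging and subtraction" idea concrete, here is a minimal sketch of one possible reading of it: average the unit price paid for each medication code, then subtract that average from the price of each purchase. The exact formula used in the project is not given here, so several choices are assumptions: the field names (`code`, `unit_price`, `quantity`), the use of a simple unweighted average, counting only purchases above the average, and multiplying the difference by the quantity bought. The sample figures are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def cost_overruns(purchases):
    """Add the average unit price and the estimated overrun to each purchase."""
    # Step 1 (averaging): mean unit price per medication code.
    prices_by_code = defaultdict(list)
    for purchase in purchases:
        prices_by_code[purchase["code"]].append(purchase["unit_price"])
    average = {code: mean(prices) for code, prices in prices_by_code.items()}

    # Step 2 (subtraction): price paid minus the average for that code.
    results = []
    for purchase in purchases:
        difference = purchase["unit_price"] - average[purchase["code"]]
        # Assumption: purchases at or below the average count as zero overrun.
        overrun = max(difference, 0.0) * purchase["quantity"]
        results.append({**purchase, "avg_price": average[purchase["code"]], "overrun": overrun})
    return results


# Invented sample data, not figures from the investigation.
sample = [
    {"code": "A", "unit_price": 10.0, "quantity": 100},
    {"code": "A", "unit_price": 10.0, "quantity": 50},
    {"code": "A", "unit_price": 16.0, "quantity": 5},
]
report = cost_overruns(sample)
```

In this sample the average price for code `A` is 12.0, so only the third purchase is flagged, with an overrun of (16.0 - 12.0) × 5 = 20.0. Grouping by the medication code rather than the name follows the observation above that the code was the only constant field.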
What can others learn from this project?
First, this is a collaborative project led by PODER. From the start we understood that the other media outlets would not be able to use the analysis tools and processes on their own, so we designed training processes that allowed them to carry out their own investigations. We continue to train other journalists in Mexico in the use of our internal tools.
As for the data journalism community, we believe there are several lessons we can contribute:
A methodology for calculating drug and medical supply cost overruns that has been tested and accepted de facto by the largest Social Security Institute in Latin America.
A data journalism process that includes journalists, programmers, designers and illustrators, working closely with civil society and patients.
A willingness to share data and create collaborative projects, as well as consistency in releasing the data we use on the QuiénEsQuién.Wiki platform and publishing all our code on GitHub.