With the covid-19 pandemic, governments across the world increased their spending on medical equipment and services to fight the spread of the virus. Portugal was no exception. But how much was spent? And which companies profited the most from the pandemic? No one seemed to know. Using the Portuguese public tenders database and machine learning, we were able to give some answers. This investigation was done in collaboration with other European newsrooms, so we were also able to compare countries' expenses (and their lack of transparency).
This story made Público’s front page and was cited across other Portuguese media outlets.
The investigation, which resulted in three articles (published both online and in the print version of the newspaper) and a news app, had a strong impact because it shed light on the business side of the pandemic.
We also published a news application that let anyone look up their local hospital, local government, school or any other public institution. This mattered because it allowed people to scrutinise their local institutions.
Because this investigation was done in collaboration with other European journalists, we also found out that not every country has the same open data policies regarding covid-19 public tenders. Portugal was one of the few countries that decided to publish all the data, which made it important to highlight a good policy on public spending.
This was probably the longest and most ambitious data project I’ve ever worked on. The Portuguese government was publishing an Excel file with all the contracts, but I realised that a large number of covid-19 purchases were missing from it. So I built a scraper with about 90 covid-19 keywords that ran every day to get the data from the public tenders website, apply the necessary data transformations, collect information about the companies from other websites and put everything in a shared spreadsheet that everyone collaborating on the project could access. I used NLP with the tidytext package to find the most frequent words in the contract descriptions and identify further keywords to add to the scraper.
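The keyword-discovery step can be illustrated with a short sketch. It is written in Python rather than the R and tidytext used in the project, and the stopword list, function names and sample descriptions are all assumptions made for illustration. It counts the words in contract descriptions and lists the most frequent ones that are not already scraper keywords.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"de", "do", "da", "dos", "das", "para", "com", "por", "em", "e"}


def tokenize(text):
    """Lowercase the text and keep words longer than two characters."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def candidate_keywords(descriptions, known_keywords, top_n=10):
    """Return the most frequent words that are not yet scraper keywords."""
    counts = Counter()
    for description in descriptions:
        counts.update(tokenize(description))
    known = {k.lower() for k in known_keywords}
    fresh = [(word, n) for word, n in counts.most_common() if word not in known]
    return fresh[:top_n]


if __name__ == "__main__":
    sample = [
        "Aquisição de máscaras cirúrgicas",
        "Aquisição de máscaras FFP2",
        "Aquisição de ventiladores",
    ]
    for word, n in candidate_keywords(sample, ["máscaras", "ventiladores"]):
        print(word, n)
```

Generic words such as "aquisição" (purchase) will dominate a list like this, so in practice a person still has to read the candidates and pick the ones that are specific to the pandemic.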
We agreed with the international partners that it would be useful to classify the contracts by kind of purchase (ventilators, masks, etc.). Because most European countries were not publishing all their data, this was a manageable task for the other partners. Portugal, however, had about 16,000 contracts, so for us it suddenly became a very hard one. Using machine learning and the 5,000 contracts we had already classified, I built a model that relied on tf-idf and other fields from the database to do the classification. It was a very imperfect model (someone with more experience in ML would probably have done a better job), with a success rate of only 70%. So we checked the model's classifications every time it ran.
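To give an idea of how tf-idf classification works, here is a minimal sketch in Python using only the standard library. It is not the project's model: it uses only the description text, whereas the original also used other database fields, and it swaps in a simple nearest-centroid rule with cosine similarity. The category labels, sample descriptions and function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def fit(labelled):
    """labelled: list of (description, category) pairs already classified by hand.

    Returns the idf weights and one summed tf-idf vector per category.
    """
    docs = [tokenize(text) for text, _ in labelled]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: rarer terms get a higher weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    centroids = defaultdict(Counter)
    for doc, (_, label) in zip(docs, labelled):
        for term, tf in Counter(doc).items():
            centroids[label][term] += tf * idf[term]
    return idf, dict(centroids)


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(text, idf, centroids):
    """Return (best category, similarity score) for a new description."""
    counts = Counter(tokenize(text))
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = {label: cosine(vec, c) for label, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    training = [
        ("mascaras cirurgicas", "masks"),
        ("mascaras ffp2", "masks"),
        ("ventilador invasivo", "ventilators"),
        ("ventilador portatil", "ventilators"),
    ]
    idf, centroids = fit(training)
    print(predict("aquisicao de mascaras ffp2", idf, centroids))
```

The score returned alongside the label is useful for triage: descriptions with a low score share few words with any known category, so those are natural candidates to check by hand first.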
For the news app, I used Vue.js to build the website and the plumber R package to build the API behind it. All the other analyses were done in R.
What was the hardest part of this project?
Dealing with so much messy data. The public tenders database doesn’t have a proper API, so I had to use the export parameters in the URL to automate the downloads. But the returned data was messy, and cleaning it to match the data format agreed with the international partners was hard.
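The idea of driving an export through URL parameters can be sketched as follows, in Python. The base URL and the parameter names (`query`, `export`) are placeholders invented for this example, not the real ones used by the public tenders website; the sketch only shows how one export URL per keyword can be built safely, including keywords with accents and spaces.

```python
from urllib.parse import urlencode

# Placeholder endpoint: NOT the real public tenders URL.
BASE_URL = "https://example.org/contracts/search"


def export_urls(keywords, base_url=BASE_URL):
    """Build one export URL per keyword.

    The parameter names 'query' and 'export' are hypothetical.
    """
    urls = {}
    for keyword in keywords:
        params = urlencode({"query": keyword, "export": "csv"})
        urls[keyword] = f"{base_url}?{params}"
    return urls


if __name__ == "__main__":
    for keyword, url in export_urls(["covid-19", "máscaras cirúrgicas"]).items():
        print(keyword, "->", url)
```

Letting `urlencode` handle the escaping avoids broken requests when a keyword contains spaces or accented characters, which is common in Portuguese contract vocabulary.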
But the hardest challenge was reading 16,000 contracts. Even though the ML model helped a lot, its output still needed to be fact-checked. And because most of the contract descriptions were not explicit enough, we often had to check the PDF files on the public tenders database to understand what had been bought, whether it was related to the pandemic and what the unit price of the product was. That was the only way to find out, for example, how the price of FFP2 masks varied over time.
What can others learn from this project?
That machine learning can save your investigation. Without it, classifying everything would have taken us ages. Even though we had to fact-check the results provided by the model, M.A.R.A (the model is named after the intern who helped me classify the first 5,000 contracts) did the work for the easy-to-classify contracts, leaving us time to focus on the harder ones.
But it also showed me that you can’t blindly trust ML. Maybe my lack of experience with ML was part of the problem, but, for example, M.A.R.A kept classifying purchases made by Madeira’s regional government as construction-related ("madeira" means "wood" in Portuguese).