A national lockdown was something new for everyone. Everyone felt that the country was changing. But what exactly was changing? We turned to data to figure it out.
What started as an idea to explore pollution levels in the country snowballed into a longform, data-driven article about how Portuguese society had changed in just 15 days. We ended up combining about 20 different data sources. Some showed us things we expected, such as the drop in pollution levels, but others were really surprising, like the fact that people seemed to be listening to happier songs.
This article was pretty successful among Público readers because the data was able to capture the zeitgeist of a moment when everyone was staying at home.
Because we combined so many data sources, everyone felt represented in the numbers. Some data reflected things everyone could see with the naked eye, like the fact that the number of car accidents dropped instantly with the lockdown or that people rushed to buy canned food and toilet paper. But other data seemed pretty intimate, like the fact that websites that help people quit smoking saw an increase in traffic or that people were watching more porn online.
Even though this is an educated guess, I believe that, in this case, data helped people understand the world around them in such weird times. Being able to see that other people, like them, were afraid to go to hospitals, were trying to quit smoking, had stopped googling wedding dresses, or had stopped streaming music that made them dance made everyone feel less of an outlier.
Because of the diversity of data sources used in this project, I had to be flexible. I used R as the main programming language for web scraping and data analysis. This involved several R packages, mostly from the tidyverse family. For some API data, such as Spotify’s API, I used packages that make it easier to connect to the API.
This ended up being a big project with multiple files that did different things, such as scraping or data cleaning. A final file was responsible for exporting all the data used into one big .json file.
Since I used chart.js as the main library to create the interactive charts, it was pretty useful to have the data in the .json file already shaped the way the charts needed it.
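To make the idea of a single chart-ready .json file concrete, here is a minimal sketch. It is written in Python purely for illustration and is not the project's actual R code; the function names, dataset names, and values are all hypothetical. It only assumes the usual chart.js data layout of `labels` plus a list of `datasets`.

```python
import json


def to_chartjs(rows, label_key, value_key, series_name):
    """Turn a list of row dicts into the labels/datasets layout chart.js reads."""
    return {
        "labels": [row[label_key] for row in rows],
        "datasets": [
            {"label": series_name, "data": [row[value_key] for row in rows]}
        ],
    }


def build_bundle(sources):
    """Combine several cleaned datasets into one dict keyed by chart name."""
    return {name: to_chartjs(**spec) for name, spec in sources.items()}


# Made-up example values, not figures from the article.
sources = {
    "pollution": {
        "rows": [
            {"day": "2020-03-01", "value": 30.0},
            {"day": "2020-03-02", "value": 21.5},
        ],
        "label_key": "day",
        "value_key": "value",
        "series_name": "Pollution level",
    },
    "job_posts": {
        "rows": [
            {"week": "W09", "count": 120},
            {"week": "W10", "count": 80},
        ],
        "label_key": "week",
        "value_key": "count",
        "series_name": "Job postings",
    },
}

bundle = build_bundle(sources)

# One JSON document holding every chart's data, ready to be written to disk.
bundle_json = json.dumps(bundle, ensure_ascii=False, indent=2)
```

With this layout, each chart on the page can pick its own entry from the file and pass it to chart.js without any reshaping in the browser.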
What was the hardest part of this project?
The hardest part of this work is also what makes it unique: the fact that I had to be creative in getting the data. Doing data journalism gets easier when there is official data published. But when there is not, you have to think about where such indicators could be available and how you could get that data. And then you have to actually get it.
One dataset that took us a long time to get was the pollution data. I wanted to let readers see the levels of pollution in their city. The data is published hourly on a Portuguese government website, so I built a scraper that visited the site every hour to collect it. But the past hourly data was only published as .jpeg charts, and we needed the data from at least January. Because OCR worked very poorly on them, we transcribed the charts into an Excel file manually.
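A scraper that runs every hour tends to see the same reading more than once, so the collected history needs to be keyed by station and hour. The sketch below shows that bookkeeping step only, in Python and with no network access; it is an illustration, not the scraper used in the project, and the field names, station name, and values are hypothetical.

```python
from datetime import datetime


def merge_readings(history, snapshot):
    """Return a copy of history with the snapshot's readings added.

    Readings are keyed by (station, hour); if a key already exists,
    the first value stored for it is kept.
    """
    merged = dict(history)
    for reading in snapshot:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(reading["time"]).replace(
            minute=0, second=0, microsecond=0
        )
        key = (reading["station"], hour.isoformat())
        merged.setdefault(key, reading["value"])
    return merged


# Two hypothetical scraper runs; the second repeats the 10:00 reading.
first_run = [
    {"station": "Lisboa", "time": "2020-03-20T10:05:00", "value": 18.0},
]
second_run = [
    {"station": "Lisboa", "time": "2020-03-20T10:40:00", "value": 18.0},
    {"station": "Lisboa", "time": "2020-03-20T11:02:00", "value": 16.5},
]

history = merge_readings({}, first_run)
history = merge_readings(history, second_run)
```

After both runs, `history` holds one entry per station and hour, no matter how many times the scraper visited the page within that hour.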
Another example of the need to be creative was the employment indicator. Official employment statistics are not published within 15 days, and we were not even sure they would immediately reflect the change. But with everything closed, it would surely be harder to get a job. So we turned to a popular job posting website and scraped the number of job postings (the company kindly gave us the data going back to the beginning of February). We found out not only that the number of job posts had dropped, but also that the kind of jobs being offered was different.
Even though some companies provided us with some data, this work demanded a lot of code for scraping and analyzing all the data.
What can others learn from this project?
I think there are two lessons to be taken from this work. The first one is something that most data journalists are already familiar with, but that is always good to keep in mind: be creative about your data sources. Most of the time there is no “official data” about something and you need to be creative. For example, as I was talking with Mapbox to ask for car traffic data, I remembered that civil protection authorities in Portugal log on their website every time firefighters, ambulances, or police cars are needed. Normally, as a reporter, I use that data to report on weather conditions or wildfires. But I remembered that ambulances and police are also required when there are car accidents.
The second lesson is the kind of advice an ambitious data journalist gets from an editor: sometimes you have to say “it’s enough”. This work started out being only about pollution. But suddenly new ideas appeared: ‘let’s check out the number of books sold, let’s check out the number of arrivals at the airports, let’s check out what people are tweeting about, let’s check out if people are listening to sadder songs’. And new ideas would have kept coming if I hadn’t set a deadline for the work. If I had had more time, I’m sure I would have found more than 20 data sources for it.