Data-driven stories to monitor the Government’s strategy in response to the pandemic
Organisation: LA NACION
Organisation size: Big
Publication date: 19 May 2020
Credit: Fabiola Czubaj, Gabriela Bouret, Gabriel Alonso Quiroga, Pablo Arellano, Mariana Trigo Viera, Nicolas Rivera, Delfina Arambillet.
Given the context of the pandemic and the weak culture of transparency in Argentina, we set out to monitor the Government’s sanitary strategy and to evaluate the quality of public data. This was a hard task, since several government sources coexist whose data do not agree with one another. Multiple pieces were produced from the analysis of that information, exposing contradictions between the data and the official discourse, which obliged the Government to make changes to its strategy. The lack of tests, the precarious epidemiological surveillance system and several problems in providing real-time answers were among the findings.
Based on our data analysis and journalistic production, we placed the discussion about data quality and accessibility on the public agenda. Moreover, we created pieces every day to monitor the sanitary decisions of the National Government. For example, the story “What are the most common symptoms of infected Argentinians?” caused a concrete change in the management of the pandemic: the Ministry of Health changed the definition of a suspected case and included four of the eight most frequent symptoms, so that more people could access a test and a subsequent diagnosis. We also noted inconsistencies between public data and official discourse, such as the differences in test volumes between the two national jurisdictions with the highest number of cases. Moreover, we found mistakes in the data provided by the national epidemiological surveillance system: entries are deleted from the public-access database, and there are inconsistencies in the changes of status of the same patient. This leads to a decrease in the number of positive cases reported to citizens and to international agencies.
In addition, all these pieces activated the demand for public information and opened a discussion about problems that would otherwise have gone unnoticed, since the information was not easily accessible to citizens.
We used Python, BigQuery, Excel and SQL for data processing and analysis. From the moment the Government began publishing the dataset of reported Covid cases, we created a Python program that scrapes the files and stores them in a database every day. Since June 2020, we have stored 180 files, with an average of 2.5 million entries each. Different SQL queries are then run every day to monitor certain patterns. Almost all of the analysis behind the journalistic pieces presented here followed this logic.
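The daily ingestion step can be sketched as follows. This is a minimal illustration, not the team’s actual code: the dataset URL, the column names and the local SQLite store are all hypothetical stand-ins for the real pipeline. Storing each row together with its snapshot date is what later makes day-to-day comparisons possible in SQL.

```python
import csv
import io
import sqlite3
import urllib.request
from datetime import date

# Hypothetical URL for the published cases file (illustrative only).
DATASET_URL = "https://example.gov.ar/covid19casos.csv"


def fetch_daily_file(url: str = DATASET_URL) -> str:
    """Download the day's published CSV as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def store_snapshot(csv_text: str, db_path: str = "covid.db") -> int:
    """Store every row tagged with the snapshot date, so that deleted
    entries and status changes can be detected by diffing snapshots.
    Column names below are illustrative assumptions."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases "
        "(snapshot_date TEXT, case_id TEXT, status TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (date.today().isoformat(), r["case_id"], r["status"])
        for r in reader
    ]
    conn.executemany("INSERT INTO cases VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return len(rows)
```

A scheduler (cron, for instance) would call `store_snapshot(fetch_daily_file())` once a day, accumulating one snapshot per file.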
In order to carry out a more complete monitoring of the data provided by the Government, we had to learn new technologies such as BigQuery, where we archived all the files in a single table to determine delays in the uploading of cases and the number of entries deleted each day. Two processes are automated: the scraping and upload of all files into that table, and the running of queries for their subsequent visualization.
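The monitoring queries follow the pattern sketched below, with SQLite standing in for BigQuery and illustrative data: comparing two consecutive snapshots of the same public table reveals entries that disappeared from the dataset overnight.

```python
import sqlite3

# Two consecutive snapshots of the same public table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cases (snapshot_date TEXT, case_id TEXT);
INSERT INTO cases VALUES
  ('2020-07-01', 'A'), ('2020-07-01', 'B'), ('2020-07-01', 'C'),
  ('2020-07-02', 'A'), ('2020-07-02', 'C');
""")

# Entries present yesterday but missing today: candidates for deletion.
deleted = conn.execute("""
    SELECT case_id FROM cases WHERE snapshot_date = '2020-07-01'
    EXCEPT
    SELECT case_id FROM cases WHERE snapshot_date = '2020-07-02'
""").fetchall()
print(deleted)  # [('B',)]
```

The same snapshot-diff idea, joined on the patient identifier, also surfaces inconsistent status changes for the same case.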
In addition, a more detailed preprocessing was done for the pieces about symptoms and laboratories, since the data obtained through public information access requests were duplicated, wrongly structured and badly spelled. We used OpenRefine for this step, together with a manual clean-up of the files. For the analysis, data were structured in an Excel matrix to determine the frequency of each symptom and the number of laboratories that processed samples each day. Tableau Public was used to visualize the data in all these pieces.
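The frequency count behind the symptoms story amounts to tallying cleaned symptom lists per case. A minimal sketch, with invented records standing in for the data obtained after deduplication and spelling fixes in OpenRefine:

```python
from collections import Counter

# Illustrative cleaned records: one semicolon-separated symptom list
# per case, as they might look after the OpenRefine clean-up.
records = [
    "fiebre; tos; odinofagia",
    "tos; cefalea",
    "fiebre; tos",
    "cefalea; anosmia",
]

# Tally how often each symptom appears across all cases.
counts = Counter(
    symptom.strip()
    for rec in records
    for symptom in rec.split(";")
)

# Rank symptoms by incidence: the basis of the "most common symptoms" piece.
for symptom, n in counts.most_common():
    print(symptom, n)
```

Ranking the resulting counts is what allowed comparing the most frequent symptoms against the Ministry’s suspected-case definition.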
What was the hardest part of this project?
For several months there was no open-format or real-time information about the pandemic, and the Government did not grant us access to it, so we filed more than 10 requests for access to public information (each with a minimum delay of 20 days) to obtain data that were vital for following the pandemic.
In every case, when the answers arrived we found further obstacles that delayed the process: the data provided had serious quality problems, being unstructured, duplicated, riddled with spelling mistakes and out of date.
Consequently, by the end of this process the data were visibly outdated, which obliged us to file a new access-to-information request in order to publish a story with a more current analysis.
We also had to train ourselves to acquire the skills needed to understand and analyze health data related to the pandemic. Building this expertise took time, and it also requires flexibility to translate the complexity of national figures into stories relevant to our readers.
What can others learn from this project?
One of the most important lessons of this project was going beyond the data a government provides: examining their quality and using them to monitor public discourses and policies that may conceal political interests at odds with public needs. We are also convinced that demanding public information and working with multiple sources is fundamental; we learned that if a government refuses to give us data, there are alternative sources from which unstructured official information can be collected and reconstructed, such as official social media accounts.
Moreover, we benefited from the cooperation and synergy of newspaper teams with different skills that used to work separately: for example, the Business Intelligence area and the data journalism team working with the health desk to process huge volumes of information.
The advance of the pandemic and the growing complexity and volume of the data challenged the team’s skills: people with little technical knowledge learned new technologies (such as SQL), while those who regularly work on data analysis faced new challenges in accelerating their processes to keep up with the flow of information.