More than 60 datasets on the incidence of COVID-19 in Spain. DATADISTA, a media outlet specialized in investigative and data journalism, extracts, cleans and standardizes the official tables on the daily cumulative situation of the coronavirus disease (COVID-19, caused by SARS-CoV-2) in Spain and publishes them in an accessible and reusable format.
In Spain, the Ministry of Health and other government ministries publish official daily data on COVID-19, along with other tables of related information, as PDF files. Since March 12th, DATADISTA has been running ETL processes on different Spanish public sources. Because the published file format hinders transparency and reuse, and no database is available, we decided to start publishing this data in an accessible and reusable format on our GitHub account. We currently keep 29 datasets active, although the total number of published datasets comes to 65. DATADISTA datasets have become a worldwide reference on the impact of COVID-19 in Spain. They have been used in at least half a dozen scientific papers and by dozens of organizations, media outlets, analysts and epidemiologists around the world, from the Computational Biology and Complex Systems Research Group (Universitat Politècnica de Catalunya) to Spanish national media such as El País and El Confidencial. They are also republished on Harvard Dataverse.
At first we did this work with Tabula, OpenRefine and FineReader. It took a few hours a day, which made it unsustainable over time. We therefore wrote a set of scripts in Python using data science libraries. We run these scripts manually in Jupyter Notebook because the positions, sizes and columns of the tables often change from one report to the next, which forces us to modify the scripts daily so that they run correctly.
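To illustrate the kind of problem described above, here is a minimal sketch of matching columns by header name instead of by position, so that a reordered or renamed column does not break the extraction. It is not DATADISTA's actual code: the alias table, the function names and the sample figures are all hypothetical, and it uses only the Python standard library.

```python
import unicodedata

# Hypothetical mapping from header variants to canonical column names.
HEADER_ALIASES = {
    "ccaa": "region",
    "comunidad autonoma": "region",
    "casos": "cases",
    "casos totales": "cases",
    "fallecidos": "deaths",
}


def clean_header(text):
    """Lowercase, strip accents and collapse whitespace in a header cell."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def normalize_table(rows):
    """Turn rows (first row is the header) into dicts keyed by canonical names.

    Columns are matched by name, not position; unknown columns are dropped.
    """
    header = [HEADER_ALIASES.get(clean_header(cell)) for cell in rows[0]]
    records = []
    for row in rows[1:]:
        record = {name: value.strip() for name, value in zip(header, row) if name}
        records.append(record)
    return records


# Made-up figures: two reports with different column order and header spelling.
day_1 = [["CCAA", "Casos", "Fallecidos"], ["Madrid", "100", "5"]]
day_2 = [["Fallecidos", "Comunidad Autónoma", "Casos totales"], ["7", "Madrid", "150"]]

print(normalize_table(day_1))
print(normalize_table(day_2))
```

Both reports come out with the same keys, so later steps do not need to know how that day's table was laid out. A header that is not in the alias table is silently dropped here; in practice it would probably be safer to flag it for manual review.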
What was the hardest part of this project?
One of the main problems we have faced is the lack of standardization in the daily official reports, which have changed throughout the pandemic. The formats have also been modified, and the sizes of the tables differ from day to day. As a result, it has not been possible to implement fully automatic extraction and standardization. At present we manually adapt different Python scripts (in Jupyter Notebook) that extract, consolidate, analyze and normalize the data from the different tables and add them to the historical series, adjusting them on the fly within the workflow we have organized. Other scripts then automatically generate messages for social networks, produce graphics and upload the new CSV files to our GitHub account.
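The step of adding each day's table to a historical series can be sketched as follows. This is an assumption-laden illustration, not the project's real pipeline: the field names, the `(date, region)` key and the figures are invented, and only the standard library is used.

```python
import csv
import io

# Hypothetical column layout for one historical series.
FIELDS = ["date", "region", "cases"]


def merge_into_history(history, daily):
    """Return the history plus the daily rows, keyed by (date, region).

    A daily row replaces an existing row with the same key, so loading a
    corrected report overwrites the old value instead of duplicating it.
    """
    merged = {(row["date"], row["region"]): row for row in history}
    for row in daily:
        merged[(row["date"], row["region"])] = row
    return [merged[key] for key in sorted(merged)]


def to_csv(rows):
    """Serialize the series as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up figures: one existing row, one new day, one correction.
history = [{"date": "2020-03-12", "region": "Madrid", "cases": "100"}]
daily = [
    {"date": "2020-03-13", "region": "Madrid", "cases": "150"},
    {"date": "2020-03-12", "region": "Madrid", "cases": "110"},
]

series = merge_into_history(history, daily)
print(to_csv(series))
```

Keying the series on date and region makes the update idempotent: running the same notebook twice on the same report leaves the CSV unchanged, which matters when the scripts are executed by hand.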
What can others learn from this project?
We believe in journalism as a public service. Since the beginning of the pandemic in Spain, we have seen that the lack of accessible and standardized data prevented society from analyzing the incidence of COVID-19 in the country. We already do this work internally for our own articles, visualizations and animations, so we decided to release the datasets for free and open use. The impact we are most proud of is seeing our datasets serve scientific research.