Convoca Deep Data: The most complete data analysis platform on extractive industries in Peru
Organisation size: Small
Publication date: 23 Sep 2020
Credit: Milagros Salazar Herrera, Edwin Montesinos, Luis Enrique Pérez, Asís Loyola, Malena Maguiña, Diego López, Jimmy Salazar, Víctor Anaya, Walter Reyes, Antonio Manco, Francisco Rodríguez, Elvis Rivera, Jackeline Cárdenas Ipenza, Jimmy Pazos, Javier Pereira.
Convoca’s Deep Data platform not only gathered the most sensitive data on the extractive industry in Peru and made it available to the public, but it also published extensive investigative pieces on the extractive industry, including the industry’s dubious labor practices during the pandemic. Convoca’s service goes beyond making data available and revealing unique findings through deep dive investigative reporting: the platform also allows users to cross-reference their own data sets against the platform’s collection and offers extensive training for those who are not familiar with covering the extractive industry from an Amazon country that is crucial to the planet’s sustainability.
Convoca Deep Data is the most ambitious digital platform that Convoca.pe has ever developed. It gathers relevant information on Peru’s extractive industries, specifically mining and hydrocarbons.
To develop this platform we processed 2.4 million data points and analyzed information dating back a hundred years, along with indicators that measure, for instance, the level of non-compliance with environmental and labor regulations over the last 15 years.
Additionally, we processed 16 years of information that contains over 200 open files on environmental matters and a registry of penalties and sanctions imposed between 2014 and 2019.
Given the quality, quantity, and accessibility of the information, non-government institutions and civil society organizations in Peru have contacted Convoca seeking to establish alliances and partnerships. These relationships are essential both to secure the long-term sustainability of the project and to apply our methodology to other business sectors that register serious violations of labor and environmental protection norms affecting indigenous people.
To develop Deep Data, our team had to process highly sensitive data on the mining industry, one of the most powerful economic activities, which has given rise to several conflicts and disputes over natural resources in Peru.
For this platform our team created a traffic light that classifies large oil and mining companies by the severity of their environmental law violations (highly, frequently, moderately, or little infringing). To do so, we established a weighted value, using statistical methods, that reflects both the number of environmental violations and their level of severity.
Researchers, scholars, reporters, and specialized organizations are now able, for the first time, to find integrated information about the labor and environmental behavior of mining sites operating in Peru. In our country, open digital platforms containing this kind of processed data were not previously available; projects like Convoca Deep Data show that it’s possible to provide citizens with integrated and processed data on the mining and oil industries. To date, nearly 350 people have subscribed to the platform.
Thanks to this platform our reporters published 10 exclusive investigations between September and December of last year. This work also made it possible to disclose relevant information, including the number of mining workers with Covid-19 and the mining companies that benefited from government programs like Reactiva Peru, which offered economic incentives to companies from different business sectors.
To clean, organize and analyze the data we used several tools: Excel pivot tables, the statistical software R to calculate the non-compliance indicator, and SQL to cross-reference information across the various tables.
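As a rough illustration of what cross-referencing tables with SQL can look like, the sketch below counts environmental and labor sanctions per company. It uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, and rows are hypothetical and invented for this example, not Convoca's actual schema or data.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (company_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE environmental_sanctions (company_id INTEGER, severity TEXT);
CREATE TABLE labor_sanctions (company_id INTEGER, severity TEXT);
""")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [(1, "Company A"), (2, "Company B")],
)
conn.executemany(
    "INSERT INTO environmental_sanctions VALUES (?, ?)",
    [(1, "serious"), (1, "mild"), (2, "very serious")],
)
conn.executemany(
    "INSERT INTO labor_sanctions VALUES (?, ?)",
    [(1, "mild")],
)

# Correlated subqueries count each sanction type separately, so joining
# two sanction tables cannot multiply the row counts.
query = """
SELECT c.name,
       (SELECT COUNT(*) FROM environmental_sanctions e
         WHERE e.company_id = c.company_id) AS env_count,
       (SELECT COUNT(*) FROM labor_sanctions l
         WHERE l.company_id = c.company_id) AS labor_count
FROM companies c
ORDER BY c.company_id
"""
rows = conn.execute(query).fetchall()
for name, env_count, labor_count in rows:
    print(name, env_count, labor_count)
```

Counting each sanction table in its own subquery is one way to avoid double counting when a company appears in several tables; a real database would also need consistent company identifiers across sources.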
The processed information was complemented with the coordinates of the companies’ operations nationwide, which allowed us to georeference them and display them on interactive maps generated with MapBox.
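One common way to prepare coordinates for an interactive web map is to convert them to GeoJSON, a format that mapping tools such as MapBox can read. The sketch below is a minimal example under that assumption; the field names and the sample site are placeholders, not records from the platform.

```python
import json

def to_feature_collection(sites):
    """Convert a list of site dicts into a GeoJSON FeatureCollection."""
    features = []
    for site in sites:
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [site["lon"], site["lat"]],
            },
            "properties": {
                "company": site["company"],
                "operation": site["operation"],
            },
        })
    return {"type": "FeatureCollection", "features": features}

# Placeholder record, invented for illustration only.
sites = [
    {"company": "Company A", "operation": "Site 1", "lat": -12.05, "lon": -77.04},
]
geojson = to_feature_collection(sites)
print(json.dumps(geojson, indent=2))
```

The longitude-first ordering is easy to get wrong when source files list latitude first, so it is worth checking a few points on the map after conversion.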
In some cases, to obtain more information, we had to scrape government websites, and we cleaned and processed the results using R.
The platform itself was developed using version 9 of Angular in order to have more control over the information. The backend was developed using Node.js/Express.
What was the hardest part of this project?
Our biggest challenge was building the severity indicator, presented as a traffic light of alerts that measures the degree of non-compliance by companies. We built it from the number of violations and their level of severity. The traffic light classifies a large company in the extractive industries as highly infringing (red), frequently infringing (orange), moderately infringing (yellow) or little infringing (green).
To reach that traffic light, as a first step we organized, cleaned and analyzed more than 2,000 files for environmental violations opened against mining and oil companies from 2004 to January 2020, as well as sanctioning processes for labor rights violations.
Then we established an indicator taking into account the number of infractions confirmed by the government entities in Peru that oversee environmental and labor matters, as well as the level of severity of each sanction (mild, serious and very serious).
Subsequently, we established a weighted value, using statistical methods, that summarizes the number of violations and the level of severity. Finally, the list of more than 200 offending companies was divided into four quartile groups (25% each) to establish a ranking by level, which is expressed in the traffic light indicator that Convoca Deep Data shows.
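The steps above can be sketched in a few lines of code. This is only an illustration: the original calculation was done in R, the severity weights (1, 2, 3) are assumed values, and the company names and counts are invented. The source does not state the actual weights or how ties were handled, so this sketch simply ranks companies by weighted score and splits the ranking into four equal groups.

```python
# Assumed weights per severity level; the real weights are not given in the text.
SEVERITY_WEIGHTS = {"mild": 1, "serious": 2, "very serious": 3}

# Ordered from most to least infringing.
LEVELS = ["red", "orange", "yellow", "green"]

def weighted_score(counts):
    """Sum the confirmed infractions, weighted by their severity."""
    return sum(SEVERITY_WEIGHTS[severity] * n for severity, n in counts.items())

def traffic_light(companies):
    """Rank companies by weighted score and split the ranking into four groups."""
    scores = {name: weighted_score(counts) for name, counts in companies.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    # Position i in the ranking falls into group i * 4 // n (0 to 3).
    return {name: LEVELS[i * 4 // n] for i, name in enumerate(ranked)}

# Invented example data: confirmed infractions per severity level.
companies = {
    "Company A": {"mild": 0, "serious": 2, "very serious": 5},
    "Company B": {"mild": 3, "serious": 1, "very serious": 0},
    "Company C": {"mild": 1, "serious": 0, "very serious": 0},
    "Company D": {"mild": 2, "serious": 4, "very serious": 1},
}
lights = traffic_light(companies)
for name, color in lights.items():
    print(name, color)
```

With real data, companies with identical scores would land in an arbitrary order at a group boundary, so a production version would need an explicit tie-breaking rule.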
What can others learn from this project?
Journalists can learn that data can serve not only as a source for immediate publications but can also be integrated with other databases to build open data platforms such as Convoca Deep Data.
Convoca Deep Data reflects Convoca.pe’s extensive experience in practicing data journalism in the public interest. To build the different databases our team filed 50 public information requests with government entities, scraped government websites, used tools like OCR to convert PDFs, and analyzed hundreds of documents.
Our work is best characterized by the methodology we established to develop and codify the traffic light indicator, which describes the degree of non-compliance of each of the more than 200 extractive companies analyzed as part of the development of the platform.
Establishing a strong methodology and understanding the data involves interviewing human sources and experts from different fields, and going over hundreds of official documents.
Hence, data-driven projects need to be collaborative and interdisciplinary. Convoca Deep Data was made possible by a team of 14 people, including reporters, analysts and data scientists, technology developers, graphic designers, and audience editors.