CruzaGrafos is a free-software graph tool for cross-checking data and carrying out advanced data investigations. It visualizes relationships as graphs, connecting different pieces of information in a kind of web. In its current phase it already holds more than 70 million records, covering all companies in Brazil and Brazilian politicians since 2014. The project was created by Abraji (Brazilian Association of Investigative Journalism) and Brasil.IO (a Brazilian open data hub). Our intention is to transform these huge, difficult-to-access databases into visualizations of power relationships that people who are not data science experts can read.
It is the result of more than a year of work by the Abraji team and the programmer and transparency activist Álvaro Justen, from Brasil.IO, with the support of the Google News Initiative.
The authors of the project believe that knowing and understanding large public databases is one of the ways to improve investigative journalism, especially when the data are cross-referenced, put in context and checked.
Many of the databases used originate from the work of Claudio Weber Abramo, a transparency activist and pioneer of data journalism in Brazil, who died in August 2018. He founded the non-profit organization Dados.org together with the journalist and former president of Abraji José Roberto de Toledo. Abramo’s family kindly gave Abraji the databases he built so that some of his work could continue.
Building on the experience of Abramo and Toledo, the teams of Abraji and Brasil.IO created a project to explore large Brazilian databases, using graphs to make the information easier to access.
As part of this initiative, last year we ran a pilot course in data journalism and compliance techniques, using CruzaGrafos and other databases of public interest, for 80 journalists from all regions of the country.
This year we created an investigative journalism newsletter derived from the project, Investigadora, which already had more than 700 subscribers at the end of January. Each week it aims to show investigative journalism techniques using CruzaGrafos and to present recent evidence-based journalistic investigations in Brazil. An online training program was also created in January for newsrooms, freelancers, students and third-sector organizations focused on open data and transparency. The intention is to show the potential of CruzaGrafos and the possibilities data journalism offers for improving journalistic work.
On the technology side, the work included:
– Data processing: partners of Brazilian companies, Brazilian CNPJs (the unique company identification code in Brazil), corporate activities by CNPJ, political candidacies, political donations and health contracts, among other main databases to be selected for launch and over the course of 2021
– The “Expand neighboring nodes” and “Expand neighboring nodes by up to 2 degrees” features were implemented, letting you quickly expand the graph of connections between people and companies and see the nearby degrees of connection
– The “Save graph” feature was built, which will be very useful during testing, not only to make life easier for the person testing but also to help us debug when errors occur
– We built a “path between objects” feature, which calculates the shortest path between two people or companies and shows it in the graph
– We added a feature that was not initially planned but that, judging by our internal tests, will help usability a lot: browsing the history of the objects (people and companies) searched
All the earlier work of scraping the databases and doing exploratory data analysis with Python, SQL and Metabase was also very important for understanding the information, cleaning it and preparing it for use in production.
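To make the neighbor expansion and “path between objects” features more concrete, here is a minimal sketch of how both can be computed with a breadth-first search. It is only an illustration under stated assumptions: the toy graph, the node names and the functions `expand_neighbors` and `shortest_path` are hypothetical, and the real tool queries a graph database rather than an in-memory dictionary.

```python
from collections import deque

# Hypothetical toy graph: people and companies are nodes,
# partnership links are undirected edges.
GRAPH = {
    "Person A": {"Company X"},
    "Company X": {"Person A", "Person B"},
    "Person B": {"Company X", "Company Y"},
    "Company Y": {"Person B", "Person C"},
    "Person C": {"Company Y"},
}


def expand_neighbors(graph, start, max_degree=1):
    """Return {node: degree} for every node within max_degree hops of start."""
    degrees = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if degrees[node] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in graph.get(node, ()):
            if neighbor not in degrees:
                degrees[neighbor] = degrees[node] + 1
                queue.append(neighbor)
    del degrees[start]  # the starting node is not its own neighbor
    return degrees


def shortest_path(graph, source, target):
    """Return one shortest path from source to target as a list, or None."""
    previous = {source: None}  # node -> node it was reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # the two objects are not connected
```

Calling `expand_neighbors(GRAPH, "Person A", max_degree=2)` returns the company and the partner reachable within two hops, and `shortest_path(GRAPH, "Person A", "Person C")` returns the chain of people and companies linking the two. Breadth-first search fits here because every link counts the same, so the first path found is also a shortest one.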
What was the hardest part of this project?
The enormous size of some databases, with tens of millions of rows, forced us to find alternatives that keep the tool both fast and interactive. We also deal with names of people and companies and with ID numbers. These are sensitive data that require many verification controls, including care not to confuse homonyms.
These difficulties made innovations in the project code necessary. The main ones in CruzaGrafos were:
– (1) Entity centralizer: it provides search over names, companies, municipalities, hospitals, contracts, etc., and returns a universally unique identifier (UUID) for each entity. Entities can be companies, people, applications, etc. Without a UUID there are problems such as having to filter on several fields at the same time (which change from dataset to dataset), difficulty in searching across more than one dataset and difficulty in generating the ID offline for external queries, among others
– (2) Graph backend: this is the “heart” of the system, which connects to the entity centralizer for searching and manages queries to the graph database, the API, etc.
– (3) CruzaGrafos: this is the “glue” that holds everything together and the most specific part, which only matters to Abraji. It contains the integration with the authentication of the Abraji associate system, the scripts that feed the two systems above and the interface the user accesses.
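The idea behind the entity centralizer can be sketched as follows. This is not the project's code: it assumes, purely for illustration, that a UUID is derived deterministically (with Python's `uuid.uuid5`) from the entity type plus an identifying key, which is one way to generate the same ID offline. The namespace, the `normalize` rule and the `EntityIndex` class are all hypothetical.

```python
import unicodedata
import uuid

# Hypothetical namespace: any fixed UUID works, as long as every dataset shares it.
ENTITY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/entities")


def normalize(text):
    """Uppercase, strip accents and collapse whitespace so spelling variants match."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.upper().split())


def entity_uuid(entity_type, key):
    """Derive a deterministic UUID from the entity type and its identifying key."""
    return uuid.uuid5(ENTITY_NAMESPACE, f"{entity_type}:{normalize(key)}")


class EntityIndex:
    """Toy in-memory index: search by name across datasets and get UUIDs back."""

    def __init__(self):
        self._by_name = {}  # normalized name -> set of UUIDs

    def add(self, entity_type, key, name):
        """Register an entity and return its UUID."""
        uid = entity_uuid(entity_type, key)
        self._by_name.setdefault(normalize(name), set()).add(uid)
        return uid

    def search(self, name):
        """Return every UUID registered under this name (homonyms included)."""
        return self._by_name.get(normalize(name), set())
```

In this sketch two people with the same name but different identifying keys get different UUIDs, so a name search returns both and the homonyms stay distinguishable. Every other system then only needs to filter on one field, the UUID, whatever the dataset.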
The project is freely open to Abraji associates and to anyone who signs up as a user.
What can others learn from this project?
This project aims to take advantage of the potential of open data and transparency. Brazil ranks ninth in the Global Open Data Index, so there is a lot of information relevant to society, but most of the time the files are not machine-readable, are very large or come without technical details. This makes data journalism harder or restricts it to people already familiar with data science.
In this way, the team's work on the data will give thousands of journalists and researchers access to this information ready for use. And with graph visualization, it is possible to investigate power relations among millions of people, companies and politicians.
CruzaGrafos has all Brazilian companies registered with the Brazilian IRS (Receita Federal): 43.9 million companies. It also has information on political candidacies, covering 1.1 million people, from the Superior Electoral Court (TSE). Counting companies, their partners and electoral candidacies, the project now holds 70.7 million records.
This creates many possibilities for investigation, such as: searching for all companies in which a politician or candidate for public office is a partner or administrator; seeing who the other partners in those companies are; checking the proximity network of those partners, that is, which other companies they are partners in and who the other partners there are, at different degrees of proximity; and many others.
Throughout 2021, the project will also continue to update the Receita Federal (Brazilian IRS) and TSE databases and add others of public interest on the environment, public contracts, electoral campaign donations and health, always with the aim of allowing cross-referencing through identification keys that can be viewed as graphs. Our newsletter and training sessions throughout the year are also intended to spread the project, its information and its investigation techniques.
youtu.be/ITbbkZlqNGs (tutorial with English and Spanish subtitles)