Data-driven Tokyo 2020 Olympics coverage

Country/area: Brazil

Organisation: Folha de São Paulo

Organisation size: Big

Publication date: 23/07/2021

Credit: Adalberto Leister Filho, Daniel Mariani, Diana Yukari, Fábio Takahashi, Leonardo Diegues


Adalberto Leister Filho Former sports journalist at Folha de São Paulo. Previously worked at other large media companies in Brazil such as O Globo and CNN.

Daniel Mariani Data journalist at Folha de São Paulo. Graduated in Biological Sciences from the University of São Paulo.

Diana Yukari Information designer at Deltafolha, Folha de São Paulo’s data journalism team.

Fábio Takahashi  Former editor of DeltaFolha, Folha de São Paulo’s data journalism team. Founder of Jeduca (Association of Education Journalists).

Leonardo Diegues Data journalist at Folha de São Paulo. Social scientist trained at the University of São Paulo.

Project description:

The project aimed to deliver data-driven coverage of Tokyo 2020. From historical medal comparisons to a close look at volleyball, athletics and swimming events over the years, we sought to tell complex stories by cross-referencing databases. Most data came from Olympedia, a website created by a group of Olympics historians and statisticians that contains precious details about every Summer and Winter edition. On-the-fly Tokyo 2020 data was collected from the IOC website and later paired with our historical database, enabling a set of stories that could not have existed without all the data gathered in one place.

Impact reached:

The project represents one of the largest Brazilian data journalism coverages of the Olympics. The decision to set up a data-driven coverage resulted in a new branch of Olympics coverage that could operate remotely and still surface interesting insights from live events. The project produced 13 articles and numerous insights for on-site reporters on how athletes were doing at Tokyo 2020 compared with past Games.

Techniques/technologies used:

The project used the Python programming language to scrape and clean data from the Olympedia website. Analyses were done in R, and visualizations were first drafted in R and then exported to Adobe Illustrator for further editing.
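As a rough illustration of the scraping step, the sketch below parses an event-results table into records using only the Python standard library. The markup and field names are illustrative, not Olympedia's actual HTML or the project's real code.

```python
from html.parser import HTMLParser

# Illustrative HTML, standing in for an Olympedia-style results table.
SAMPLE_HTML = """
<table>
  <tr><th>Pos</th><th>Athlete</th><th>NOC</th><th>Time</th></tr>
  <tr><td>1</td><td>Athlete A</td><td>BRA</td><td>9.80</td></tr>
  <tr><td>2</td><td>Athlete B</td><td>USA</td><td>9.89</td></tr>
</table>
"""

class ResultsTableParser(HTMLParser):
    """Collects each <tr> as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

parser = ResultsTableParser()
parser.feed(SAMPLE_HTML)
header, *results = parser.rows
# Turn each row into a dict keyed by the table header.
records = [dict(zip(header, row)) for row in results]
print(records[0])  # {'Pos': '1', 'Athlete': 'Athlete A', 'NOC': 'BRA', 'Time': '9.80'}
```

In practice, each scraped page would be fetched over HTTP and its records appended to the historical database before cleaning and analysis in R.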

What was the hardest part of this project?

Structuring historical data and pairing it with live data was the hardest part of the project. The effort to scrape historical data from Olympedia and other websites started a few months before the Games and consisted of retrieving editions, venues, sports, events and event results. All 32 past Summer editions were scraped, together with information about the sports they held and the venues in which they were held. Each sport comprised numerous events, and each event had a result. The project parsed almost 4,000 of Olympedia's pages, half of them event results. In the first week of the Olympics, the challenge was to retrieve live data in the same format as the historical data. We ended up pairing all of Tokyo's sports and events information, plus the results that matched our interests. Again, hundreds of files were consumed and structured into our database throughout the Olympics.
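The pairing step described above can be sketched as a join between historical and live records on a normalized (sport, event) key. The field names and values here are hypothetical, not the project's actual schema.

```python
def norm(s: str) -> str:
    """Normalize sport/event labels so differently cased or spaced
    variants from different sources map to the same key."""
    return " ".join(s.lower().split())

# Hypothetical historical records (pre-scraped from past editions).
historical = [
    {"edition": 2016, "sport": "Athletics", "event": "100 metres, Men", "winner_time": 9.81},
    {"edition": 2012, "sport": "Athletics", "event": "100 metres, Men", "winner_time": 9.63},
]

# Hypothetical live record, as collected during the Games.
live_tokyo = [
    {"sport": "athletics", "event": "100 Metres, Men", "winner_time": 9.80},
]

# Index history by normalized key for one lookup per live event.
history_by_event = {}
for rec in historical:
    key = (norm(rec["sport"]), norm(rec["event"]))
    history_by_event.setdefault(key, []).append(rec)

# Pair each live result with its historical counterparts.
for rec in live_tokyo:
    key = (norm(rec["sport"]), norm(rec["event"]))
    past = history_by_event.get(key, [])
    all_times = [p["winner_time"] for p in past] + [rec["winner_time"]]
    fastest = min(all_times)
    print(rec["event"], "- fastest winning time on record:", fastest)
```

Normalizing the key before joining is what lets records scraped from Olympedia line up with the differently formatted IOC live feed.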

What can others learn from this project?

The data collection process was one of the most important parts of this project, along with the analysis. This endeavour could not have been done without web scraping and general programming knowledge. Applying those skills to a large coverage like the Olympics can shed light on the potential of data journalism inside a newsroom and inspire other journalists to incorporate this skill set into their daily routine, enabling investigations otherwise considered unviable.

Project links: