The project aimed to deliver data-driven coverage of Tokyo 2020. From historical medal comparisons to a close look at volleyball, athletics and swimming events over the years, we sought to tell complex stories by cross-referencing databases. Most data came from Olympedia, a website created by a group of Olympics historians and statisticians that contains precious details about every Summer and Winter edition. On-the-fly Tokyo 2020 data was collected from the IOC website and later paired with our historical database, enabling a set of stories that could not have existed without all the data being gathered together.
The project represents one of the largest Brazilian data journalism efforts ever devoted to the Olympics. The decision to build a data-driven coverage resulted in a new branch of Olympics coverage that could operate remotely and still bring interesting insights from live events. The project produced 13 articles and numerous insights for on-site reporters on how athletes were doing in Tokyo 2020 compared to past Games.
The project used the Python programming language to scrape and clean data from the Olympedia website. Analyses were done in R, and visualizations were first drafted in R and then exported to Adobe Illustrator for further editing.
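The scraping step described above can be sketched in Python. This is a minimal, hypothetical illustration of turning an event-results table into structured records using only the standard library; the sample HTML, column names and values are invented for the example and do not reflect Olympedia's actual page layout:

```python
from html.parser import HTMLParser

# Hypothetical snippet shaped like an event-results table;
# the real pages and columns differ.
SAMPLE = """
<table>
<tr><th>Pos</th><th>Athlete</th><th>NOC</th></tr>
<tr><td>1</td><td>Athlete A</td><td>BRA</td></tr>
<tr><td>2</td><td>Athlete B</td><td>USA</td></tr>
</table>
"""

class ResultsTableParser(HTMLParser):
    """Collects each <tr> as a list of cell texts."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

def parse_results(html):
    """Turn a results table into one dict per row, keyed by the header."""
    parser = ResultsTableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

records = parse_results(SAMPLE)
print(records[0])  # {'Pos': '1', 'Athlete': 'Athlete A', 'NOC': 'BRA'}
```

In a real pipeline the HTML would come from HTTP requests to each event page, and the resulting records would be written to files or a database for later analysis in R.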
What was the hardest part of this project?
Structuring and pairing historical data with live data was the hardest part of the project. The effort to scrape historical data from Olympedia and other websites started a few months before the Games and consisted of retrieving data on editions, venues, sports, events and event results. All 32 past summer editions were scraped, together with information about the sports each edition held and the venues that hosted them. Each sport comprised numerous events, and each event had a result. The project parsed almost 4,000 of Olympedia’s pages, half of which were event results. In the first week of the Olympics, the challenge was to retrieve live data in the same format as the historical data. We ended up pairing all of Tokyo’s sports and events information, plus some results that matched our interests. Again, hundreds of files were consumed and structured into our database throughout the Olympics.
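The pairing step described above amounts to a key-based join between the two sources. The sketch below is a simplified illustration, assuming both sources have already been normalized into records with matching sport and event names; the field names and sample values are hypothetical, not the project's real schema (which lived in a database):

```python
# Illustrative records only: field names and values are invented.
historical = [
    {"edition": 2016, "sport": "Athletics", "event": "100m Men", "gold": "9.81"},
    {"edition": 2012, "sport": "Athletics", "event": "100m Men", "gold": "9.63"},
]
tokyo_2020 = [
    {"sport": "Athletics", "event": "100m Men", "gold": "9.80"},
]

def pair_with_history(live_rows, history_rows):
    """Attach all past results for the same (sport, event) to each live row."""
    by_event = {}
    for row in history_rows:
        by_event.setdefault((row["sport"], row["event"]), []).append(row)
    return [
        {**live, "history": by_event.get((live["sport"], live["event"]), [])}
        for live in live_rows
    ]

paired = pair_with_history(tokyo_2020, historical)
print(len(paired[0]["history"]))  # 2
```

Keying the join on (sport, event) only works once both sources use identical names for sports and events, which is why retrieving the live data "in the same format as the historical data" was the crux of the first week.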
What can others learn from this project?
The data collection process, along with the analysis, was one of the most important parts of this project. This endeavour could not have been done without web scraping and general programming knowledge. Applying those skills to a coverage as big as the Olympics can shed some light on the potential of data journalism inside a newsroom, and it can inspire other journalists to incorporate this skill set into their daily routine, enabling investigations otherwise considered unviable.