Sounds of the stadium: Anatomy of a World Cup chant combined data visualisation and analysis with audio to look at the linguistic and musical traits of spectator chants from countries in the World Cup. The interactive provided the means to explore individual chants and to compare one country’s chant with another’s.
The project allowed our team to combine data visualisations and audio for the first time, while experimenting with text analysis, natural language processing, and machine learning to reveal insights into the linguistic and musical relationships between spectator chants. It formed part of a package of World Cup interactives designed to offer enough engagement and opportunities for exploration that readers would return multiple times.
The interactive took a simple question, “What makes a good fan chant?”, and used the features of each chant to explore this topic in an engaging and visual way.
The project can be thought of in terms of two processes: data analysis and data visualisation.
1. Data processing and analysis: Most of the heavy lifting occurred in the data analysis stage of the project. The music for each chant was transcribed manually, while the text for the lyrics was already available. Several natural language processing tools, all implemented in Python, were used to process the text. The first step was text cleaning: we used Python’s Natural Language Toolkit and regex to remove punctuation, unicode characters, and stopwords (words that occur at high frequencies and don’t contribute to the overall meaning), and to reduce words to their base forms. The cleaned text was then transformed into vectors using word embeddings, in this case TF-IDF. The dimensionality of these vectors was reduced using UMAP, and the results were passed to a clustering algorithm (HDBSCAN) to find chants that could be grouped together. Several metrics, including hapax scores (a measure of how unique the language in a chant is) and parts-of-speech density, were used to get a general sense of complexity. For the musical data, after the chants were transcribed and converted to MIDI, pitch and rhythmic data were extracted in Python. Lyrics were then associated with each note, and the rhythm was normalised by the longest note duration for each chant.
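The cleaning and vectorisation steps can be sketched roughly as follows. This is a stdlib-only illustration, not the project’s actual code: the project used NLTK’s stopword lists and standard TF-IDF tooling, whereas here the stopword set is a tiny stand-in and TF-IDF is computed by hand (the subsequent UMAP reduction and HDBSCAN clustering steps are not shown).

```python
import math
import re
from collections import Counter

# Tiny stand-in stopword set; the project used NLTK's full stopword list.
STOPWORDS = {"the", "a", "and", "we", "are", "is", "of", "to"}

def clean(text: str) -> list[str]:
    """Lowercase, strip punctuation and non-letter characters, drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOPWORDS]

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Minimal TF-IDF: term frequency scaled by inverse document frequency."""
    n = len(docs)
    # Document frequency: in how many chants does each word appear?
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            w: (count / len(doc)) * math.log(n / df[w])
            for w, count in tf.items()
        })
    return vectors

# Hypothetical chant lyrics, purely for illustration.
chants = [
    "Ole ole ole, we are the champions",
    "Allez les bleus, allez allez",
    "We are the army, the champions army",
]
vectors = tf_idf([clean(c) for c in chants])
```

Words concentrated in one chant (like “ole” above) receive high weights, while words spread across many chants are down-weighted, which is what lets the later clustering step group chants by their distinctive vocabulary.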
2. Data visualisation: The interactive was built using our in-house Vue.js template, along with D3. For the data visualisation, we wanted to encode music in a way that didn’t require readers to know how to read musical notation. These visualisations are less accurate than traditional notation, but are designed to convey the essence of each chant.
Context about the project:
The project required a substantial amount of data processing, which took up the majority of the production time. To create a large enough corpus of text for the clustering algorithm to be effective, we collected, cleaned, and analysed many more chants than were included in the story. This gave the algorithm more training data on which to generate clusters. The algorithm also doesn’t explain why chants were grouped together, so we had to go through each chant and map the word embeddings to prominent words in each cluster to work out why chants had been grouped as they were. These findings were then used to explain the similarities between chants in the story.
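One way to surface the words that characterise each cluster, sketched below under the assumption that you already have per-chant TF-IDF vectors and HDBSCAN cluster labels, is to sum each word’s weight within a cluster and keep the top few. The data and function name here are hypothetical, not the project’s code.

```python
from collections import defaultdict

def top_terms(vectors, labels, k=3):
    """Aggregate per-chant TF-IDF weights by cluster label and
    return the k highest-weighted words for each cluster."""
    totals = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        if label == -1:          # HDBSCAN marks noise points as -1
            continue
        for word, weight in vec.items():
            totals[label][word] += weight
    return {
        label: sorted(words, key=words.get, reverse=True)[:k]
        for label, words in totals.items()
    }

# Hypothetical output of the vectorisation and clustering steps.
vectors = [
    {"ole": 0.9, "champions": 0.2},
    {"allez": 0.8, "bleus": 0.5},
    {"ole": 0.7, "army": 0.4},
]
labels = [0, 1, 0]               # one cluster assignment per chant
clusters = top_terms(vectors, labels, k=2)
```

Inspecting the top-weighted words per cluster is a manual but effective way to label otherwise opaque clustering output.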
In addition, we experimented with various complexity metrics for each of the chants, calculating the hapax legomenon count and parts-of-speech density, among others. Eventually, we went with a complexity metric that combined the length of a chant with the number of unique words, which we thought would produce an index that was easier for readers to understand.
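These metrics are straightforward to compute. The sketch below shows a hapax ratio and one hypothetical way of combining length with unique-word count; the article doesn’t publish the exact formula used, so the `complexity` function is only an illustration of blending the two signals.

```python
import math
from collections import Counter

def hapax_ratio(words: list[str]) -> float:
    """Share of words that occur exactly once (hapax legomena)."""
    counts = Counter(words)
    return sum(1 for c in counts.values() if c == 1) / len(words)

def complexity(words: list[str]) -> float:
    """Hypothetical combined index: unique-word count scaled by a
    length term. Not the project's actual formula."""
    return len(set(words)) * math.log(1 + len(words))

# Hypothetical lyrics: one repetitive chant, one varied chant.
repetitive = ["ole"] * 12
varied = ["never", "walk", "alone", "through", "storm", "gold"]
```

A chant that repeats one word scores low on both measures, while a chant with varied vocabulary scores high, which matches the intuition the index is meant to convey.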
In addition to the linguistic features, the audio data also needed to be processed. Each chant has its associated audio; the music from the audio was manually transcribed, converted to MIDI, and then numerical data was extracted from the MIDI files so we could create a visual representation of each chant. The transcription process was one of the most time-consuming parts of the development of this piece. We did try to automate it using deep learning to convert audio directly to MIDI, but this failed because of the low fidelity of the audio files and the background noise in them, even after most of the noise had been removed from the recordings.
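The rhythm normalisation described above amounts to dividing every note’s duration by the longest duration in the chant. The sketch below assumes the MIDI parsing has already produced a list of (pitch, duration) pairs; the note values are invented for illustration, and the project’s actual MIDI-parsing code isn’t shown here.

```python
# Stand-in note list: (MIDI pitch number, duration in beats).
# In the project these values were extracted from the transcribed MIDI files.
notes = [(60, 1.0), (62, 0.5), (64, 0.5), (60, 2.0)]

# Normalise every duration by the chant's longest note, putting all
# rhythms on a 0-1 scale so chants at different tempos are comparable.
longest = max(duration for _, duration in notes)
normalised = [(pitch, duration / longest) for pitch, duration in notes]
```

After this step the longest note in every chant has duration 1.0, which makes rhythmic patterns directly comparable across chants when drawn side by side.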
What can other journalists learn from this project?
We approached this project with the idea of using fun animations and anecdotes to draw readers into more serious data visualisations. We took the story of the famous “Ole ole ole” chant and animated it to pique interest, only then moving on to the data. Before settling on this approach we had experimented with data-first introductions to the piece, none of which really worked. Given the light-hearted nature of the piece, easing readers into the data story seemed to work.
Additionally, there is a lot of data that can be extracted from audio if you know how to process it. One valuable lesson we learnt from developing this interactive is that experimentation is key, and that with a little effort the processes behind techniques like machine learning can be understood and verified. Even if the results of your initial experiments aren’t accurate, they provide a base from which to think about similar problems in future.