Dolly Setton

Entry type: Single project

Country/area: United Kingdom

Publishing organisation: The Economist

Organisation size: Big

Publication date: 2022-01-29

Language: English

Authors: Dolly Setton, Olivia Vane, Rosamund Pearce, Idrees Kahloon


Dolly Setton – Data journalist with The Economist. Previously a senior editor at Natural History magazine, and a reporter and editor with Forbes Inc. She holds an MA from Columbia University in quantitative methods, with a data science focus.
Olivia Vane – Interactive visual journalist with The Economist; previously Research Software Engineer at the British Library; holds PhD from the Royal College of Art.
Rosamund Pearce – Visual journalist with The Economist; previously produced multimedia for Carbon Brief; holds an MSc in Science Communication from Imperial College London.
Idrees Kahloon, who edited the story, is the Washington Bureau Chief of The Economist.

Project description:

Analysed weekly top 100 music hits across 70 countries over five years to track the evolution of musical tastes and linguistic preferences. Built scrapers and a web crawler to gather data; assessed musical kinship between countries using a clustering model and Principal Component Analysis. Determined that countries clustered into three groups: one in which the English language dominates; one where Spanish prevails; and a third cluster that mostly enjoys songs in local tongues. The analysis revealed that the hegemony of English is in decline. An innovative interactive matrix visualization allows readers to grasp international patterns and hear top songs from around the world.

Impact reached:

The story won a 2022 Information is Beautiful Silver Award, beating out hundreds of other entries. The winners are determined by a panel of experts and a public vote. Winning the award increased the story’s visibility and also, by extension, its ability to inspire other journalists and engage members of the public.

When the story came out, it was one of The Economist’s most popular in terms of reader views. It also enjoyed unusually high average engagement times, and it continues to draw high view metrics even now, a year and a half after it first appeared.

My analysis revealed that fears that globalisation would lead to a worldwide musical monoculture dominated by English-language hits have proven wrong. In fact, local music culture is on the rise across the non-English world. That insight inspired one of my colleagues to examine this proposition for pop culture more broadly, in film, television and TikTok. She wrote a piece in October 2022 called “How pop culture went multipolar” which extends my thesis to those realms, and for which I contributed additional analysis based on my earlier work.

Techniques/technologies used:

Data gathering:
I scraped weekly top 100 hits over five years across 70 countries from Spotify’s website; artist nationalities from Popnable; genres from SoundCharts; and music lyrics from Musixmatch and Genius. I used the following Python libraries: Selenium, Requests, Beautiful Soup, tqdm, json, traceback and re. These are all tools that are useful in complex scraping projects. Tqdm, for example, estimates how long code will take to run and displays a progress bar; traceback offers insight into errors; Selenium automates web browsers.
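The scraping loop can be sketched roughly as below. This is illustrative only: the `tr.chart-row` selectors and the page structure are assumptions, not Spotify’s actual markup, and the real pipeline handled many more fields and edge cases.

```python
# Illustrative sketch of a chart-scraping loop. The CSS selectors and page
# structure here are assumptions, not the actual markup that was scraped.
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def parse_chart(html):
    """Extract (rank, title, artist) rows from one chart page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr.chart-row"):
        rank = int(tr.select_one("td.rank").get_text(strip=True))
        title = tr.select_one("td.title").get_text(strip=True)
        artist = tr.select_one("td.artist").get_text(strip=True)
        rows.append((rank, title, artist))
    return rows

def scrape_charts(urls):
    """Download and parse a list of weekly chart pages."""
    records = []
    for url in tqdm(urls):  # tqdm shows a progress bar and time estimate
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        records.extend(parse_chart(resp.text))
    return records
```

Separating parsing (`parse_chart`) from fetching (`scrape_charts`) makes the parser easy to test offline against saved HTML.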

I also implemented an automated translation process using a library called googletrans, which wraps Google Translate, to determine the language of each song’s lyrics.
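A minimal sketch of that detection step is below. The caching logic and the 200-character excerpt are illustrative assumptions; googletrans itself exposes a `Translator` whose `detect()` returns an object with a `.lang` code.

```python
# Illustrative sketch of automated language detection with googletrans.
# The caching and excerpt length are assumptions, not the production logic.

def detect_languages(lyrics_by_track, detector=None):
    """Return {track_id: language code}, caching repeated lyrics excerpts."""
    if detector is None:
        from googletrans import Translator  # lazy import: needs network access
        detector = Translator()
    cache = {}
    out = {}
    for track_id, lyrics in lyrics_by_track.items():
        key = lyrics[:200]  # an excerpt is usually enough to identify a language
        if key not in cache:
            cache[key] = detector.detect(key).lang
        out[track_id] = cache[key]
    return out
```

Injecting the `detector` also makes the function testable with a stub, without hitting Google Translate.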

I used a technique called Principal Component Analysis to reduce the dimensionality of my data. I then performed a clustering analysis using a machine learning algorithm called K-means to gain insight into musical similarities between the countries over this time period. Other Python tools used were scikit-learn’s StandardScaler, pandas and numpy, all foundational to data analysis and machine learning.
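The scale–reduce–cluster pipeline can be sketched as follows. The feature construction is an assumption (here, each country is a row of numeric features such as language shares in its weekly top 100); the number of components and clusters are parameters, with three clusters matching the groups the story describes.

```python
# Illustrative sketch of the PCA + K-means pipeline with scikit-learn.
# The input features per country are an assumption for demonstration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_countries(features, n_components=2, n_clusters=3, seed=0):
    """Standardise features, reduce with PCA, then group with K-means."""
    X = StandardScaler().fit_transform(features)      # zero mean, unit variance
    X_reduced = PCA(n_components=n_components).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_reduced)
    return X_reduced, labels
```

Standardising before PCA matters: without it, features on larger scales dominate the principal components.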

To make data visualizations to help gain insight into my data, I used matplotlib and seaborn. I also used bootstrap sampling to estimate language-identification error rates and make sure they were minimal.

Context about the project:

It was time-consuming and complex to gather and clean this data, given that over 300,000 records and many languages were involved. Lyrics take far longer to scrape than numbers; copyright issues meant that music-lyrics APIs did not expose enough lyrics, or their search engines were too restrictive; and various nuances had to be recognized to make sure the data was properly cleaned (e.g. songs with identical titles and artists might be in different languages).

My solution to the API limitations was to develop a web crawler that prioritized two lyrics sources. Given the volume of server requests, I also had to work around blocks on my scrapers.
I did a bootstrap error rate analysis to help refine my approach.
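The fallback-crawling idea can be sketched as below. The source functions and retry settings are hypothetical stand-ins: the point is trying a prioritized list of lyrics sources in order, with exponential backoff to stay under rate limits.

```python
# Illustrative sketch of a crawler that tries prioritized lyrics sources
# in order. Source functions and backoff settings are hypothetical.
import random
import time

def fetch_lyrics(track, sources, max_retries=3):
    """Try each lyrics source in priority order; back off on failures."""
    for source in sources:  # e.g. [fetch_from_musixmatch, fetch_from_genius]
        for attempt in range(max_retries):
            try:
                lyrics = source(track)
                if lyrics:
                    return lyrics
            except IOError:
                # Exponential backoff with jitter to avoid server blocks.
                time.sleep(2 ** attempt + random.random())
    return None  # no source could supply the lyrics
```

Passing the sources in as functions keeps the priority order explicit and makes the crawler easy to test with stubs.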

What can other journalists learn from this project?

It’s worthwhile to create original datasets to gain unique insights. Gaining expertise in tools beyond Excel and algorithms beyond regression can also open up new possibilities and perspectives, and help maintain readers’ long-term engagement with your work.

Data visualization is valuable in making work engaging and accessible, as is choosing a “fun” topic like music or connecting to a larger theme like globalization.

Project links: