Trap is one of the most popular music genres among young people in Argentina: ten of the most-listened artists last year moved between Rap, Trap and Reggaeton. We used Natural Language Processing (NLP) tools to analyze their lyrics, demystify the concepts around the genre and understand what these artists, who generate such a furor among the new generations, are talking about. The project set out to attract and create an experience for this kind of audience, unusual for LA NACION, and to cover the topic in a different way for non-specialized readers.
This special project was undoubtedly risky because we stepped completely out of our comfort zone: it was our first special built with machine learning, and it dealt with a topic that is not typical for the newsroom.
On the one hand, we set a precedent for using machine learning in content production at a major mass medium in Argentina; on the other, we analyzed the lyrics of one of the most important music genres of recent years in a way never done before in the country.
This was very well received by the community specialized in the subject, especially because we worked together with El Estilo Libre, an independent outlet specialized in freestyle and trap, and with journalists from the renowned music magazine Rolling Stone. We published the piece together, and thanks to them we achieved greater reach and dialogue with the community. The special was highlighted and praised by freestyle personalities in rap battles.
In addition to the news article, we produced pieces for social media: TikToks, Instagram Lives, Instagram TVs and various graphics. Together these reached more than eighty thousand views, mostly from an audience aged 17 to 25.
The piece was also mentioned in LatAm Journalism Review as an example of AI applied to journalism in Latin America.
The entire data-processing pipeline was built in Python; the visualizations were built in Vue.js.
A total of 692 song lyrics by the top 20 artists of the genre were scraped from the Genius platform. Tokenization and lemmatization techniques were then applied to calculate the frequency of the words used in the songs.
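The tokenization and frequency step can be sketched in a few lines. This is a minimal, standard-library-only illustration with invented lyric fragments (the real pipeline also lemmatized the tokens with NLP tooling, which this sketch omits):

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters, including accented Spanish characters
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

# Hypothetical lyric fragments; the real corpus had 692 scraped songs
lyrics = [
    "Ella baila sola en la noche",
    "La noche es joven y ella baila",
]

freq = Counter()
for song in lyrics:
    freq.update(tokenize(song))

print(freq.most_common(3))
```

In the real pipeline, lemmatization would collapse inflected forms (e.g. "baila" and "bailan") into one lemma before counting, so the frequencies reflect words rather than surface forms.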
Named Entity Recognition (NER) was performed to detect brands using the spaCy library in Spanish, followed by a manual check. A sample of songs was taken at a 95% confidence level and classified manually, word by word, under three labels (places, people and brands) so that the manually labeled entities could be compared against those recognized by the model.
After several iterations, EntityRuler, a spaCy component for adding entity patterns and improving entity recognition, was used. Wikipedia pages containing lists of such entities were also used as input for the patterns.
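The rule-based part of this step can be sketched with spaCy's EntityRuler. In the project the ruler supplemented the statistical Spanish model; the sketch below uses a blank Spanish pipeline and hypothetical patterns so it runs without a downloaded model:

```python
import spacy

# Blank Spanish pipeline: only the rule-based EntityRuler runs here.
# In the project, these patterns complemented the statistical model.
nlp = spacy.blank("es")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical patterns; the project sourced entity lists from Wikipedia pages
ruler.add_patterns([
    {"label": "BRAND", "pattern": "Gucci"},
    {"label": "PLACE", "pattern": [{"LOWER": "buenos"}, {"LOWER": "aires"}]},
])

doc = nlp("Con mi Gucci por Buenos Aires")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Patterns can match either an exact string or a list of token attributes, which is useful for multi-word places like "Buenos Aires".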
The model was then run once more on the sample to build a confusion matrix and compute the F1 score, which was 0.61 in the latest version.
Finally, the full universe of song lyrics was processed, and the output was manually corrected to reach an even higher level of accuracy.
In addition, a technique known as Topic Modelling was applied with the Top2Vec model to identify the topics covered by the songs.
The Spotify API was used to perform a rhythmic analysis of the genre, with variables such as danceability, tempo, energy and acousticness as input. Based on these variables, the songs were clustered into 3 groups with the KMeans algorithm.
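The clustering step can be sketched with scikit-learn's KMeans. The feature values below are invented; in the project they came from Spotify's audio-features endpoint:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-track features: [danceability, energy, acousticness]
# (real values came from the Spotify API, along with tempo)
features = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.75, 0.15],
    [0.30, 0.20, 0.90],
    [0.25, 0.30, 0.85],
    [0.60, 0.50, 0.50],
    [0.55, 0.45, 0.55],
])

# Three groups, as in the project
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
print(km.labels_)
```

Each label identifies the group a track belongs to; inspecting the cluster centers (`km.cluster_centers_`) then tells you what distinguishes each group, e.g. a high-danceability/low-acousticness cluster versus a more acoustic one.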
What was the hardest part of this project?
One of the great challenges of this project was applying techniques designed for prose to texts written in verse, with phrases in several languages (English, Spanish, Italian, among others) and words and abbreviations that are specific to the genre or absent from dictionaries.
In addition to this, we found that Natural Language Processing (NLP) models are not as powerful in languages other than English.
This meant much more manual work: a team of more than 10 people was involved in checking and supervising the different techniques for more than 4 months. It was also one of the team's first projects with these technologies, so it involved a process of learning and testing different models.
The editorial work posed a twofold challenge: making the piece speak to the specialized Trap community while remaining understandable to those unfamiliar with the subject. And installing such a disruptive, young topic in a traditional, century-old newsroom, where political and economic issues predominate and the average age of the audience is over 40, was no easy task.
So we worked with different teams and people of different ages to find the best strategy for the project and achieve broad reach across platforms and audiences.
What can others learn from this project?
The main lessons are hidden behind the challenges mentioned above. One that can be passed on to other journalism teams is the work with Natural Language Processing (NLP) models applied to languages other than English: hard work, trial and error, several iterations, and the search for solutions to improve the models' performance. The key was not to discard the idea, but to look for alternatives and connect with other people who had faced a similar problem.
This was possible thanks to interdisciplinary work between developers, data scientists, designers, and social media and music specialists. Collaboration across teams, and beyond our own media outlet, was key to achieving precision in the algorithms and an immersive experience for users.
But it was also a great learning experience for the newsroom: taking the risk of creating an experience aimed directly at new generations, with a language and format different from what we were used to.