Although there has been great reporting on the under-representation of ethnic minority characters in Hollywood movies, a data-driven look at how they are presented has been missing from the conversation. So, to illustrate how stereotypes have developed in Hollywood, we analyzed the tropes about different ethnic groups present in more than 6,000 Oscar-eligible movies since 1928.
Our analysis and the accompanying reporting show that, even though some of the more obviously racist tropes have faded from cinema in the past decades, many stereotypical narratives have shifted rather than disappeared. This analysis was published in English, German, Turkish and Chinese.
This article has enjoyed an exceptionally ‘long tail’ of visits: it still receives at least as many weekly visits as when it was first published. It also has an average dwell time of more than five minutes. Users generally spend more time with our data-driven stories than they do with other DW content, but this one stood out even among data-driven articles.
The story was also discussed intensively on the DW News Facebook channel. The Turkish and Chinese language teams of DW also adapted the article, prompting discussion on social media in these communities as well.
You can find a full account of our methodology, as well as the data and code behind the analysis, on our GitHub page (see additional links).
The main tool for this story was the statistical programming language R, which we used for scraping, analysis and preliminary visualization.
From the annual Academy Awards “Reminder List”, as well as the official Academy database of Oscar nominees and winners, we compiled a list of around 28,200 relevant Hollywood movies. Where possible, we also merged these datasets with IMDb data for further metadata and fact-checking.
For the stereotypes present in those movies, we systematically scraped results from the user-generated wiki TVTropes. As with all user-generated data, there is bound to be some margin of error. Still, TVTropes is the best option for getting detailed and large-scale data on something as complex and subjective as movie tropes. As one of our precautions, we excluded entries that had been edited by only one user.
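To make the single-editor precaution concrete, here is a minimal sketch in Python. It is an illustration only, not the code from our repository (which is written in R): the function name, the page names and the user names are all hypothetical, and it assumes the edit history is available as simple (page, user) pairs.

```python
from collections import defaultdict


def entries_with_multiple_editors(edits):
    """Return the set of pages edited by at least two distinct users.

    `edits` is an iterable of (page, user) pairs. This input shape is an
    assumption made for this sketch, not the format of the scraped data.
    """
    editors = defaultdict(set)
    for page, user in edits:
        editors[page].add(user)
    return {page for page, users in editors.items() if len(users) >= 2}


# Hypothetical edit log: ExampleB was edited twice, but by the same user.
edits = [
    ("Film/ExampleA", "alice"),
    ("Film/ExampleA", "bob"),
    ("Film/ExampleB", "carol"),
    ("Film/ExampleB", "carol"),
]

kept = entries_with_multiple_editors(edits)
```

Counting distinct users rather than the number of edits is the point of the check: a page revised many times by a single person is still only one person's view.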
To match TVTropes entries to the correct movies and avoid false matches, we used a mix of pattern recognition rules and manual work in R and OpenRefine. When in doubt, we erred on the side of caution, and in the end, we had a sample of 6,637 matched eligible movies and 21,789 unique encountered tropes to work with.
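The general idea behind this kind of matching can be sketched as follows. This Python example is a simplified illustration under our own assumptions, not the actual rules we applied in R and OpenRefine: the normalization steps, the function names and the sample titles are hypothetical. It matches on a normalized title plus release year and sets ambiguous cases aside for manual review instead of guessing.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, drop a leading article."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    text = re.sub(r"^(the|a|an) ", "", text.strip())
    return re.sub(r"\s+", " ", text).strip()


def match_entries(movies, wiki_entries):
    """Match wiki entries to movies; return (matched, needs_review).

    Both inputs are lists of (title, year) pairs; a wiki entry's year
    may be None. Entries without any candidate are silently dropped.
    """
    index = {}
    for title, year in movies:
        index.setdefault(normalize_title(title), []).append((title, year))

    matched, needs_review = [], []
    for w_title, w_year in wiki_entries:
        candidates = index.get(normalize_title(w_title), [])
        if w_year is not None:
            candidates = [c for c in candidates if c[1] == w_year]
        if len(candidates) == 1:
            matched.append(((w_title, w_year), candidates[0]))
        elif len(candidates) > 1:
            # Ambiguous (e.g. remakes sharing a title): leave to a human.
            needs_review.append((w_title, w_year))
    return matched, needs_review


# Hypothetical sample data for illustration.
movies = [
    ("The Godfather", 1972),
    ("Amélie", 2001),
    ("A Star Is Born", 1954),
    ("A Star Is Born", 2018),
]
wiki_entries = [
    ("Godfather", 1972),
    ("Amelie", 2001),
    ("A Star Is Born", None),
    ("Unknown Film", 1999),
]

matched, needs_review = match_entries(movies, wiki_entries)
```

In this sample, the two unambiguous titles are matched automatically, the remake with no year goes to the review list, and the entry with no candidate is dropped. That mirrors the principle of erring on the side of caution.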
In addition to the methodology page in our GitHub repository, we also created an interactive table that allowed users to search through our trope database themselves (see additional links).
The data cleaning software OpenRefine was used for pattern-based matching between metadata and TVTropes entries. We also used Adobe Illustrator to adapt the visuals for publication.
What was the hardest part of this project?
Analyses of how minority roles play out on screen have remained largely anecdotal in the past years. That’s because quantitative content analyses are usually expensive and time-consuming, especially at the scale we aimed for. Our approach worked around that by using the best aspects of user-generated data and avoiding its drawbacks as far as possible. Still, the main challenge was building a robust sample: maintaining correctness while including as many movies as possible. In scraping, matching and analyzing the data, we used a mix of automation, pattern recognition and manual work. We took great care in developing our methodology and made our process as transparent as possible. Our methodology, as well as the data and code themselves, are published on our GitHub page. This way, we keep ourselves accountable to our audience and help any other interested party benefit from our work. With a final sample of 6,637 matched eligible movies and 21,789 unique encountered tropes, this project offers a scale of analysis that previous projects, both journalistic and scientific, did not reach.
With this project, we attempted to add a new perspective to the discussion around the Hollywood representation of people of minority ethnicities.
Our data-driven approach adds to the qualitative analyses that media experts, and people of color in general, have been publishing for decades. The final article aims to combine the power of data-driven reporting with traditional journalistic work that explains the why, the how, and the real-world implications of stereotypical depictions in Hollywood.
What can others learn from this project?
Since all of our work is available via our GitHub repository, everyone is welcome to use what we learned for their own analyses. Specifically, our approach of using large-scale user-generated data to quantify complex concepts like stereotypes might be helpful to others. We also welcome people using and refining the system of checks and balances we devised to test the validity of this type of data. These methods can open the door to new types of data-driven stories and bring new perspectives to important ongoing discussions in media and society.