AzMina, Volt Data Lab, InternetLab and INCT.DD monitored 200 profiles of Brazilian journalists on Twitter. Using a dictionary of offensive, misogynistic, sexist, racist, lesbophobic, transphobic and homophobic words, we analyzed 8,300 tweets and found that, while 8% of potentially offensive posts aimed at male journalists were in fact hostile, 17% of those aimed at female journalists were attacks. Among the terms most used against women are “ridiculous”, “scoundrel”, “crazy” and “little woman”.
We believe that in cases like this, data has the power to empower victims. Many women journalists suffer cyberattacks, but their accounts are minimized and they often fail to grasp the real dimension of this wave of offenses. We talked to journalists who had been attacked for the first time; they were able to see themselves in these numbers and use them, in their public statements and on social networks, to denounce the problem.
Misogyny against female journalists on social networks is already a known phenomenon, but now the press was able to show it with data. The data from our analysis were republished by national outlets such as Folha de S.Paulo, Yahoo and Claudia Magazine, and by outlets that cover the journalism industry, such as Portal Imprensa and Portal dos Jornalistas.
We were also able, with this data, to question Twitter about its actions to try to combat misogyny on the platform.
To build an analysis that could articulate social markers of gender, sexuality and race, we took care to include journalists of different genders, racial and ethnic backgrounds and sexual orientations in the list of monitored profiles. The list, which mixed journalists working at a range of Brazilian press outlets, in different regions and at different stages of their careers, included 200 journalists: 133 women and 67 men.
We collected tweets and retweets that mentioned the monitored journalists and contained at least one of the words in our lexicon. The lexicon was built in several phases by linguists, journalists and other specialists, and ended up with 54 terms covering general insults, misogyny, racism, homophobia and other forms of abuse. The tweets mentioning these journalists were collected from May 15 to September 27.
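The collection rule described above (a tweet is kept only if it mentions a monitored journalist and contains at least one lexicon term) can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the handles, the three sample terms and the function names are hypothetical placeholders, and matching whole words case-insensitively is an assumption about how terms were compared.

```python
import re

# Hypothetical placeholders: the real lexicon had 54 terms and the real
# list had 200 journalist profiles.
LEXICON = ["ridiculous", "crazy", "little woman"]
MONITORED = {"@journalist_a", "@journalist_b"}


def build_pattern(terms):
    """Compile one case-insensitive pattern that matches any term as a whole word."""
    # Longest terms first, so multi-word terms are tried before shorter ones.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


TERM_PATTERN = build_pattern(LEXICON)
MENTION_PATTERN = re.compile(r"@\w+")


def should_collect(text):
    """Keep a tweet only if it mentions a monitored profile AND uses a lexicon term."""
    mentions = {m.lower() for m in MENTION_PATTERN.findall(text)}
    return bool(mentions & MONITORED) and TERM_PATTERN.search(text) is not None
```

A match here only means the tweet is potentially offensive; as the text explains, whether it is actually an attack was decided later by manual analysis.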
In this period we collected 3,196,086 tweets and retweets mentioning female journalists and 3,886,861 mentioning male journalists. Disregarding retweets, there were 2,139,593 tweets mentioning female journalists and 2,509,691 mentioning male journalists. In total, we collected 7,082,947 tweets and retweets targeting male and female journalists.
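The figures above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the counts reported in the text; the variable names are illustrative, and the retweet shares are derived values that the original analysis does not report.

```python
# Counts reported in the text (tweets plus retweets, then tweets only).
women_all, men_all = 3_196_086, 3_886_861
women_tweets, men_tweets = 2_139_593, 2_509_691

total_all = women_all + men_all            # all tweets and retweets collected
total_tweets = women_tweets + men_tweets   # original tweets only

# Retweets are whatever remains once original tweets are removed.
women_retweets = women_all - women_tweets
men_retweets = men_all - men_tweets

# Share of each group's collected mentions that were retweets.
women_rt_share = women_retweets / women_all
men_rt_share = men_retweets / men_all

print(f"total collected: {total_all:,}")
print(f"original tweets: {total_tweets:,}")
print(f"retweet share (women): {women_rt_share:.1%}")
print(f"retweet share (men): {men_rt_share:.1%}")
```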
Data collection was performed in the R programming language, with the rtweet package (https://github.com/ropensci/rtweet). We searched for tweets that mentioned at least one profile from a list of journalists, categorized by outlet, gender and other characteristics. The cross-referencing and organization of this content was done in R and SQL, and the dashboards for data analysis were developed in R and Metabase. With filters and features, we were able to run queries and export CSV datasets for smaller, more specific analyses, done mostly in Google Sheets.
What was the hardest part of this project?
In addition to dealing with the large amount of data, it was also difficult to build the lexicon that would allow us to refine our capture. In the first stage, we updated the lexicon developed for MonitorA, a political violence observatory conducted by InternetLab and AzMina magazine in 2020. As that lexicon had been designed to monitor political violence against candidates, many of its words did not fit the new research and many others needed to be included.
After testing the initial lexicon, we were able to refine it through qualitative analysis of the first collected tweets. We then included terms that had appeared during the analysis and removed terms that produced false positives and inflated the number of collected tweets.
After the collection was completed, we analyzed the tweets addressed to men and women separately. Since it was impossible to qualitatively analyze all the collected tweets and retweets, we chose to analyze only the tweets that had at least 5 likes and/or RTs as engagement. The manual analysis was important to remove false positives: tweets that had been captured because they contained words from the lexicon, but that used them out of context and were sometimes not actually offensive.
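The engagement filter used to select tweets for manual analysis can be sketched as below, in Python rather than the R used by the project. The `Tweet` type, the sample data and the function name are hypothetical, and reading "at least 5 likes and/or RTs" as "either count reaches 5" is an assumption; the text would also allow a threshold on the combined count.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    likes: int
    retweets: int


MIN_ENGAGEMENT = 5  # threshold stated in the text


def needs_manual_review(tweet, threshold=MIN_ENGAGEMENT):
    """Assumed reading: a tweet qualifies if likes OR retweets reach the threshold."""
    return tweet.likes >= threshold or tweet.retweets >= threshold


# Made-up sample data, for illustration only.
sample = [
    Tweet("first example", likes=12, retweets=0),
    Tweet("second example", likes=1, retweets=7),
    Tweet("third example", likes=2, retweets=2),
]

review_queue = [t for t in sample if needs_manual_review(t)]
print(len(review_queue), "of", len(sample), "tweets go to manual analysis")
```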
To make sure that there was a common understanding between the researchers of what constituted offenses and what was just criticism, we initially analyzed the first one hundred tweets together. In addition, the tweets that had more complex contexts and could not be easily labeled by just one researcher were analyzed by more than one researcher.
What can others learn from this project?
We really believe that data journalism can learn a lot from the methodology we have implemented since 2020, including in MonitorA, AzMina and InternetLab’s observatory of political gender violence: merging quantitative and qualitative techniques to collect and classify data. There are many lessons about how to monitor phenomena on social networks using text corpora, which, as we know, are alive and need to be updated and built from different perspectives.
It is also possible to learn from the interdisciplinarity of the project, which included the work not only of journalists, but also of lawyers, anthropologists and other specialists.