This data story was created as part of the LAMPA Accelerator for Data Journalists. The main goal of the project was to analyze the language used by Russian-speaking opponents of COVID-19 vaccination on social media, in order to learn how their speech differs from that of neutral/pro-vaccination users, and to compare the popularity of anti-vaccination and neutral/pro-vaccination posts on Russian social media. We parsed more than 3 million vaccination-related posts from the main Russian-language social network, VKontakte, classified them into "Antivax" and "Other" categories using machine learning techniques, and performed exploratory data analysis on the classified dataset.
Our story was published by "Tinkoff Magazine", one of the main Russian online media outlets covering business and economics. Posts linking to the article were also published on "T-M" social media (Instagram, VK, Facebook). To date, the article alone (not counting social media posts) has had 38k views, 115 likes and 530 comments, which is among the top results for 2021 articles published in the same "Statistics" column of "T-M".
When we started this project, all of the team members were driven by a single aspiration: to better understand the nature of the antivax movement in Russia and the scale of its spread. COVID vaccination is not an easy topic to discuss in our country, which is currently divided into two almost equal camps in terms of attitude towards Covid vaccination. We expected that our article would provoke a controversial reaction among readers, including accusations of political bias (completely unfounded, as none of the team members are involved in any political or state-related activity in any way; moreover, 3 out of 4 team members no longer even live in Russia).
We believe that studies like this one aren’t meant to divide us, but to help us all better understand each other and this new era we have been living in for the past 2 years.
First, we wrote a Python script using the official VKontakte API to parse vaccine-related posts published between January and August 2021. It took us one week to collect more than 3 million texts from VK. We then performed initial cleansing of the dataset, removing short messages (up to 40 characters) and "service" symbols, and randomly extracted a sample of 10,000 texts. We manually marked up this relatively small sample (it took us 50 hours), defining four categories of texts: pro-vaccination, anti-vaccination, neutral and irrelevant. We then used this labelled sample as input for supervised training of the classification models. Already at this point we faced difficulties in separating "pro-vaccination" texts from neutral ones: there were only a small number of posts where the vaccine was directly praised. We realized it would be a problem to analyze differences in the speech of supporters and opponents of vaccination, both because of the small number of posts expressing a clearly positive opinion on vaccination and because texts from vaccine supporters were poorly amenable to automatic classification (even manually, these texts could easily be confused with neutral speech). We tested and compared several classification models (Naive Bayes, Logistic Regression and others) with different combinations of parameters, all using CountVectorizer and TfidfTransformer. Although we tested all models for both binary (Anti/Other) and multi-class (Anti/Pro/Neutral) classification, the binary Logistic Regression model on bigrams turned out to show the best results, given the limitations above.
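The cleansing and classification steps described above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the specific regexes for "service" symbols, the label encoding, the toy texts and the n-gram setting are all assumptions made for the example.

```python
import re

def clean_text(text: str) -> str:
    """Strip 'service' symbols (links, hashtags, mentions) from a post."""
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"[#@]\w+", " ", text)       # drop hashtags and mentions
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()

def filter_short(texts, min_len=40):
    """Drop messages shorter than 40 characters, as in our cleansing step."""
    return [t for t in (clean_text(t) for t in texts) if len(t) >= min_len]

# Bag-of-words counts -> TF-IDF -> logistic regression, as in the article.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

model = Pipeline([
    # n-gram range is an assumption; the article only says "on bigrams"
    ("counts", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy labelled sample (1 = "Antivax", 0 = "Other") standing in for the
# 10,000 manually marked-up posts.
texts = [
    "vaccines are a global conspiracy against ordinary people",
    "got my second dose today at the local clinic",
    "they are hiding the truth about vaccine side effects",
    "a new vaccination point opens at the clinic tomorrow",
]
labels = [1, 0, 1, 0]
model.fit(texts, labels)
```

After fitting on the labelled sample, `model.predict` can be applied to the rest of the cleansed corpus to obtain the binary Anti/Other split used for the exploratory analysis.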
A "how-to" guide to our work was published here:
What was the hardest part of this project?
This project was part of a data journalism accelerator, so it had a very limited timeline (1 month for analysis and another 2 weeks for writing and editing texts and visuals).
On the other hand, all members of the team worked on the project in addition to their primary occupations, meaning we could only dedicate part of our time to this work (mainly evenings and weekends). Not to mention that all of us are quite new to data journalism.
Due to the lack of time and limited financing, we couldn’t delegate any tasks to speed up the work. It was faster to do it ourselves than to find and manage contractors.
Consequently, the hardest part was to wisely manage our resources, and to make this project a reality.
We didn't have enough time to do everything we'd have liked to do with the data we'd extracted. For instance, we had an idea to cluster the "Other" category, trying to split it into a few meaningful clusters that would have allowed a more focused analysis and comparison. But after evaluating the time this work would take, we decided to skip it and move on with what we already had (the binary classification results), so that we could publish the article at all.
What can others learn from this project?
We believe that our work, even though it's far from perfect, gives anyone interested in creating a data story based on text analysis with machine learning a clear path to follow, and highlights potential difficulties and problems to anticipate or avoid.
After the article was published, we received some gratifying feedback from other data journalists, both experienced and beginners. Those who are new to the profession told us that reading this article had inspired them to pursue their own projects.