Wikipedia Abuse Checker

Country/area: India

Organisation: none

Organisation size: Small

Publication date: 21/12/2021

Credit: Shijith Kunhitty


Shijith Kunhitty is a freelance data journalist based in Kerala, India. He has previously worked at Hindustan Times, IndiaSpend and Financial Express.

Project description:

This project tries to find out which Wikipedia pages on India are abused the most.

Abuse of Wikipedia pages is a growing problem in the country, with trolls changing the content of pages around political parties, film actors and cricket players.

Sometimes it’s distortion of facts on a page to fit a preferred narrative; other times it’s outright vandalism, with whole sections getting deleted.

Every weekend, the project goes through the edit histories of more than 150,000 Wikipedia pages on India to measure the level of abuse, and then tweets charts about it on Monday morning.

Impact reached:

To be honest, not much of an impact. I did this more as a personal project. (I talk about it in a write-up on my website.)

The hope is that people with an interest in online culture, media researchers, casual Wikipedia contributors etc. eventually come across it and find it useful. Access to the project’s tweets should save them the effort of building a similar tool of their own.

The project could even be used as some kind of barometer for online discontent around certain topics. So if specific pages are attracting unusual levels of abuse during the week, we get to know there’s some fervour being whipped up around them online, potentially by bad actors.

Techniques/technologies used:

Everything is automated here. In terms of technology, it’s a Python script driving it all. It’s hosted on an Oracle Cloud instance and triggered by the cron scheduler every Friday night. The Wikipedia API is first queried for the edit histories of over 150K pages. This querying can take anywhere from 36 to 50 hours.
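To give a rough idea of what the querying step could look like, here is a minimal sketch that asks the MediaWiki API for one page's revisions, with their tags, over the last two weeks. This is not the project's actual script: the function names, the choice of `rvprop` fields, the English Wikipedia endpoint and the User-Agent string are all assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: the English Wikipedia endpoint is the one being queried.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(title, now, days=14):
    """Build query parameters for one page's revisions in the last `days` days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|tags",
        "rvlimit": "max",
        # Revisions are listed newest first by default,
        # so rvstart is the later timestamp and rvend the earlier one.
        "rvstart": now.strftime(fmt),
        "rvend": (now - timedelta(days=days)).strftime(fmt),
    }


def fetch_revisions(title, now=None):
    """Download all revisions in the window, following API continuation."""
    now = now or datetime.now(timezone.utc)
    params = build_params(title, now)
    revisions = []
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent; a real script should identify itself properly.
        request = urllib.request.Request(
            url, headers={"User-Agent": "abuse-checker-sketch/0.1 (example)"}
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        for page in data.get("query", {}).get("pages", {}).values():
            revisions.extend(page.get("revisions", []))
        if "continue" not in data:
            return revisions
        # The API returns the parameters needed to request the next batch.
        params.update(data["continue"])
```

Calling `fetch_revisions("Virat Kohli")` would return a list of revision dictionaries, each carrying a `tags` list. Repeating this one page at a time for over 150K pages is slow, which fits the long run times mentioned above.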

After the edit histories for the last two weeks are downloaded, the script then runs the analysis, figures out what the most abused pages are for that week, and creates CSVs of the top five pages.
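The ranking step can be sketched roughly as follows, assuming the analysis has already produced a count of reverted edits for each page. The function names, column names and tie-breaking rule here are hypothetical, not taken from the actual script.

```python
import csv


def top_pages(revert_counts, n=5):
    """Return the n pages with the most reverted edits, highest first.

    Ties are broken alphabetically by page title (an assumption).
    """
    ranked = sorted(revert_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


def write_top_pages_csv(rows, handle):
    """Write (page, reverted_edits) rows to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["page", "reverted_edits"])
    writer.writerows(rows)
```

With a dictionary like `{"Page A": 12, "Page B": 30}`, `top_pages` would put "Page B" first, and `write_top_pages_csv` would produce the kind of small CSV a charting library can read directly.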

It then makes graphics based on those CSVs using the Plotly library, and interacts with the Twitter API to tweet out the graphics on Monday morning.

What was the hardest part of this project?

The hardest part of the project was probably deciding what the criterion of abuse should be.

Finding some kind of quantifiable metric meant having to manually sift through a year’s worth of edit history for a seed set of 25 pages. I then settled on using the tags associated with each edit to build my metric, and checked which tags would be relevant for my purpose.

I realised the number of edits that are reverted on a page can be used as an indicator of abuse, and that all such edits are tagged ‘reverted’, so I ended up using that tag in particular.
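As a rough illustration of that metric, the sketch below counts how many of a page's edits carry the reverted tag. It assumes each edit is a dictionary with a `tags` list, as the MediaWiki API returns them, and that the internal name of the tag shown as ‘reverted’ is `mw-reverted`. The function name is hypothetical.

```python
# Assumption: internal MediaWiki name of the tag displayed as "Reverted".
REVERTED_TAG = "mw-reverted"


def count_reverted(revisions):
    """Count the edits in a page's history that were later reverted."""
    return sum(1 for rev in revisions if REVERTED_TAG in rev.get("tags", []))


# Made-up edit history for one page, for illustration only.
sample_history = [
    {"revid": 1, "tags": ["mobile edit", "mw-reverted"]},
    {"revid": 2, "tags": ["mw-undo"]},
    {"revid": 3, "tags": []},
    {"revid": 4, "tags": ["mw-reverted"]},
]
reverted_total = count_reverted(sample_history)
```

Here two of the four made-up edits carry the tag, so `reverted_total` is 2. Running the same count over every page gives a number that can be compared across pages.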

But why this project should be selected has little to do with how hard it may have been to build.

It’s more to do with its potential as a kind of warning system for the Indian public. It shows which Wikipedia pages online users are abusing, and if a page has nothing to do with the news cycle, it’s likely something strange is going on there. (And if you happen to be a researcher or a journalist, you should look into it.)

The ability to highlight strange patterns is what makes this project valuable.

What can others learn from this project?

For journalists, I guess one lesson could be to be more welcoming of open-ended projects like this.

This isn’t a project that delivers time-sensitive insights. It’s not a story that loses value a year from now because the information it’s based on has become outdated. As long as there’s a Wikipedia around, this project can keep generating insights every week.

Also, the fact that distribution is being done on Twitter itself means that there is no website that people have to visit. It reaches people where they are, and if they’re interested, they can follow the @abuse_checker account and get the updates in their timeline. There is no mental appointment that people have to make to visit a certain URL every week.

(This isn’t part of the answer, but I didn’t know where else to put it. It’s kind of a declaration of interests. I know the two Indian judges well. I have worked for Govind Raj Ethiraj at IndiaSpend, and know Gurman Bhatia too from her stint at Hindustan Times. I will let you decide how to deal with this information.)

Project links: