People in Kyrgyzstan often say that deputies plagiarize Russian laws. I analyzed the texts of Kyrgyz laws and found that it is true. I scraped the texts of all 1277 laws from the last convocation of parliament (from 2015).
I removed technical documents such as audit results, reports and village renamings. In 40% of the 805 new and complementary laws, I found articles that are similar to Russian laws.
There is also a chart that shows how many laws each deputy initiated and his or her average percentage of copy-paste.
It’s made in Instagram-story format with animated charts.
People became more aware of how untenable the last convocation of Parliament was. Journalists raised more questions about why deputies copy-pasted Russian laws and how this affects Kyrgyzstan.
It was named in a top 10 by GIJN and republished by other media outlets.
I used Python for scraping, analysis and cleaning. Libraries: pandas, BS4, asyncio, textract, re, selenium, ast etc.
I used the text.ru API to detect plagiarism in the texts; it was excellent at finding plagiarism in consecutive sentences.
What was the hardest part of this project?
The hardest part was cleaning the data and finding plagiarism in 911,712 words across 6,248 articles. I wrote a script that ran for two days straight to pass that amount of text through the API. Because laws are legal texts and often resemble one another, it was difficult to choose a minimum similarity percentage at which to call a text plagiarism. We decided to count an article of a law as plagiarism if it was more than 40% similar to a Russian article. Articles below that threshold were treated as minor copies and excluded from further analysis.
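The 40% rule can be sketched in a few lines of Python. This is only an illustration of the thresholding step, not the project's actual script: the field names (`law_id`, `article`, `similarity`), the helper functions and the sample rows are all hypothetical, and the similarity scores are assumed to have already been obtained from the API.

```python
from collections import defaultdict

THRESHOLD = 40.0  # percent; the cutoff described in the text


def flag_articles(articles, threshold=THRESHOLD):
    """Return the articles whose similarity is strictly above the threshold."""
    return [a for a in articles if a["similarity"] > threshold]


def share_of_laws_flagged(articles, threshold=THRESHOLD):
    """Fraction of laws that contain at least one flagged article."""
    flagged_by_law = defaultdict(bool)
    for a in articles:
        # A law counts as flagged as soon as one of its articles passes the cutoff.
        flagged_by_law[a["law_id"]] |= a["similarity"] > threshold
    if not flagged_by_law:
        return 0.0
    return sum(flagged_by_law.values()) / len(flagged_by_law)


# Hypothetical sample rows, not figures from the project.
sample = [
    {"law_id": "law-1", "article": 1, "similarity": 72.5},
    {"law_id": "law-1", "article": 2, "similarity": 12.0},
    {"law_id": "law-2", "article": 1, "similarity": 40.0},  # exactly 40% is not "more than 40%"
    {"law_id": "law-3", "article": 1, "similarity": 55.0},
]

flagged = flag_articles(sample)
share = share_of_laws_flagged(sample)
```

Using a strict comparison keeps the code in line with the wording "more than 40%"; articles at or below the cutoff simply drop out of the later analysis.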
What can others learn from this project?
It is not necessary to use machine learning, neural networks or NLP to find plagiarism. There are available APIs and tools, and journalists in the CIS can try similar research in their own countries.
My project inspired another media outlet in Kyrgyzstan to use a similar visual approach to tell a story.
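To show how far simple tools can go, here is a minimal sketch that compares two texts sentence by sentence using only Python's standard library. It is not the method used in this project (that was the text.ru API) and it is not how text.ru works internally; the function names, the rough sentence splitter and the `min_ratio` value of 0.8 are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Very rough splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_similarity(a, b):
    """Character-level similarity between two sentences, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def percent_matched(candidate, reference, min_ratio=0.8):
    """Percent of candidate sentences that closely match some reference sentence."""
    cand = split_sentences(candidate)
    ref = split_sentences(reference)
    if not cand or not ref:
        return 0.0
    matched = sum(
        1 for c in cand
        if max(sentence_similarity(c, r) for r in ref) >= min_ratio
    )
    return 100.0 * matched / len(cand)


# Hypothetical example texts, not taken from any real law.
reference_text = (
    "The law enters into force on the day of publication. "
    "The government shall adopt regulations."
)
candidate_text = (
    "The law enters into force on the day of publication. "
    "Bananas are yellow fruit."
)

score = percent_matched(candidate_text, reference_text)
```

A comparison like this is quadratic in the number of sentences, so for a corpus of thousands of articles a dedicated service is still the more practical choice.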