The series used a combination of computational techniques and investigative sleuthing to uncover in unprecedented detail how the Chinese government spreads propaganda and disinformation online, both on the Chinese internet and on social media around the world.
The projects had the following impacts:
Twitter released a dataset of accounts linked to Chinese state-backed coronavirus propaganda in June 2020, three months after we published our article; the company had declined to respond when we shared the accounts with it before publication.
After our story about Larry King, Ora no longer allowed King to tape infomercials on the set.
Rep. Jim Banks (R-IN) introduced the Countering Chinese Propaganda Act (H.R. 8286) in September 2020, citing our reporting.
For the first two stories, we scraped millions of interactions on Twitter by suspicious accounts to identify bot networks within them and uncover their origins. The scraped data included tweets, posted media, likes, accounts following and accounts followed. We stored this data in a PostgreSQL database running on Amazon Web Services (AWS). In all, we collected nearly 170 GB of data for analysis (about 3 GB of account activity and about 167 GB of media).
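A minimal sketch of how such scraped interactions might be stored. The table and column names below are illustrative, not the project's actual schema, and SQLite stands in here for the PostgreSQL instance on AWS.

```python
import sqlite3

# Illustrative schema for scraped account activity; the real project
# used PostgreSQL on AWS with its own (unpublished) table layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (
    account_id  TEXT PRIMARY KEY,
    handle      TEXT,
    created_at  TEXT
);
CREATE TABLE tweets (
    tweet_id    TEXT PRIMARY KEY,
    account_id  TEXT REFERENCES accounts(account_id),
    posted_at   TEXT,
    text        TEXT
);
CREATE TABLE follows (
    follower_id TEXT REFERENCES accounts(account_id),
    followed_id TEXT REFERENCES accounts(account_id),
    PRIMARY KEY (follower_id, followed_id)
);
""")

# Invented sample rows standing in for scraped data.
conn.execute("INSERT INTO accounts VALUES (?, ?, ?)",
             ("1", "example_bot", "2020-03-01"))
conn.execute("INSERT INTO tweets VALUES (?, ?, ?, ?)",
             ("100", "1", "2020-03-02T08:00:00", "sample tweet"))
conn.commit()

n_tweets = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(n_tweets)
```

Keeping account metadata, tweets and follow edges in separate tables makes it straightforward to later pull out per-account features or the follow graph with simple joins.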
We used the Twitter API and twint, an open-source Twitter scraping tool, to find and scrape millions of interactions by potential bots within the network. We analyzed the scraped data using open-source libraries, including the machine learning packages scikit-learn and xgboost and the Chinese-language natural language processing library jieba. We performed our analysis of the bot account networks with the library networkx and the visualization software Gephi.
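The network-analysis step can be illustrated with a toy example, assuming the scraped interactions are reduced to (retweeter, original poster) pairs; the account names below are invented.

```python
import networkx as nx

# Hypothetical retweet interactions: (retweeting account, retweeted account).
retweets = [
    ("bot_a", "state_media"), ("bot_b", "state_media"),
    ("bot_c", "state_media"), ("bot_a", "bot_b"),
    ("user_x", "user_y"),
]

# Directed graph with an edge from each retweeter to the account it amplifies.
G = nx.DiGraph()
G.add_edges_from(retweets)

# Clusters of accounts that amplify one another appear as weakly
# connected components; heavily amplified accounts have high in-degree.
components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)
hub, in_deg = max(G.in_degree(), key=lambda pair: pair[1])
print(len(components))  # 2
print(hub)              # state_media
```

On real data, the same component and degree statistics help separate coordinated amplification clusters from ordinary, loosely connected users before exporting the graph to Gephi for visualization.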
For the third story, we received a leak of more than 90GB of files, including secret government directives and memos, as well as contracts, documents and other media. We wrote scripts to generate a spreadsheet database indexing the thousands of directories within the file structure. This allowed our team to read and annotate the files collaboratively.
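The indexing step can be sketched with the standard library alone: walk the leak's directory tree and emit one spreadsheet row per file. The paths and fields below are invented stand-ins, not the project's actual scripts.

```python
import csv
import io
import os
import tempfile

def index_tree(root):
    """Walk a directory tree and return one row per file found."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rows.append({
                "directory": os.path.relpath(dirpath, root),
                "filename": name,
                "bytes": os.path.getsize(full),
            })
    return rows

# Demo on a throwaway tree standing in for the leaked file structure.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "directives"))
    with open(os.path.join(root, "directives", "memo1.txt"), "w") as f:
        f.write("example")

    rows = index_tree(root)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["directory", "filename", "bytes"])
    writer.writeheader()
    writer.writerows(rows)
    print(out.getvalue())
```

A CSV like this can be loaded into any shared spreadsheet, giving reporters a sortable, annotatable map of the leak without everyone having to navigate the raw file tree.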
What was the hardest part of this project?
It was infeasible to examine by hand each of the tens of thousands of accounts we scraped to determine whether they were connected to the same scheme. Instead, we used machine learning to identify likely Chinese state-backed fake accounts based on profile information (e.g., account name, Twitter handle, account age, profile picture, etc.) and account behavior (e.g., timing, frequency and language of posts, retweeting and liking activity, etc.). We labeled hundreds of known fake and real accounts by hand and used that information to build a machine learning model that could analyze data from an account it had not seen before and assess the likelihood that it was also fake. We then sampled and checked our results by hand. This process allowed us to identify more than 10,000 similar inauthentic accounts with suspected links to the Chinese government.
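In outline, that classification step might look like the sketch below. The features and synthetic training data are invented stand-ins, and a scikit-learn random forest substitutes for whatever model the team actually settled on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Invented per-account features: [account_age_days, tweets_per_day,
# retweet_ratio]. The synthetic "fakes" skew young, hyperactive and
# retweet-heavy; the "real" accounts are older and quieter.
n = 200
fake = np.column_stack([
    rng.uniform(1, 60, n),      # recently created
    rng.uniform(50, 300, n),    # very high posting rate
    rng.uniform(0.8, 1.0, n),   # almost all retweets
])
real = np.column_stack([
    rng.uniform(200, 4000, n),
    rng.uniform(0, 20, n),
    rng.uniform(0.0, 0.6, n),
])
X = np.vstack([fake, real])
y = np.array([1] * n + [0] * n)  # 1 = fake, 0 = real (hand labels)

# Train on the hand-labeled accounts, then score an unseen account.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
unseen = np.array([[10, 150, 0.95]])        # a bot-like profile
p_fake = model.predict_proba(unseen)[0, 1]  # estimated probability of fake
print(round(p_fake, 2))
```

The model outputs a probability rather than a verdict, which fits the workflow described above: high-scoring accounts are flagged, then a sample is verified by hand before any account is counted as part of the network.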
It is also often difficult to connect a fake network with its ultimate operators. In this case, we combined our data analysis with reporting to link a 2,000-account bot network to a Chinese internet PR company hired by state media.
What can others learn from this project?
Collaborative projects that combine traditional reporting with data can produce unique work with impactful outcomes. These stories brought together reporters with diverse skill sets, integrating expertise across multiple domains: investigative reporting and data reporting; data analysis and machine learning; Chinese language, culture and government; knowledge of the tech industry; the ability to read and understand code; and cutting-edge document analysis tools. The end result was a set of investigations that very few other teams could have produced.