How Did Twitter Identify Large Clusters of Accounts Amplifying Messages Related to the Hong Kong Protests?

Category: Innovation (small and large newsrooms)

Country/area: Taiwan

Organisation: READr

Organisation size: Small

Publication date: 16/09/2019

Credit: Producer: Chien Hsin-chan; Journalist: Lee Yu Ju; Design: Chen Yi-Chian; Data: Chien Hsin-chan, Hung Chih-Chieh

Project description:

On August 19, 2019, Twitter disclosed, for the third time, the activity of political cyber armies it had been monitoring, and released the profiles of the accounts it had deleted. We used machine learning to work out the behavioral characteristics of the deleted accounts, to infer why Twitter may have deleted them, and to look for a way to identify a “cyber army”.


We found that the deleted Twitter accounts did show distinctive behavioral characteristics. This is the first time a media outlet in Taiwan has applied machine learning to news reporting.

Impact reached:

Working from the published data, we used a scientific method to infer the reasons behind it and reported the results and insights. This gives readers better context for understanding the circumstances under which Twitter deleted these accounts and, with the support of Twitter's materials, how Twitter defines them. It also shows what patterns the cyber army's discussions have followed in recent years and what such accounts actually look like. This report goes beyond data visualization or simple analysis in search of insights: by applying machine learning to a larger amount of data, it uses computer science to reconstruct a more realistic picture.

The project also showed us that data journalism has more analysis methods available for reporting than are commonly used. Data journalism can go beyond finding correlation or cause and effect through cross-analysis, and drill down into the data to find more facts.

Techniques/technologies used:

The most important part of this project is the machine learning, and the most important part of machine learning is finding a suitable model. We therefore tried several basic and widely used models, including Decision Tree, XGBoost, and Random Forest. First, the data had to be cleaned so that it could be used with the Python sklearn module, and the fields we considered important and analyzable were used as learning features.
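As an illustration of the cleaning step, the sketch below turns raw account records into numeric feature rows of the kind sklearn expects. It is only a minimal example under stated assumptions: the field names, the two sample records, and the chosen features are hypothetical and are not the actual columns or features used in the project.

```python
from datetime import date

# Hypothetical raw account records. The field names are illustrative
# assumptions, not the actual columns of Twitter's release.
raw_accounts = [
    {"follower_count": "120", "following_count": "3400",
     "account_creation_date": "2019-06-01", "account_language": "zh-cn"},
    {"follower_count": "", "following_count": "15",
     "account_creation_date": "2012-03-14", "account_language": "en"},
]

REFERENCE_DATE = date(2019, 8, 19)  # date of the Twitter disclosure


def to_int(value, default=0):
    """Convert a possibly empty or missing field to an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def to_features(record):
    """Turn one raw record into a list of numbers."""
    followers = to_int(record.get("follower_count"))
    following = to_int(record.get("following_count"))
    created = date.fromisoformat(record["account_creation_date"])
    age_days = (REFERENCE_DATE - created).days
    # Adding 1 to the denominator guards against division by zero.
    ratio = following / (followers + 1)
    language = record.get("account_language", "").lower()
    is_chinese = 1 if language.startswith("zh") else 0
    return [followers, following, age_days, ratio, is_chinese]


feature_names = ["followers", "following", "age_days",
                 "following_per_follower", "is_chinese"]
X = [to_features(r) for r in raw_accounts]
```

Each row of `X` lines up with `feature_names`, so the same list can later be used to label whatever the model reports about individual features.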

Twitter's release of deleted accounts contains two groups of accounts, along with all content posted by the users in both groups. We could therefore use one group for training and the other for validation. This also requires a sufficient number of random Twitter accounts, together with the content they published, as a control group for both training and validation, to ensure that the model and the parameters set for it produce good results.
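The setup described above can be sketched roughly as follows: train on one group of deleted accounts plus control accounts, then validate on the other group plus separate control accounts. This is a minimal sketch and not the project's actual code. The data is synthetic, the helper `fake_accounts` and all parameter values are assumptions, and only two of the model families are shown, using sklearn classifiers (XGBoost is a separate library and is left out here).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def fake_accounts(n, shift):
    """Synthetic stand-in for per-account features (not real Twitter data)."""
    return rng.normal(loc=shift, scale=1.0, size=(n, 4))


# Label 1 = deleted account, label 0 = control account.
group_a, group_b = fake_accounts(300, 1.0), fake_accounts(300, 1.0)
control_train, control_valid = fake_accounts(300, 0.0), fake_accounts(300, 0.0)

# Group A and one set of controls are used for training,
# group B and a separate set of controls for validation.
X_train = np.vstack([group_a, control_train])
y_train = np.array([1] * len(group_a) + [0] * len(control_train))
X_valid = np.vstack([group_b, control_valid])
y_valid = np.array([1] * len(group_b) + [0] * len(control_valid))

# Hypothetical hyperparameters, chosen only for illustration.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_valid, model.predict(X_valid))
    print(f"{name}: validation accuracy = {scores[name]:.3f}")
```

Comparing validation scores across models and parameter settings in this way is one common route to choosing a model. The numbers printed here say nothing about the real accounts, since the input is synthetic.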

What was the hardest part of this project?

For machine learning to work well, it is essential to find a sufficient number of random Twitter accounts and the content those accounts have published. If the control accounts are not random enough, the computer can too easily find features that separate the groups. And given the 940 accounts published by Twitter, we also needed to find a similar number of accounts so that the model had enough data to learn from.

At present, the more complete collections of Twitter account information and published content are probably on Kaggle, the online AI data platform, but most Kaggle datasets were built around specific attributes and lack sufficient data fields. In the end we found two datasets with complete fields. We used part of the data from these two datasets, collected the Twitter accounts of some public figures, and scraped the posts from those accounts. After mixing these sources, we divided the data into a training group and a validation group. Only then could we achieve a better training result.
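The mixing and splitting step could look something like the sketch below, which balances the deleted accounts against the control accounts, shuffles them together, and splits them into training and validation groups. The function `mix_and_split`, the placeholder account IDs, the control pool size of 1500, and the 30% validation share are all assumptions made for illustration. Only the figure of 940 deleted accounts comes from the text above.

```python
import random


def mix_and_split(deleted, controls, valid_fraction=0.3, seed=0):
    """Balance two labeled groups, shuffle them together, and split them."""
    rng = random.Random(seed)
    # Downsample the larger group so both classes have the same size.
    n = min(len(deleted), len(controls))
    rows = [(account, 1) for account in rng.sample(deleted, n)]
    rows += [(account, 0) for account in rng.sample(controls, n)]
    rng.shuffle(rows)
    n_valid = round(len(rows) * valid_fraction)
    return rows[n_valid:], rows[:n_valid]


# Placeholder IDs standing in for the real datasets.
deleted_ids = [f"deleted_{i}" for i in range(940)]
control_ids = [f"control_{i}" for i in range(1500)]

train_rows, valid_rows = mix_and_split(deleted_ids, control_ids)
print(len(train_rows), len(valid_rows))
```

Keeping the two classes close in size, as the text describes, avoids a model that scores well simply by predicting the larger class.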

What can others learn from this project?

At present, most data journalism focuses on comparing datasets or finding correlations between them, trying to find the causes behind the data or to understand the results obtained under various conditions. In some cases, more computer science methods (such as machine learning) can be used to get closer to the facts and to the causes that actually drive the results. In general, AI and machine learning use existing data to predict or achieve an expected result. But machine learning can also be used as a kind of reverse engineering that tells us under what conditions the final results become the ones we already know.

Therefore, although we did not use machine learning in its usual way, to predict future results, this process shows how these different technologies can be put to good use to produce more scientific reports.
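One way to picture the "reverse engineering" use of machine learning is to train a tree on accounts whose outcome is already known and then read back which features and rules the tree relied on. The sketch below does this with sklearn's `feature_importances_` attribute and `export_text` helper. It is a toy example on synthetic data with hypothetical feature names, and it is not a description of the project's actual features or findings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# Hypothetical feature names, for illustration only.
feature_names = ["tweets_per_day", "following_per_follower", "account_age_days"]

# Synthetic data: only the first feature actually differs between classes.
n = 400
X = rng.normal(size=(2 * n, 3))
y = np.array([1] * n + [0] * n)  # 1 = deleted account, 0 = control account
X[:n, 0] += 3.0  # "deleted" accounts tweet more often in this toy example

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Importance scores sum to 1; a higher score means the feature contributed
# more to reducing impurity across the tree's splits.
ranked = sorted(zip(feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

# Human-readable decision rules learned by the tree.
print(export_text(tree, feature_names=feature_names))
```

Read this way, the model is used less as a predictor and more as a summary of the conditions under which an account ends up in the already known "deleted" group.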

Project links: