This project shows which groups of people were most susceptible to infection and how COVID-19 spread in China in early January.
After collecting data on 277 confirmed cases and 44 deaths, we analyzed the patients' gender, age, travel history, clinical symptoms, and underlying health conditions to illustrate the main characteristics of those infected. All visualizations of this information are included in this project.
In the second half of the project, we analyzed the 18 clusters reported as of January 26 to see how far the virus had spread from Wuhan to the rest of the country.
This project reached more than 1 million views after it was published on multiple content platforms, including the Shanghai Observer app owned by JieFang Daily, Toutiao.com owned by ByteDance, and WeChat Official Accounts.
It gave our readers a basic understanding of the symptoms of COVID-19 and the populations most susceptible to it at the very beginning of the outbreak, when people knew little and there was no previous information about the virus.
After we opened the case data to the public, we received requests for it from universities, newsrooms and many other research organizations.
The software we used for this project was Excel and Adobe Illustrator: Excel for data collection, encoding and cleaning, and Illustrator for visualization.
We referred to the COVID-19 situation reports on government websites and collected personal information such as the residence, travel history, gender and age of each day's confirmed cases.
In the next step, we organized the collected information into structured data in Excel, with the columns ‘city’, ‘Gender’, ‘Age’, ‘travel to Wuhan’ (1 for yes, 0 for no), and further columns for the various symptoms and underlying health conditions (likewise coded 1 for yes and 0 for no).
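The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual workflow, which was done by hand in Excel. The column names ‘city’, ‘Gender’, ‘Age’ and ‘travel to Wuhan’ come from the description; the symptom and condition columns, the function names and the sample record are hypothetical.

```python
import csv
import io

# Columns named in the project description.
BASE_COLUMNS = ["city", "Gender", "Age", "travel to Wuhan"]
# Hypothetical examples; the real spreadsheet's symptom and
# condition columns are not listed in the description.
SYMPTOM_COLUMNS = ["fever", "cough"]
CONDITION_COLUMNS = ["hypertension", "diabetes"]


def encode_case(city, gender, age, traveled_to_wuhan, symptoms, conditions):
    """Turn one case report into a flat row with 1/0 flags."""
    row = {
        "city": city,
        "Gender": gender,
        "Age": age,
        "travel to Wuhan": 1 if traveled_to_wuhan else 0,
    }
    for name in SYMPTOM_COLUMNS:
        row[name] = 1 if name in symptoms else 0
    for name in CONDITION_COLUMNS:
        row[name] = 1 if name in conditions else 0
    return row


def rows_to_csv(rows):
    """Serialize encoded rows to CSV text that Excel can open."""
    columns = BASE_COLUMNS + SYMPTOM_COLUMNS + CONDITION_COLUMNS
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up record for illustration only, not a real case from the dataset.
example = encode_case("Shanghai", "F", 56, True, {"fever"}, set())
csv_text = rows_to_csv([example])
```

Storing every yes/no attribute as 1 or 0 means that counting cases with a given symptom or condition reduces to summing a column, which is convenient for the kinds of charts the project produced.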
Then we visualized the data as bar charts, pie charts and butterfly diagrams.
What was the hardest part of this project?
The hardest part was collecting and sorting out as much information as possible about the patient cases, including their age, gender, movements, symptoms and past medical history.
In the early stage of the epidemic there was no unified standard for releasing information, so the amount of non-private information disclosed about each case varied, and we had to browse multiple web pages over and over again to collect it. The formats of the public information also varied, which made it difficult to gather everything with a single set of crawlers and code.
To solve these problems, we divided up the work and cooperated closely. Each of us was responsible for several provinces and spent nearly six hours a day collecting, sorting, labelling and proofreading this case information. So far, we have kept the data free and public.
What can others learn from this project?
In the early stage of the epidemic, ordinary people found it hard to tell what was reliable amid the flood of complicated information and reports. As journalists, we need to analyze the facts and data to dispel unfounded speculation, let people know the truth, and relieve panic and anxiety. This is a challenge that journalists should rise to.
Dealing with large-scale unstructured text data requires a team that works together smoothly with a clear division of labor. To extract the important information, it is necessary to design a reasonable and efficient method, define indicators that cover the key information, and keep the data updated in a timely manner.
The most important thing is to stick with it.
One thing worth mentioning is that we have made all the data we collected public. All readers can access the data through the global COVID-19 dataset we set up ourselves. The data also enables journalists to produce more COVID-19-related content to keep readers as informed as possible about the worldwide pandemic.