Sky News has developed a system to automate part of the data process to improve the management and communication of the COVID-19 story.
Every 15 minutes, it automatically checked a dozen sources for updates and gathered any new data, then cleaned and restructured it before storing it in a database.
This database became the starting point of multiple analyses, as well as the source of more than 50 automated visualisations.
This system saved the data team many hours and helped our readers keep track of the numbers, putting them into context and providing local information.
In a year in which data teams have exceeded their healthy capacity, this system has been crucial for the team at Sky News.
Data sources routinely used for analyses were integrated into this automation process. As soon as a new source became relevant on an ongoing basis, we incorporated it into the in-house database, automating the gathering, cleaning and restructuring of the data.
Because of this, the most essential and widely used data was ready to use at any time in analyses and visualisations that updated in real time. That saved the data journalists hours of repetitive work and made the storytelling more visual, creative and efficient.
Automation has enabled us to respond to the high demand for stories and improve the quality of our journalism. The time saved has been invested in research, in collecting more specific data to complement official sources, and in producing explainers and investigations. This has expanded our knowledge, allowing us to give our audience more comprehensive information about coronavirus and its impact, covering angles not included in official data.
This system has also been hugely beneficial to the wider newsroom, laying the foundations to explore automation and Artificial Intelligence at Sky News. These vital technologies for the future of journalism make newsrooms more efficient, enrich our coverage and help us to personalise stories relevant to people’s lives.
The impact can also be seen in the response of our audience. Millions of users have engaged with data-rich stories and explainers created through this automation project.
The range of what we offered has been extensive, from analyses of the evolution of the pandemic to the impact of the virus on several sectors. And we have presented it at all levels (national, regional, local and postcode) to make the information as personal as possible for our audience.
Data gathered from all the sources required reformatting: datasets were scattered across text published on static pages, CSVs, Excel and Google Sheets files, PDFs, JSON and XML feeds, and various dashboards that provided no way to export the data.
We built a cloud-based solution using FaaS (Function-as-a-Service) microservice architecture, where each atomic service handled a specific task: detecting changes in the source and extracting, formatting, cleaning up and storing the data.
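As a minimal sketch of how the change-detection step in such an atomic service might work, the function below hashes each fetched payload and compares it with the hash stored from the previous poll, so the downstream extract-clean-store steps only run when a source has actually changed. Sky News has not published its implementation; this is our assumption of one common approach.

```python
import hashlib

def payload_changed(payload: bytes, last_hash):
    """Compare a freshly fetched payload against the hash stored for
    this source. Returns (changed, new_hash) so the caller can decide
    whether to trigger the extract/format/clean/store services."""
    new_hash = hashlib.sha256(payload).hexdigest()
    return new_hash != last_hash, new_hash

# First poll: no stored hash yet, so the source counts as changed.
changed, h = payload_changed(b'{"cases": 412}', None)
# Second poll with an identical payload: nothing new to process.
changed_again, _ = payload_changed(b'{"cases": 412}', h)
```

Hashing the raw payload keeps each service stateless apart from one stored string per source, which suits a FaaS model where functions wake on a schedule, check, and exit.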
Vision and Document artificial intelligence services handled extraction where the data could not be exported: from public dashboards, images and unstructured files.
The extracted data was stored in BigQuery tables and automatically updated multiple data feeds, which were load-balanced across a global content delivery network.
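To illustrate the step from stored rows to the data feeds mentioned above, here is a sketch that groups flat table rows into one chronologically sorted feed per area, ready to be serialised as JSON and served behind a CDN. The field names and area codes are our assumptions for illustration, not Sky News's actual schema.

```python
from collections import defaultdict

def build_area_feeds(rows):
    """Group flat table rows into one chronologically sorted feed per
    area, so each local page can fetch only the series it needs."""
    feeds = defaultdict(list)
    for row in rows:
        feeds[row["area_code"]].append(
            {"date": row["date"], "cases": row["cases"]})
    # Sort each feed by date so charts can consume it directly.
    return {area: sorted(points, key=lambda p: p["date"])
            for area, points in feeds.items()}

rows = [
    {"area_code": "E09000001", "date": "2020-04-02", "cases": 7},
    {"area_code": "E09000001", "date": "2020-04-01", "cases": 5},
    {"area_code": "W06000015", "date": "2020-04-01", "cases": 3},
]
feeds = build_area_feeds(rows)
```

Pre-building one small feed per area, rather than querying the warehouse on every page view, is what makes it practical to serve millions of readers from cached files on a CDN.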
As some of the information wasn’t published in a machine-readable format but given in press conferences or government documents, we created Google Sheets and Microsoft Excel spreadsheets, linked to our in-house database, into which journalists – even those without technical knowledge – could easily enter the information.
Using the R programming language, the data journalists could then easily query the database for stories. As the data kept the same format, we wrote several scripts that have been used and re-adapted multiple times to speed up the process, leaving more time for research and for improving the presentation and design of our stories.
What was the hardest part of this project?
Data consistency has been a key challenge since the beginning of this project. Sources have changed the structure of their data and their sharing methods multiple times, and have backdated changes without notification. To ensure accuracy, the system has been under constant revision.
Access to some data has proved difficult on multiple occasions. Some institutions did not allow data to be downloaded, only explored in their dashboards. We reached an agreement with some of them for a downloadable file, but we also had to develop complex scrapers to get daily numbers for some of the most basic variables.
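A scraper of this kind might, for instance, pull the JSON that a dashboard page embeds in its own source. The markup below is invented for illustration; each real dashboard needed its own, usually far messier, extraction logic.

```python
import json
import re

# Invented example of a dashboard page with no download button that
# nevertheless embeds its figures in a script tag.
SAMPLE_HTML = """
<html><body>
<script id="dashboard-data" type="application/json">
{"date": "2020-04-02", "daily_cases": 412}
</script>
</body></html>
"""

def extract_dashboard_json(html):
    """Pull the embedded JSON payload out of the page source."""
    match = re.search(
        r'<script id="dashboard-data"[^>]*>(.*?)</script>', html, re.S)
    if not match:
        raise ValueError("dashboard payload not found")
    return json.loads(match.group(1))

data = extract_dashboard_json(SAMPLE_HTML)
```

Scrapers like this are fragile by nature: any redesign of the page breaks them, which is one reason the system had to stay under constant revision.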
The lack of a common body or place uniting all the relevant information, especially at the beginning of the pandemic, increased the time needed to identify the sources and the relevant data in each of them. And, as there is no common standard for how the information is published, the cleaning and formatting process was laborious.
All these data challenges could have been better handled with time, but the urgency of the story, together with the relevance of breaking news at Sky News, increased the pressure on the project and the need to complete each stage quickly, without affecting the quality of the product.
The fast-moving and changing nature of the COVID-19 story has also contributed to the pressure. We have been constantly revising, adapting and incorporating new sources and variables to the process.
It should also be noted that the two people involved in this project could not work on it exclusively but had to combine it with other responsibilities. Prior to the start of the project, there was also no proper data structure and architecture in the newsroom, and no established data team. This was created months later.
What can others learn from this project?
Investing time and resources in automating processes and reducing repetitive tasks greatly improves the efficiency, efficacy and quality of the product. The benefits are particularly relevant to small teams, which could not otherwise compete with larger ones. It also helps them to react to the pressure and needs of the newsroom in a demanding and fast-changing context.
Although big and multidisciplinary teams can build complex systems, small teams can benefit hugely from automation by identifying time-consuming repetitive tasks whose automation would have an important impact on journalists’ efficiency and their stories.
A lack of technical resources is usually the main obstacle, especially in small teams. Working on this project, we have learned that identifying the right people within the company, and collaborating with them, is key.
Editorial teams normally lack technical professionals, but they are embedded in bigger groups, teams or companies. Building synergies with other departments within the newsroom or within the company can be a solution to that shortage of technical knowledge.
Data journalists usually act as a bridge between the editorial and technical teams. They have more technical skills and, although not at the level of the developers, a better understanding of the requirements than the editorial team. This makes communication more fluent and smooths the delivery of the project.
This has been an innovative project for Sky News, something the organisation has never done before. Developing it within the company instead of externally has also been crucial. It has given Sky employees the opportunity to expand their knowledge and skills and it has created the know-how within the company. Now these skills exist in the team, they are being re-used on other projects and even expanded as we continue to explore automation and AI in the newsroom.