startr – a template for data journalism in R

Category: Innovation (small and large newsrooms)

Country/area: Canada

Organisation: The Globe and Mail

Organisation size: Big

Publication date: 4 Dec 2019

Credit: Tom Cardoso and Michael Pereira

Project description:

All data journalists know the pain that comes with starting a new project: hours spent setting up folders, creating files and writing boilerplate code before starting on the analysis. Worse yet, how should you set up your project so others can quickly and easily collaborate and verify your work? Startr, an open-source project built by data journalists at The Globe and Mail, is a tool for the statistical programming language R that streamlines the data journalism process, reducing the amount of time and effort journalists spend setting up and maintaining a project so that they can focus on the analysis.

Impact reached:

Startr has completely changed The Globe and Mail’s data journalism practice, which at this point is mostly focused on R-driven analyses. Previously, journalists all set up their projects and structured their files differently, wrote R in different styles, and would often include code that could only run on their own computer (for instance, by pointing to a folder path that existed only on their machine). This made sharing work and analysis an incredibly painful process, as a new collaborator would spend upwards of an hour setting up folders, installing packages and rewriting parts of the code so they could reference files on their own system. With startr and a sister command-line tool, that time has been cut to just 10 seconds.

Crucially, the project has also made the work of data verification – essentially, fact-checking the analysis and making sure there are no bugs in the code – much more straightforward. Before startr, it was something Globe data journalists dreaded doing. Now, it’s a fast and enjoyable task, since the analysis is structured into discrete steps.

Startr has also streamlined the process of creating web and print graphics. Using R’s own visualization tools, we can now generate complex, Globe-styled graphics that are ready to be dropped into Adobe Illustrator for final fine-tuning, whether for online publication with a tool like the New York Times’ ai2html or for layout in print.

Finally, startr has drastically reduced the amount of time we spend on an individual analysis project, since all the upfront setup work and thinking about structure has been abstracted away. An analysis or story that used to take us days to do can now be done in hours, in some cases.

Techniques/technologies used:

As an R project, startr makes heavy use of the “tidyverse” set of packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats) as well as scraping tools (such as rvest) and other R-specific technologies like RMarkdown, which we use to generate HTML “reports” that can be shared with non-coding reporters.

At its core, startr is essentially a series of folders and R files that orchestrate the data analysis on behalf of the data journalist. The structure is designed around three data “steps”:

  1. Processing: This is where a user imports their source data files, tidies them, fixes errors, applies manipulations and saves out a CSV ready for analysis.

  2. Analysis: Using the files generated during processing, this is where all of the true “analysis” occurs, including grouping, summarizing, filtering, etc.

  3. Visualization: This uses the analyzed data from the second step to generate charts.

An optional fourth step, generating reports using RMarkdown, has made it easier to create HTML files full of charts and other insights that can be shared with non-coding reporters.
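As a rough sketch of those steps (the tidyverse packages are among the ones startr imports, but the dataset, column names and file names below are illustrative stand-ins, not startr’s own), a single pipeline might look like this:

```r
library(dplyr)
library(ggplot2)

# Step 1: Processing — tidy the raw source data and save out an
# analysis-ready CSV. (Built-in mtcars stands in for source files.)
processed <- mtcars %>%
  tibble::rownames_to_column("model") %>%
  select(model, mpg, cyl, wt)
write.csv(processed, "processed.csv", row.names = FALSE)

# Step 2: Analysis — group and summarize the processed data.
analyzed <- processed %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n())

# Step 3: Visualization — chart the analyzed data.
chart <- ggplot(analyzed, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean fuel economy (mpg)")
ggsave("chart.png", chart, width = 6, height = 4)
```

Because each step reads only what the previous one produced, any stage can be re-run or verified in isolation.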

Analysis reproducibility is a key consideration in startr – in essence, we want two different people running the same analysis on two different computers to come to the same conclusions. To that end, we enforce certain rules throughout the project: no “output” data files from the processing step get committed to version control software like Git, no variable is ever reassigned, and so on. Our extensive readme file documents all the “rules” a startr user should try to follow.
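Two of those rules can be illustrated in a few lines of base R (the file path, data and variable names here are hypothetical, not taken from startr itself):

```r
# Rule: reference files relative to the project root, never via an
# absolute path, so the same code runs on any collaborator's machine.
# (The path below is illustrative.)
raw_path <- file.path("data", "raw", "source.csv")

# Rule: never reassign a variable. Each transformation gets a fresh
# name, so every intermediate step can be inspected during verification.
raw      <- data.frame(value = c(10, 25, NA, 40))
cleaned  <- raw[!is.na(raw$value), , drop = FALSE]  # new name, not raw <- ...
analyzed <- data.frame(total = sum(cleaned$value))
```

Keeping every intermediate object around makes fact-checking a matter of printing each one in turn, rather than re-running the script with breakpoints.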

Startr also assumes that data will sometimes come from a scraping task, so it ships by default with rvest (a scraping package for R), a dedicated folder to hold scraping code, and a recommended Node.js version for users who choose to scrape in JavaScript.
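A minimal rvest sketch of the kind of scraping code that folder might hold (the HTML is inlined so the example runs offline; the table, selector and data are invented for illustration):

```r
library(rvest)

# Parse an inline HTML snippet instead of fetching a live URL,
# so the example runs offline. (The table contents are invented.)
page <- minimal_html('
  <table id="results">
    <tr><th>riding</th><th>votes</th></tr>
    <tr><td>Centre</td><td>1200</td></tr>
    <tr><td>North</td><td>950</td></tr>
  </table>')

# Select the table by its id and convert it to a data frame.
results <- page %>%
  html_element("#results") %>%
  html_table()
```

In a real scraper, `read_html()` would fetch the page from a URL before the same extraction steps run.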

What was the hardest part of this project?

Boiling down the analysis process into a simple set of explainable steps took a lot of work. Everyone likes to work differently, and the two creators of this project are no different, so we had to dig deep to find where we had common ground in our approaches to analysis and build up from there. It took a lot of conscious un-learning of bad habits and effort in developing better ones.

We debated the structure of projects, and tested ideas out on some ongoing projects to refine our thinking before scaffolding what eventually became the startr project. Even then, we learned as we went, figuring out which packages to import by default, how to handle data being saved into version control, and so on.

What we ended up with is a structure – both literally, in the sense of folders and files, and philosophically, in terms of our assumptions about the data journalism process – that feels “bulletproof”: obvious in hindsight, yet insightful about how data analyses should be done. Our hope is that if we were to explain the project and its philosophy to a seasoned data journalist, they would say, “Oh, yeah, that makes a lot of sense.”

The project is still evolving as we use it on a daily basis. We’re still learning about the template’s limitations, and we refine it week by week with new helper functions, additions to the readme guide, and so forth.

What can others learn from this project?

Standardizing your analysis workflow requires an upfront investment, but it pays off significantly down the line. With startr, we’ve eliminated almost all of the time we used to spend setting up folders, explaining (and apologizing for) why our code looks the way it does, and so on.

When it comes to data-driven projects, being organized is key. Any disorganization reveals itself in the final product: findings that can’t be replicated, or a project that a new user can’t run from scratch without serious tweaking.

Project links: