Innovation (large newsrooms) – year 2020
Co-winner: AP DataKit: an adaptable data project organization toolkit
Organisation: The Associated Press
Country: United States
Credit: Serdar Tumgoren, Troy Thibodeaux, Justin Myers, Larry Fenn, Nicky Forster, Angel Kastanis, Michelle Minkoff, Seth Rasmussen, Andrew Milligan, Meghan Hoyer, Dan Kempton
Jury’s comment: AP’s DataKit is an innovation that will change the way many data reporters, editors and teams work and will undoubtedly have a profound impact on the data journalism community at large. Not only is it a tool that can help data journalists work more efficiently and more collaboratively, it is a platform that is already being extended by contributors outside of AP. If data journalism is the imposition of structure and reproducibility with a journalistic bent, DataKit promises to be the tool that enforces that structure and enables more efficiency and collaboration for data teams in every newsroom.
Organisation size: Big
Publication date: 12 Sep 2019
Project description: AP DataKit is an open-source command-line tool designed to help data journalists work more efficiently and data teams collaborate more effectively. By streamlining repetitive tasks and standardizing project structure and conventions, DataKit makes it easier to share work among members of a team and to keep past projects organized and easily accessible for future reference. DataKit is adaptable and extensible: a core framework supports an ecosystem of plugins to help with every phase of the data project lifecycle. Users can submit plugins to customize DataKit for their own workflows.
Impact: The AP open-sourced its project-management tool, DataKit, in September 2019. Our data team has used it internally for two years on every analysis project we’ve done. Its purpose is simple, yet sophisticated: with a few commands on the command line, it creates a sane, organized project folder structure for R or Python projects, including specific places for data, outputs, reports and documentation. It then syncs to GitHub or GitLab, creating a project there and allowing immediate push/pull capabilities. Finally, it syncs to S3, where we keep our flat data files and output files, and to data.world, where we share data with AP members. DataKit’s release came at ONA and attracted the attention of roughly 60 conference attendees, many of whom returned to their classrooms and newsrooms to try it out. It has been adopted by individual users and by the data analysis team at American Public Media, and it is in use in some data journalism classes at the University of Maryland and the University of Missouri. We’ll have another install party for interested data journalists at NICAR in March. The project has also had several open-source contributions from the journalism community: several journalists have built additional plugins for DataKit, including one that syncs data to Google Drive. The impact of DataKit is fundamental: it lets us move more quickly and collaborate better by creating immediate, standardized project folders and hook-ins, so that no data journalist is working outside of replicable workflows. Data and code get synced to places where any team member can find them, and each project looks and acts the same. The result is a library of data projects that are well documented, all in one place and easy to access.
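As a rough illustration of the kind of folder structure described above, the following Python sketch creates a project skeleton using only the standard library. It is not DataKit's own code: DataKit generates its layout from Cookiecutter templates, and the folder names and the `create_project` helper here are assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical folder names, loosely based on the "data, outputs, reports and
# documentation" places mentioned in the text. The real layout is defined by
# AP's Cookiecutter templates.
SKELETON = ("data", "outputs", "reports", "docs")


def create_project(root, name, folders=SKELETON):
    """Create a standardized project directory and return its path."""
    project = Path(root) / name
    for folder in folders:
        # exist_ok=True makes the function safe to re-run on an existing project.
        (project / folder).mkdir(parents=True, exist_ok=True)
    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n")
    return project
```

Because every project produced this way has the same directories in the same places, a teammate opening someone else's project knows where to look for raw data, outputs and documentation.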
Techniques/technologies: DataKit is an extensible command-line tool designed to automate data project workflows. It relies on core Python technologies and third-party libraries to allow flexible yet opinionated workflows, suitable for any individual or team. The technologies at the heart of DataKit are:

* [Cliff](http://docs.openstack.org/developer/cliff/) – a command-line framework that uses Python’s setuptools entry points to load plugins as Python packages
* [Cookiecutter](https://github.com/cookiecutter/cookiecutter) – a Python framework for generating project skeletons

Through the Cookiecutter templates, DataKit creates a series of folder and file structures for a Jupyter notebook or an RStudio project. It also configures each project to sync to the proper GitLab and S3 locations, and loads specific libraries, dependencies and templated output forms (such as an RMarkdown template customized to match AP design style). The AP has built four plugins: for GitLab, GitHub, S3 and data.world. Other open-source users have since built additional plugins to customize DataKit to their workflows, such as syncing to additional data sources (Google Drive) and publishing to outputs such as Datasette.
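The entry-point mechanism that Cliff builds on can be sketched with the standard library alone. The snippet below is a minimal illustration of plugin discovery, not Cliff's or DataKit's implementation. The group name `example.datakit.plugins` is hypothetical (it is not the namespace DataKit actually registers), and `discover_plugins` is an illustrative helper.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return {name: entry_point} for every installed package registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection of entry points.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


# Hypothetical group name, used here only for illustration.
plugins = discover_plugins("example.datakit.plugins")
for name, entry_point in plugins.items():
    # entry_point.load() would import and return the plugin object itself.
    print(name, "->", entry_point.value)
```

A plugin package advertises itself by declaring an entry point under the agreed group in its packaging metadata, so installing the package is enough to make the new command discoverable without changing the core tool.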
The hardest part of this project: The most difficult part of the project was creating clear, concise documentation that would help others use our open-source software. We had never open-sourced something so ambitious before, and we had to anticipate both others’ uses (we created a GitHub plugin despite our team not using GitHub regularly) and others’ pain points in understanding, installing and using DataKit. We created DataKit to scratch our own itch: to make our team work better, faster and with more precision and control. Having DataKit means we spend less time every day handling the messy, boring parts of a project, such as finding old files and creating working directories, and more time on the serious data analysis work we need to be doing. The AP is a collaborative news cooperative, and in that spirit, it made sense this year to fully open-source one of our team’s most powerful tools and share it with others. One of our goals is to make data more accessible to other newsrooms, and we hope DataKit does this by removing some of the barriers to starting an analysis and sharing data.
What can others learn from this project: Creating standardized workflows across a data team leads to quicker, more collaborative and stronger work. Data workflows can be notoriously messy and hard to replicate: Where are the raw data files stored? What order do you run scripts in? Where’s the documentation around this work? Is the most recent version pushed up to GitHub? Can anyone besides the lead analyst even access the data and scripts? DataKit was built to fix that. The thing AP’s Data Team would like others to come away with is that we don’t all have to use messy, irreproducible and bespoke workflows for each project that comes across our desks. A standardized project structure and workflow creates sanity: through DataKit, we at the AP now have an ever-growing library of data and projects that we can grab code from, fork or update when needed, even on deadline. We can also dip into each other’s projects seamlessly: one person’s project looks like another’s, and files and directories are in the same places, with standardized naming conventions and proper documentation. DataKit simply lets analysis teams work better, and faster, together. One real-life example from 2019: when we received nearly half a billion rows of opioid distribution data this summer and were working on deadline to produce an analysis and prepare clean data files to share with members, we had six people working concurrently in the same code repository with no friction and no mess. The AP landed an exclusive story, and shared data files quickly with hundreds of members, thanks to DataKit.