Public data is everywhere, collected by government agencies, advocacy groups, news organizations and an ever-expanding set of specialty sites. But that data is online in too many different locations to easily search. The Accountability Project (TAP) cuts across these data silos and gives journalists and the public a simple way to search huge volumes of public data about people and organizations. So far, we’ve acquired, standardized and uploaded hundreds of databases, accounting for more than 1.4 billion records. The data has been used by journalists across the country to identify patterns and draw connections between people and organizations.
The goal of TAP was to provide research data to newsrooms and researchers that might otherwise not have access to such data. We have worked with several nonprofit newsrooms to help them with investigations using TAP data, including stories about dark money, nonprofits and health care.
We have nearly 20,000 unique users, with more than 400 sign-ups for the data that requires a login. Because the site is free, it is difficult for us to track how individual searches generate stories, but anecdotally, users say it is a valuable resource.
TAP is built on Amazon cloud technologies: an EC2 web server, a hefty Postgres Amazon RDS database, as well as an Elasticsearch account with elastic.co. Our data is stored as a data lake, with only a subset of fields as native Postgres columns. The full data is kept in JSONB.
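The storage layout described above can be sketched as follows. This is a minimal illustration, not TAP's actual schema: the promoted field names, the `dataset_id` key and the `to_row` helper are all assumptions made for the example. It shows the general idea of copying a few searchable fields into their own columns while keeping the complete original record as a single JSON document (which would be a JSONB column in Postgres).

```python
import json

# Hypothetical subset of fields promoted to native columns.
# The real list of promoted fields is not given in this write-up.
PROMOTED_FIELDS = ("name", "address", "date")


def to_row(record: dict, dataset_id: str) -> dict:
    """Build one storage row: a few promoted columns plus the full record as JSON."""
    row = {"dataset_id": dataset_id}
    for field in PROMOTED_FIELDS:
        # Missing fields become None (NULL in the database).
        row[field] = record.get(field)
    # The untouched original record is kept whole in one JSON document.
    row["data"] = json.dumps(record, sort_keys=True)
    return row


example = to_row(
    {
        "name": "Jane Doe",
        "address": "1 Main St",
        "amount": "250.00",
        "committee": "Example PAC",
    },
    dataset_id="demo_contributions",
)
```

With a layout like this, datasets with very different columns can share one table, because only the promoted fields need a common shape.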
The website itself is built with Django, and an administrative site lets our data fellows pull in datasets from S3 and tag the columns containing names, addresses, dates, and other key fields. The search pages are built using Svelte.js.
The data is split into names and addresses, which are searched in Elasticsearch, and transactions, which are stored in PostgreSQL. A transaction might be an expenditure or a registration.
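As a rough sketch of that split, the snippet below takes one source row plus a set of column tags and separates it into name documents, address documents and a transaction row. The tag values, column names and the `split_record` helper are hypothetical; the real pipeline and document shapes are not described here, and no Elasticsearch or Postgres calls are made.

```python
# Hypothetical column tags, standing in for what a data fellow might assign
# in the admin site. None means the column is left untagged.
COLUMN_TAGS = {
    "donor": "name",
    "donor_addr": "address",
    "paid_on": "date",
    "amount": None,
}


def split_record(record_id, record, tags):
    """Split one source row into search documents and a transaction row."""

    def docs_for(tag):
        # One small document per non-empty tagged column, keyed back to the row.
        return [
            {"record_id": record_id, "column": col, "value": record[col]}
            for col, col_tag in tags.items()
            if col_tag == tag and record.get(col)
        ]

    return {
        "names": docs_for("name"),         # would be sent to the search index
        "addresses": docs_for("address"),  # would be sent to the search index
        # The full row is kept as the transaction, destined for the database.
        "transaction": {"record_id": record_id, "data": dict(record)},
    }


parts = split_record(
    1,
    {
        "donor": "Jane Doe",
        "donor_addr": "1 Main St",
        "paid_on": "2020-01-15",
        "amount": "250.00",
    },
    COLUMN_TAGS,
)
```

Keeping a shared `record_id` on every piece is what would let a search hit on a name or address lead back to the full transaction.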
We then load data into a staging site, so that we can check that search results work as intended before publishing to the live site.
Many of TAP’s databases required filing open-records requests or extracting data from documents. Reviewing hundreds of data sets with a scrappy crew of fellows and data journalists has meant a lot of hard work to gather, check, and add the data to the site. To help keep things organized and on track, we keep a log of every database from request to upload, including the name of the person responsible for it. We’re software agnostic, as long as users can create a reproducible workflow or script, but most of the data processing is done in R.
To allow users to query some data directly, we employed Datasette, along with query templates, so users could write their own SQL to run against TAP data.
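To give a sense of what a query template looks like, here is a small sketch using Python's built-in sqlite3 module, since Datasette serves SQLite databases and accepts named parameters such as `:min_amount` in SQL. The `contributions` table, its columns, the sample rows and the `run_template` helper are invented for illustration and are not TAP's actual templates.

```python
import sqlite3

# Hypothetical template: table and column names are made up for this example.
# Users fill in the named parameters instead of writing SQL from scratch.
TEMPLATE = """
SELECT name, amount
FROM contributions
WHERE name LIKE :name_pattern
  AND amount >= :min_amount
ORDER BY amount DESC
"""


def run_template(conn, template, **params):
    """Run a parameterized query; values are bound, not pasted into the SQL."""
    return conn.execute(template, params).fetchall()


# Tiny in-memory database standing in for a published dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?)",
    [("Jane Doe", 250.0), ("John Doe", 50.0), ("Ann Roe", 500.0)],
)

rows = run_template(conn, TEMPLATE, name_pattern="%Doe", min_amount=100)
```

A template like this gives less experienced users a working starting point, while anyone comfortable with SQL can edit the query freely.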
What was the hardest part of this project?
At the outset, we took on what seemed like an impossible task.
Every stage of this project offered unique challenges. At the outset, we needed to develop a shared vision for The Accountability Project among our team, mapping out an ambitious but manageable path for what we could realistically build with our resources and timeline. This involved identifying parts of our original plan that weren’t reasonable (e.g., how much data we could realistically wrangle), managing our own expectations, and creating a clear plan for the specific datasets we wanted to make available and searchable in our database.
Our main challenge going forward is getting people to use the site regularly as a resource for accountability stories. We’re thinking of new opportunities and partnerships that could help expose more journalists to The Accountability Project and establish it as a go-to tool for reporters.
At the same time, we plan to add new features to the site, including geographical searches and the ability for users to run their own data against what is on the site. We’re also planning to conduct more user experience research to get a better understanding of how people use the site and how we can improve its functionality.
What can others learn from this project?
We learned a lot through this process. For anyone else looking to take on similar projects and build new resources and tools for journalists, we have a few points of advice.
• Research what else is available. What does your project do that is unique? Find people and organizations that you can partner with instead of reinventing the wheel.
• Understand and communicate your shared vision from the outset. We had several key personnel changes with our project, so maintaining consistency and making sure everyone had the same idea of what to expect was challenging.
• Set agreed-upon standards and structures to keep everyone’s work consistent with the end goal of the project.
• Plan for the long, long term. If you want your project to live for a long time, make sure you have a sustainability plan to keep it going. It likely will take longer than you anticipate to launch. We’ve watched plenty of well-intentioned search projects die a slow death from lack of funding because it’s hard to find long-term support. We’re working to lower our operating costs, but also taking steps to make sure the data we’ve gathered helps others.