The Accountability Project was created as a tool for searching across many otherwise siloed databases from one place. TAP is an important tool at a time when newsroom resources are scarce and journalists must work more quickly. Our team, part of The Investigative Reporting Workshop, standardizes and curates public data from federal, state and local governments. Formally launched in June 2019, team TAP continues to add data and features. The collection currently provides free access to more than 550 million records from more than 320 databases.
Since launch, we have worked to facilitate stories and research that hold companies, government agencies and people in power accountable. We are working directly with newsrooms, including two nonprofit news organizations and a major metro newspaper, to help them draw from TAP for stories on everything from dark money to insider dealing at nonprofits to tracking money from pharmaceutical companies.
The late David Donald, a longtime data journalist, conceived the original idea for TAP, seeing it as essential to accountability journalism. “The key is the link among databases that provide the connections that allow us to hold the powerful accountable for their decisions and actions,” he wrote in his original proposal.
In building the site, we’ve developed processes for dealing with many large data sets. We developed a system of data checks and standardizations. We’ve trained young journalists and students at American University and other schools in those methods, which they can apply in their careers.
We currently have nearly 3,000 unique users and are collecting user feedback so that we can make the site even more useful to professional journalists and researchers as we continue to add data.
Note regarding mobile/desktop: TAP was designed to work on mobile and desktop. Currently, most users are accessing the site via desktop.
This is a full-stack web application with a number of moving parts, so we used a combination of open-source technologies to build it – more on that below.
The basic idea is simple: Acquire datasets by FOIA or download, perform minimal data cleaning and field restructuring, upload to a “data lake,” map which fields need to be text searchable, and index into full-text search. We then verify that search results are accurate and publish the data.
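The pipeline above can be sketched in a few lines of Python. This is a minimal illustration of the stages described, not TAP's actual code; every function and field name here is hypothetical.

```python
# Hypothetical sketch of the ingest pipeline: clean rows, then split each
# into a searchable part (for the full-text index) and the remainder
# (for the "data lake"). Names and fields are illustrative assumptions.

def clean(raw_rows):
    """Minimal cleaning: trim whitespace and drop rows that are entirely empty."""
    for row in raw_rows:
        cleaned = {field: value.strip() for field, value in row.items()}
        if any(cleaned.values()):
            yield cleaned

def ingest(raw_rows, searchable_fields):
    """Yield (search_doc, lake_row) pairs, mirroring the mapping step:
    only fields marked text-searchable go to the full-text index."""
    for row in clean(raw_rows):
        search_doc = {f: row[f] for f in searchable_fields if f in row}
        lake_row = {f: v for f, v in row.items() if f not in searchable_fields}
        yield search_doc, lake_row

# Example: one campaign-finance-style row, with only "name" indexed.
rows = [{"name": "  Acme Corp ", "amount": "500", "date": "2019-06-01"}]
for doc, rest in ingest(rows, searchable_fields=["name"]):
    print(doc, rest)
```

The real system then verifies search results against the source data before publishing, a step no sketch can automate.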
Another choice we’ve made is to use “human intelligence” to process and map the data sets. We tested machine classification techniques but have found highly trained humans to be more reliable at this task. At least two people review the data before it goes live.
The data is split into names and addresses, which are indexed for search in Elasticsearch, and what we’re calling transactions, which are stored in PostgreSQL and linked to an entity in Elasticsearch. A transaction might be a campaign expenditure or a voter registration, for example.
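The entity/transaction split can be illustrated with plain Python data structures. The shapes below are assumptions for illustration, not TAP's actual schema: one searchable entity document, and transaction rows that point back to it by ID.

```python
# Hypothetical shapes for the entity/transaction split described above.

# Entity document, indexed in Elasticsearch for full-text search:
entity = {
    "id": "ent-00042",
    "name": "Jane Q. Public",
    "address": "123 MAIN ST SPRINGFIELD IL 62701",
}

# Transaction row, stored in PostgreSQL and linked back to the entity:
transaction = {
    "entity_id": "ent-00042",   # reference to the entity in the search index
    "type": "campaign_expenditure",
    "amount_cents": 25000,
    "date": "2019-06-15",
}

def transactions_for(entity, all_transactions):
    """The join: once a search hit identifies an entity, fetch its
    transactions by entity id."""
    return [t for t in all_transactions if t["entity_id"] == entity["id"]]

print(transactions_for(entity, [transaction]))
```

The design choice is that full-text search only has to cover the comparatively small set of names and addresses, while the bulk of the records live in the relational store.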
Addresses are a key part of the data. They are standardized with open-source implementations of U.S. Postal Service guidelines. We plan to build on this by geocoding addresses this year and enabling geographic searches. We currently include only U.S. data at the national, state and local level, but we see opportunities to apply this to data from other countries. We’re excited about connecting with journalists in other countries who are interested in similar tools.
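As a toy example of what USPS-style standardization does, the sketch below uppercases an address and applies a few of the official street-suffix abbreviations. Real projects typically lean on open-source parsers (the usaddress library is one example); this handful of mappings is only illustrative.

```python
# Toy sketch of USPS-style address standardization: uppercase the address
# and abbreviate common street suffixes. The mapping table is a small
# illustrative subset, not a complete USPS implementation.

USPS_SUFFIXES = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "ROAD": "RD",
    "DRIVE": "DR",
    "LANE": "LN",
}

def standardize(address: str) -> str:
    """Uppercase, strip commas, and abbreviate known street suffixes."""
    tokens = address.upper().replace(",", "").split()
    return " ".join(USPS_SUFFIXES.get(t, t) for t in tokens)

print(standardize("123 main Street, Springfield"))  # 123 MAIN ST SPRINGFIELD
```

Standardizing both sides of a comparison this way is what lets two differently formatted records resolve to the same address.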
Our open-source technologies include: Django for the backend and static pages, svelte.js for client-side search and the administrative tool for mapping the data, PostgreSQL to store the “data lake” and Elasticsearch for full-text search of entities. More details about the technology are in this article for Source, a community hub for data journalists and news developers.
What was the hardest part of this project?
The biggest technical challenge has been scaling and managing a corpus of data of more than half a billion records and growing our site and user base on a relatively modest budget.
But nearly every stage of this project offered unique challenges. At the onset, we needed to develop a shared vision for The Accountability Project among our team, mapping out an ambitious but manageable path for what we could realistically build with our resources and timeline. This involved identifying parts of our original plan that weren’t reasonable, managing our own expectations, and creating a clear plan for the specific datasets we wanted to make available and searchable in our database.
Many of the databases we’ve included in The Accountability Project required filing open records requests, scraping websites or extracting data from documents. Reviewing hundreds of data sets with a scrappy crew of fellows and data journalists has meant a lot of hard work to gather, check, and add the data to the site. To help keep things organized and on track, we keep a log of every database from request to upload, including the name of the person responsible for it. We’re software agnostic, as long as users can create a reproducible workflow or script.
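A request-to-upload log like the one described can be as simple as a CSV with one row per dataset. The columns below are an assumption about what such a log might track, not TAP's actual format.

```python
# Hypothetical sketch of a per-database tracking log: one row per dataset,
# from records request through upload. Column names are assumptions.
import csv
import io

log_entry = {
    "dataset": "IL voter registration",
    "source": "FOIA request",
    "requested": "2019-03-01",
    "uploaded": "2019-05-20",
    "owner": "jdoe",
    "status": "live",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=log_entry.keys())
writer.writeheader()
writer.writerow(log_entry)
print(buf.getvalue())
```

Keeping the log itself in a plain, scriptable format fits the same reproducibility rule applied to the data workflows.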
Our main challenge going forward is focusing on our audience strategy. We’re thinking of new opportunities and partnerships to expose more journalists to The Accountability Project and establish it as a go-to tool for reporters.
What can others learn from this project?
The Accountability Project was built to help journalists find stories they can’t find any other way. Users can discover relationships between people, addresses and companies, political groups, nonprofits and other entities that they wouldn’t otherwise have noticed, because those records weren’t previously searchable from one place.
Our experience can also help others interested in taking on similar projects.
Research what else is available. What does your project do that is unique? Find people and organizations that you can partner with instead of recreating the wheel.
Understand and communicate your shared vision from the outset.
Plan for the long, long term. If you want your project to live for a long time, make sure you have a sustainability plan to keep it going. It likely will take longer than you anticipate to launch. We’ve watched plenty of well-intentioned search projects die a slow death from lack of funding because it’s hard to find long-term support. We’re working to lower our operating costs, but also taking steps to make sure the data we’ve gathered helps others.
Have fun. Projects like this are an incredible amount of work. If you can’t have some enjoyment doing them, your product will not be as effective.