Public data is everywhere, collected by government agencies, advocacy groups, news organizations and an ever-expanding set of specialty sites. But that data is in too many different locations to easily search. The Accountability Project cuts across these data silos and gives journalists and the public a simple way to search huge volumes of public data about people and organizations.
So far, we’ve acquired, standardized and uploaded hundreds of databases, accounting for more than 1.5 billion records. The data has been used by journalists across the country to identify patterns and draw connections between people and organizations.
The goal of TAP was to provide research data to newsrooms and researchers that might otherwise not have access to such data. We have worked with several nonprofit newsrooms to help them with investigations using TAP data, including stories about dark money, nonprofits and health care.
We have nearly 20,000 unique users, with more than 400 sign-ups for data sets that require a login. Because the site is free, it is difficult for us to track how individual searches generate stories, but anecdotally, users say it is a valuable resource.
TAP is built on Amazon cloud technologies: an EC2 web server, a Postgres database on Amazon RDS, and a hosted Elasticsearch cluster from [elastic.co](https://elastic.co). Our data is stored as a data lake, with only a subset of fields as native Postgres columns. The full data is kept in JSONB.
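As a rough illustration of that data-lake pattern, each record can keep a few standardized columns for searching and filtering while the full original row rides along as a JSON blob. The sketch below is hypothetical (the column and dataset names are invented), and it uses Python's built-in sqlite3 as a stand-in so it runs anywhere; TAP itself stores the blob in Postgres JSONB.

```python
import json
import sqlite3

# Hypothetical sketch: a couple of standardized columns for querying,
# plus the untouched original row as a JSON blob. (TAP uses Postgres
# JSONB; sqlite3 is used here only so the sketch is self-contained.)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        dataset TEXT,   -- which source database the row came from
        name TEXT,      -- standardized name field
        record TEXT     -- full original row, serialized as JSON
    )
""")

raw_row = {"Contributor Name": "SMITH, JANE", "Amount": "250.00", "City": "Des Moines"}
conn.execute(
    "INSERT INTO records (dataset, name, record) VALUES (?, ?, ?)",
    ("ia_contributions", "jane smith", json.dumps(raw_row)),
)

# Query the standardized column, then pull any field back out of the blob.
name, blob = conn.execute("SELECT name, record FROM records").fetchone()
original = json.loads(blob)
print(name, original["Amount"])
```

The benefit of this layout is that only the fields worth standardizing need schema work up front; everything else in the source data survives intact and can be promoted to a real column later.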
The website itself is built with Django, and an administrative tool lets contributors tag and standardize key fields such as names, addresses, and dates. The search pages are built using Svelte.js.
The data is split into entities and transactions. Names and addresses are indexed and searched in Elasticsearch, while transactions are stored in PostgreSQL.
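The split above can be sketched as a simple function that turns one raw row into two pieces: a name-and-address document bound for the search index, and a transaction row bound for the relational store, linked by a shared id. All field names here are invented for illustration; this is not TAP's actual schema.

```python
# Hypothetical sketch of the entity/transaction split: searchable
# name-and-address documents go to Elasticsearch, transaction detail
# goes to PostgreSQL, linked by an entity id.

def split_record(raw, entity_id):
    """Split one raw row into an entity document and a transaction row."""
    entity_doc = {  # destined for the search index
        "id": entity_id,
        "name": raw["name"].strip().lower(),
        "address": raw["address"].strip().lower(),
    }
    transaction_row = {  # destined for the relational database
        "entity_id": entity_id,
        "amount": float(raw["amount"]),
        "date": raw["date"],
    }
    return entity_doc, transaction_row

entity, txn = split_record(
    {"name": " SMITH, JANE ", "address": "1 Main St",
     "amount": "250.00", "date": "2023-04-01"},
    entity_id=42,
)
print(entity["name"], txn["amount"])
```

Keeping fuzzy, text-heavy fields in the search engine and exact, numeric fields in SQL lets each system do what it is best at, with the id joining results back together.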
Data is tested on a staging site, so we can check that search results work as intended before publishing to the live site.
Many of the TAP’s databases required filing open records requests or extracting data from documents. Reviewing hundreds of data sets with a scrappy crew of fellows and data journalists has meant a lot of hard work to gather, check, and add the data to the site. To help keep things organized and on track, we keep a log of every database from request to upload, including the name of the person responsible for it. We’re software agnostic, as long as users can create a reproducible workflow or script, but most of the data processing is done in R or Python.
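A reproducible cleaning script of the kind described above might look like the following minimal Python sketch: read a raw CSV, standardize the name, date, and amount fields, and emit tidy rows. The input format and field names are invented for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: inconsistent casing and a US-style date format.
RAW = """name,date,amount
"SMITH, JANE",04/01/2023,250.00
"DOE, JOHN",05/12/2023,1000.00
"""

def clean_row(row):
    """Standardize one record in place and return it."""
    row["name"] = " ".join(row["name"].split()).title()          # tidy whitespace and case
    row["date"] = datetime.strptime(row["date"], "%m/%d/%Y").date().isoformat()  # ISO dates
    row["amount"] = f'{float(row["amount"]):.2f}'                # normalize to two decimals
    return row

reader = csv.DictReader(io.StringIO(RAW))
cleaned = [clean_row(row) for row in reader]
print(cleaned[0])
```

Because the whole transformation lives in one script, anyone on the team can rerun it against a fresh export and get identical output, which is the point of insisting on a reproducible workflow.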
Context about the project:
We took on what seemed like an impossible task.
Every stage of this project offered unique challenges. At the outset, we needed to develop a shared vision for The Accountability Project among our team, mapping out an ambitious but manageable path for what we could realistically build with our resources and timeline.
Every TAP data set originated in a different format and under different open records laws. Very few of the data sets we included were readily available online. We had to write scraping tools to pull data from government websites and even, in some cases, documents. We faced long negotiations with some agencies over access or excessive fees.
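When data is published only as an HTML page, a scraper like the ones described above pulls the table apart into rows. Here is a minimal, hypothetical sketch using Python's standard-library HTML parser; a real scraper would fetch the page over HTTP first, but the markup is inlined here so the example is self-contained.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, standing in for a government website's
# HTML table of records.
PAGE = """
<table>
  <tr><td>Acme Lobbying LLC</td><td>$5,000</td></tr>
  <tr><td>Widget PAC</td><td>$12,500</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)
```

From there the rows feed straight into the same cleaning and standardization steps used for data obtained through records requests.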
Once we acquired the data, the team had to develop creative solutions to scraping and standardizing hundreds of different data sets.
Now that CPI has taken over TAP, we are seeking ways to connect with local newsrooms, which often lack research resources. We’re exploring new opportunities and partnerships that could help expose more journalists to The Accountability Project and establish it as a go-to tool for reporters.
At the same time, we plan to add new features to the site, including geographical searches and the ability for users to run their own data against what is on the site. We’re also planning to conduct more user experience research to get a better understanding of how people use the site and how we can improve its functionality.
What can other journalists learn from this project?
TAP was built to help journalists find stories they can’t find any other way. Users can discover relationships between people, addresses and companies, political groups, nonprofits and other entities that they wouldn’t have noticed otherwise, because those records weren’t previously searchable from one place.
We hope our experience in developing the tool also can serve to help others interested in taking on similar projects.
1. Research what else is available. What does your project do that is unique? Find people and organizations that you can partner with instead of recreating the wheel.
2. Understand and communicate your shared vision from the outset.
3. Plan for the long, long term. If you want your project to live for a long time, make sure you have a sustainability plan to keep it going. It likely will take longer than you anticipate to launch. We’ve watched plenty of well-intentioned search projects die a slow death from lack of funding because it’s hard to find long-term support. We’re working to lower our operating costs, but also taking steps to make sure the data we’ve gathered helps others.
4. Use a structured workflow. Even if your workflow changes, it will keep everyone on the same page and moving toward the same goal.
5. Have fun. Projects like this are an incredible amount of work. If you can’t have some enjoyment doing them, your product will not be as effective.