2020 Shortlist

Nonprofit Explorer Full-Text Search

Category: Open data

Country/area: United States

Organisation: ProPublica

Organisation size: Big

Publication date: 6 Jun 2019

Credit: Ken Schwencke

Project description:

The IRS publishes millions of XML files with the full suite of information found on a nonprofit’s tax filings, as long as they were filed electronically (which . But a reporter asked if we could search through them to find the names of specific people or companies, and we realized there were no free tools to search through their contents — so we fixed that. We added the ability to search anywhere in the text of more than 3 million 990s, giving researchers, reporters and anyone else the ability to dig deep into these records and unearth hidden relationships between

Impact reached:

We’ve heard from journalists from all ends of the spectrum that this tool has helped them uncover hidden donors and dark money in politics. BuzzFeed used it to find information ranging from nonprofits with connections to Jeffrey Epstein to a wealthy conservative whose private foundation lists an investment in The Federalist, helping solve a perennial question. The real depth of the tool’s impact isn’t known, but as the only free tool of its kind, and as one of the most well-trafficked parts of a well-used news app, it is likely to be quite large.

Techniques/technologies used:

For a while, ProPublica has allowed people to search for company names and eventually the names of nonprofit employees from our free app. But at some point we decided: why not dump the entire text of the tax forms into Elasticsearch? So we did just that — took the files, stripped out the XML tags (which make up the bulk of the file size), and dumped them all into Elasticsearch for indexing.
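
The tag-stripping step described above can be sketched in a few lines of Python. This is a minimal illustration, not ProPublica's actual code: the sample XML, the function names xml_to_text and to_document, and the document shape are all assumptions made for the example, and the step that sends the document to Elasticsearch is left out.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string: str) -> str:
    """Return only the text content of an XML document, dropping all tags."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element in document order and yields its text,
    # including whitespace-only chunks between tags, which we filter out.
    pieces = (chunk.strip() for chunk in root.itertext())
    return " ".join(piece for piece in pieces if piece)


def to_document(filing_id: str, xml_string: str) -> dict:
    """Build a plain dict that a search index could store (hypothetical shape)."""
    return {"_id": filing_id, "text": xml_to_text(xml_string)}


# Made-up sample; real filings are far larger and more deeply nested.
SAMPLE = """<Return xmlns="http://www.irs.gov/efile">
  <Filer><BusinessName>Example Foundation</BusinessName></Filer>
  <Officer><PersonNm>Jane Doe</PersonNm><TitleTxt>Treasurer</TitleTxt></Officer>
</Return>"""

if __name__ == "__main__":
    print(xml_to_text(SAMPLE))  # Example Foundation Jane Doe Treasurer
```

Because every value ends up in one text field, a search for a person's name matches no matter which part of the form it appeared in, which is the trade-off the next paragraph describes.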

It’s a simple solution, but deceptively powerful. We could have created more structured search engines: for grants, contractors or conflicts of interest. But in the end, giving people the ability to run searches across the whole set proved not just structurally easier, but more versatile.

What was the hardest part of this project?

Processing the entire pile of 990s (millions of files, tens of millions of individual forms, and gigabytes on gigabytes in size) is no small task. It takes hours to reprocess from scratch, so formulating a way to create an additive search index (instead of destroying and recreating one, as many Elasticsearch setups do) was a challenge. We had to create a way to be sure that we had an index that was up-to-date at all times, and to create redundancies in case an indexing operation failed.
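
One way to picture an additive index is a loop that skips filings already recorded as indexed and keeps a list of failures to retry. The sketch below is an assumption about how such a scheme could work, not a description of ProPublica's pipeline: index_new_filings, the indexed_ids set and the index_one callback are hypothetical names, and an in-memory dict stands in for Elasticsearch.

```python
from typing import Callable, Iterable, List, Set, Tuple


def index_new_filings(
    filings: Iterable[Tuple[str, str]],
    indexed_ids: Set[str],
    index_one: Callable[[str, str], None],
) -> List[str]:
    """Index only filings not seen before; return the IDs that failed.

    indexed_ids is updated in place, so a later run can pick up where
    this one stopped instead of rebuilding the index from scratch.
    """
    failed: List[str] = []
    for filing_id, text in filings:
        if filing_id in indexed_ids:
            continue  # already in the index, nothing to do
        try:
            index_one(filing_id, text)
        except Exception:
            # Leave the ID out of indexed_ids so the next run retries it.
            failed.append(filing_id)
            continue
        indexed_ids.add(filing_id)
    return failed


if __name__ == "__main__":
    fake_index = {}  # stand-in for a real search index
    seen: Set[str] = set()
    batch = [("filing-1", "Example Foundation"), ("filing-2", "Jane Doe")]
    index_new_filings(batch, seen, fake_index.__setitem__)
    # Running the same batch plus one new filing only indexes the new one.
    batch.append(("filing-3", "Another Nonprofit"))
    index_new_filings(batch, seen, fake_index.__setitem__)
    print(sorted(fake_index))
```

Using the filing ID as the document key means a retried write overwrites rather than duplicates, so a failed run can be repeated safely. In a real deployment the set of indexed IDs would need to be stored somewhere durable.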

What can others learn from this project?

I truly believe the beauty is in the simplicity: dumping a bunch of text into Elasticsearch is exactly what it was meant for, and what better to dump than millions of government records that are otherwise not readily searchable?

Project links: