2020 Shortlist

Pratheek Rebala

Category: Young journalist

Country/area: United States

Organisation: The Center for Public Integrity

Organisation size: Small

Cover letter:

I’m a 23-year-old data journalist working at the Center for Public Integrity, a nonprofit investigative newsroom based in Washington, D.C. I have been practicing data journalism since I was 18 years old. I joined the data team at TIME Magazine during the summer of my freshman year and worked there until I graduated from college.

I have always worked on small teams, which has taught me how to do more with less. This has also given me the opportunity to learn how to work with every level of the technical stack and all aspects of the reporting process.

At the Center, my job primarily involves working with reporters to identify how data can be used to enrich their investigative projects, and helping report and communicate their findings. Another important aspect of my job involves building tools that make it easier for state and local reporters to find and report stories affecting their communities. My most recent project, the “Copy, Paste, Legislate” tool, helps local reporters identify when special interests and lobbying groups are pushing model legislation in their state legislatures. I hope to build similar tools in the future that will further the Center’s mission to hold people in positions of power accountable to the public.

Like most others in this field, I have largely learned through experimentation, and I am incredibly grateful for folks like my mentor (and first editor) Christopher Wilson, who took a chance on an 18-year-old with no real portfolio. His guidance and mentorship have been critical to my personal and professional development. I hope that I can pay this forward and help other aspiring data journalists break into this field. As in the broader industry, people of color are underrepresented in the data journalism community. I hope that I can provide guidance and support to other aspiring data journalists of color. At the Center, I am hoping to make a small impact towards this goal by advocating for a more equitable workplace as a member of our union bargaining committee and the Center’s diversity committee.

Description of portfolio:

My most recent project was part of the Center’s collaboration with USA Today to study the phenomenon of “model legislation” – the practice where legislators introduce bills that are usually written by outside organizations or special interest groups. This project was recently selected as a finalist for this year’s Goldsmith Prize for Investigative Reporting.

The Center had done reporting in the past on groups that sponsored model legislation. However, most of that reporting was based on tips from human sources. Kytja Weir, CPI’s former state politics editor, wanted to see if there was a way to identify model legislation that was sponsored by groups not advertising their work (like ALEC), and automate the process so we could organically identify model legislation as it’s being debated.

The tool I built for this project was used to identify tens of thousands of bills that had unoriginal language. Our reporters used the leads from this tool to investigate numerous laws, such as those loosening regulations on the sale of defective cars; “religious freedom” laws that promote religious discrimination; laws banning (and allowing) the sale of puppies; and bills promoting work requirements for those receiving SNAP benefits, among many others.

This project was a pair-wise similarity problem. For 1.1 million bills, this involved roughly 648 billion comparisons (comparing every pair of bills). We also had to perform these comparisons for every future bill that would be introduced across all 50 states. The scale of this project presented a huge hurdle.
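The comparison count comes from the number of unordered pairs among n bills, n(n-1)/2. The short sketch below works through that arithmetic; the function name `pair_count` and the second corpus size are illustrative assumptions, not figures from the project. With exactly 1.1 million bills the formula gives about 605 billion pairs, so the quoted 648 billion suggests that "1.1 million" is a rounded figure for a corpus closer to 1.14 million bills.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Exactly 1.1 million bills.
print(f"{pair_count(1_100_000):,}")

# A hypothetical corpus size (an assumption for illustration)
# that lands close to the 648 billion quoted in the text.
print(f"{pair_count(1_138_420):,}")
```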

After much trial and error, I developed a system that decomposed every bill into “shingles” consisting of 4-7 words each. Then, I generated a MinHash on these shingles and queried those hashes and shingles using Elasticsearch. This generated a “similarity score” for each bill. Next, I used a graph clustering algorithm to generate “clusters” of similar bills. This process allowed for the discovery of bills that were models but had been heavily modified, and it greatly limited the work required for each search. Finally, the clustering process allowed reporters to search for results in an intuitive manner.
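The general idea behind shingling, MinHash, and clustering can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production system: it uses only the standard library, compares every pair directly instead of querying Elasticsearch, and uses connected components (via union-find) as a stand-in for the graph clustering algorithm, which the text does not name. All function names, the shingle size, the number of hashes, and the threshold are hypothetical.

```python
import hashlib
import re
from itertools import combinations


def shingles(text, k=5):
    """Split text into overlapping k-word shingles (k=5 is an assumed value)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle; the seed selects the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(_seeded_hash(seed, s) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def cluster_bills(bills, threshold=0.5):
    """Group bill ids whose estimated similarity meets the threshold.

    Assumption: connected components stand in for the unnamed
    graph clustering algorithm. The brute-force pair loop is only
    workable for small inputs.
    """
    sigs = {bid: minhash_signature(shingles(text)) for bid, text in bills.items()}
    parent = {bid: bid for bid in bills}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(bills, 2):
        if estimated_similarity(sigs[a], sigs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for bid in bills:
        groups.setdefault(find(bid), set()).add(bid)
    return list(groups.values())
```

Calling `cluster_bills({"A": text_a, "B": text_b, "C": text_c})` returns a list of sets of bill ids. At real scale, the all-pairs loop is exactly the cost that an index-based lookup is meant to avoid.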

This entire process, from start to finish for all 1.1 million bills, can be run in under 8 hours, which allows us to update this database once a week. The final product is now available for anyone to look for model legislation in their state or around a specific topic.

Another project that I worked on was the Center’s investigation into the oil and gas boom in the Permian Basin. I created a succinct way to highlight the magnitude of change this boom was bringing to West Texas.

I developed a tool that queries daily satellite imagery around each of the thousands of oil- and gas-related facilities permitted in the West Texas region. I then used a pixel-diffing algorithm to identify a set of “before” and “after” images for the construction of each permitted facility. Using this database of images, I created an intro video that showcases the unprecedented magnitude and pace of oil and gas development in West Texas.
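A pixel-diffing step of this kind can be sketched as follows. This is a simplified illustration under assumptions, not the tool itself: images are modeled as small grids of grayscale values (0-255) instead of real satellite tiles, and the function names, the per-pixel tolerance, and the changed-area threshold are all hypothetical. It scans a chronological series of frames and returns the first consecutive pair that differs enough to suggest construction.

```python
def changed_fraction(before, after, tolerance=25):
    """Fraction of pixels whose grayscale value differs by more than tolerance."""
    if len(before) != len(after) or any(
        len(row_b) != len(row_a) for row_b, row_a in zip(before, after)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in before)
    if total == 0:
        raise ValueError("images must not be empty")
    changed = sum(
        abs(p_b - p_a) > tolerance
        for row_b, row_a in zip(before, after)
        for p_b, p_a in zip(row_b, row_a)
    )
    return changed / total


def pick_before_after(frames, min_changed=0.2, tolerance=25):
    """Return (before_date, after_date) for the first large change, else None.

    frames: chronologically ordered list of (date, image) tuples.
    min_changed and tolerance are assumed values for illustration.
    """
    for (date_prev, img_prev), (date_next, img_next) in zip(frames, frames[1:]):
        if changed_fraction(img_prev, img_next, tolerance) >= min_changed:
            return date_prev, date_next
    return None


# Hypothetical 2x2 frames: bare ground, slight sensor noise, then new construction.
frames = [
    ("2018-01", [[100, 100], [100, 100]]),
    ("2018-02", [[105, 98], [100, 102]]),
    ("2018-03", [[200, 210], [100, 30]]),
]
print(pick_before_after(frames))
```

The tolerance keeps small brightness shifts (lighting, haze, sensor noise) from counting as change; real imagery would likely also need cloud filtering and alignment, which this sketch leaves out.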

For the same story, I also developed other graphics showcasing the lack of environmental monitoring in this region, and how some of the facilities could be contributing to an increase in earthquakes. Finally, another pair of graphics showed how nearly all of the resources extracted in this region are being shipped abroad, despite being sold to local communities as a means to make the U.S. less dependent on foreign oil.

Project links: