The Guardian reviewed EPA records on 140,000 public water systems across the US and cross-referenced them with demographic data to determine which communities are most exposed to contaminated drinking water.
The investigation revealed that water systems serving counties with large Latino communities violate contamination rules at twice the rate of those serving the rest of the country. Low-income and rural counties were also found to be at higher risk.
The data investigation was followed up with on-the-ground reporting from California’s Central Valley, which was identified as a hotspot for nitrate contamination from industrial agriculture.
To our knowledge, our analysis was the first to show the disproportionate impact of drinking water contamination on Latino communities at a national level.
The main news article and the dispatch from the Central Valley were viewed more than 100,000 times on the Guardian website (this is relatively high for a US story that is not promoted on the Guardian’s UK or international fronts). The main article was also translated into Spanish and published in La Opinión, the largest Spanish-language newspaper in the US. The English-language version of the article was republished in Consumer Reports, broadening the audience.
On social media, the story was picked up by a range of environmental groups, including the Sierra Club, The Water Desk, and the Environmental Protection Network.
All data analysis was conducted in Node.js. I started from a dataset of violation points by water system, covering the more than 140,000 systems in the US. Not every system was associated with a county code, so I wrote a script to assign county codes to the unmatched systems, for which only the name of a town or city and the state were known.
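As a rough illustration of this kind of matching step (not the script used for the story), the sketch below fills in missing county codes from a lookup keyed on town or city name and state. The field names (`city`, `state`, `countyFips`), the `assignCountyCodes` helper and the tiny lookup table are all assumptions made for the example; a real lookup would come from a full gazetteer of places and counties.

```javascript
// Hypothetical lookup table: "city|STATE" -> five-digit county FIPS code.
// A real table would be built from a complete place-to-county gazetteer.
const placeToCounty = new Map([
  ['fresno|CA', '06019'],
  ['visalia|CA', '06107'],
]);

// Normalise place names so that stray spaces and casing do not block a match.
const normalise = (s) => (s ?? '').trim().toLowerCase().replace(/\s+/g, ' ');

// Returns a new array in which systems without a county code get one
// from the lookup where possible; unmatched systems keep countyFips = null.
function assignCountyCodes(systems, lookup) {
  return systems.map((system) => {
    if (system.countyFips) return system; // already matched, leave untouched
    const key = `${normalise(system.city)}|${(system.state ?? '').toUpperCase()}`;
    return { ...system, countyFips: lookup.get(key) ?? null };
  });
}

// Made-up sample records, for illustration only.
const systems = [
  { id: 'A', city: 'Fresno ', state: 'ca', countyFips: null },
  { id: 'B', city: 'Visalia', state: 'CA', countyFips: '06107' },
  { id: 'C', city: 'Nowhere', state: 'CA', countyFips: null },
];

const matched = assignCountyCodes(systems, placeToCounty);
console.log(matched.map((s) => `${s.id}: ${s.countyFips}`));
```

Systems that still have no county code after this pass (like `C` above) would need manual review or fuzzier matching.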
Another important step of the data clean-up was matching specific violation rule codes to broad categories of contaminants.
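A step like this can be sketched as a simple lookup from rule code to category. The rule codes below (`R101` and so on) are placeholders invented for the example, not real EPA codes, and the category names and the `categorise` helper are likewise assumptions.

```javascript
// Placeholder rule codes mapped to broad contaminant categories.
// These codes are invented for illustration and are NOT real EPA rule codes.
const ruleCategories = new Map([
  ['R101', 'nitrates'],
  ['R102', 'arsenic'],
  ['R201', 'coliform bacteria'],
]);

// Attach a broad category to each violation record; unknown codes are
// flagged rather than dropped so they can be reviewed by hand.
function categorise(violations, categories) {
  return violations.map((v) => ({
    ...v,
    category: categories.get(v.ruleCode) ?? 'uncategorised',
  }));
}

// Made-up sample records, for illustration only.
const violations = [
  { systemId: 'A', ruleCode: 'R101' },
  { systemId: 'B', ruleCode: 'R201' },
  { systemId: 'C', ruleCode: 'R999' },
];

const categorised = categorise(violations, ruleCategories);
console.log(categorised);
```

Keeping an explicit `uncategorised` bucket makes it easy to count how many violations the mapping does not yet cover.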
We then investigated numerous possible ways to analyse and present the data – from regression analysis to multivariate maps, exploring a range of possible demographic and socioeconomic factors that might be associated with poor water quality.
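To give a flavour of the simplest form of regression analysis, the sketch below fits an ordinary least-squares line relating one county-level factor to a violation metric. It is a generic textbook calculation, not the model used in the investigation, and the `linearRegression` helper, the variable names and the toy numbers are all assumptions made up for the example.

```javascript
// Ordinary least-squares fit of y = intercept + slope * x for paired arrays.
function linearRegression(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('need two arrays of equal length with at least two points');
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let sxy = 0; // sum of (x - meanX) * (y - meanY)
  let sxx = 0; // sum of (x - meanX)^2
  for (let i = 0; i < n; i += 1) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }
  if (sxx === 0) throw new Error('x values must not all be identical');

  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Toy, made-up numbers: a demographic share (%) per county and a
// hypothetical violation metric for the same counties.
const sharePercent = [0, 10, 20, 30];
const violationMetric = [1, 3, 5, 7];

const fit = linearRegression(sharePercent, violationMetric);
console.log(fit);
```

A single-variable fit like this only hints at an association; comparing several factors at once would need a multivariate model.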
The final maps were created in D3.js, drawn to an HTML canvas and exported as high-resolution PNGs to be used in a “scrollytelling” interactive, faded in and out with CSS animations.
What was the hardest part of this project?
One of the biggest challenges was to clean the original dataset of 140,000 water systems and create meaningful county-level summary views. I explored countless different ways of slicing and dicing the data before deciding to use the average number of “violation points” by water system in each county as the main metric for the piece.
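The county-level metric described above, average violation points per water system, could be computed along these lines. This is a minimal sketch: the record shape (`countyFips`, `violationPoints`), the `averagePointsByCounty` helper and the sample values are assumptions for illustration.

```javascript
// Average "violation points" per water system, grouped by county.
// Systems without a county code are skipped.
function averagePointsByCounty(systems) {
  const totals = new Map(); // countyFips -> { points, systems }
  for (const { countyFips, violationPoints } of systems) {
    if (!countyFips) continue;
    const t = totals.get(countyFips) ?? { points: 0, systems: 0 };
    t.points += violationPoints;
    t.systems += 1;
    totals.set(countyFips, t);
  }

  const averages = new Map(); // countyFips -> mean points per system
  for (const [fips, t] of totals) {
    averages.set(fips, t.points / t.systems);
  }
  return averages;
}

// Made-up sample records, for illustration only.
const waterSystems = [
  { id: 'A', countyFips: '06019', violationPoints: 10 },
  { id: 'B', countyFips: '06019', violationPoints: 0 },
  { id: 'C', countyFips: '06019', violationPoints: 5 },
  { id: 'D', countyFips: '06107', violationPoints: 4 },
  { id: 'E', countyFips: null, violationPoints: 99 },
];

const countyAverages = averagePointsByCounty(waterSystems);
console.log(countyAverages); // '06019' averages 5, '06107' averages 4
```

Note that systems with zero violation points still count towards the denominator, so a county with many clean systems scores lower than one with a few heavy violators.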
To highlight the prevalence of individual contaminants (eg “nitrates”), I calculated and mapped what share of systems in each county had reported the contaminant in question.
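A share like this can be sketched as the number of systems in a county with at least one violation in the category, divided by all systems in that county. The record shapes, the `shareOfSystemsWithContaminant` helper and the sample data below are assumptions for illustration.

```javascript
// For each county, the share of water systems that reported at least one
// violation in the given contaminant category.
function shareOfSystemsWithContaminant(systems, violations, category) {
  // A system counts once, however many violations it has in the category.
  const affected = new Set(
    violations.filter((v) => v.category === category).map((v) => v.systemId)
  );

  const counts = new Map(); // countyFips -> { total, affected }
  for (const { id, countyFips } of systems) {
    const c = counts.get(countyFips) ?? { total: 0, affected: 0 };
    c.total += 1;
    if (affected.has(id)) c.affected += 1;
    counts.set(countyFips, c);
  }

  const shares = new Map(); // countyFips -> share between 0 and 1
  for (const [fips, c] of counts) {
    shares.set(fips, c.affected / c.total);
  }
  return shares;
}

// Made-up sample records, for illustration only.
const countySystems = [
  { id: 'A', countyFips: '06019' },
  { id: 'B', countyFips: '06019' },
  { id: 'C', countyFips: '06019' },
  { id: 'D', countyFips: '06019' },
  { id: 'E', countyFips: '06107' },
];
const systemViolations = [
  { systemId: 'A', category: 'nitrates' },
  { systemId: 'A', category: 'nitrates' }, // repeat violation, counted once
  { systemId: 'B', category: 'arsenic' },
  { systemId: 'E', category: 'nitrates' },
];

const nitrateShares = shareOfSystemsWithContaminant(
  countySystems,
  systemViolations,
  'nitrates'
);
console.log(nitrateShares); // '06019' is 0.25, '06107' is 1
```

Counties with very few systems can swing to extreme shares (as `06107` does here), which is worth bearing in mind when reading a map of this metric.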
Another challenge was to decide which variables and patterns in the data we wanted to explore in depth, as there were numerous possible leads. For example, early research suggested that the smallest water systems violated the rules significantly more often than the largest ones, and we were expecting this to become a major news line. But this angle turned out to be only partially borne out by the data, and we ended up focussing on other patterns (specifically, race and income).
I also explored various chart formats that didn’t make it into the final piece, ranging from scatter plots of numerous metrics to multivariate maps.
What can others learn from this project?
For me, the two most valuable lessons to come out of this project were the following:
It highlighted the fact that extremely important and even shocking stories can hide “in plain sight” within a fully public dataset – such as the EPA’s records of water rule violations – and remain hidden if the data isn’t cleaned, summarised and reworked into a more accessible format.
It is worth pulling in as many supplementary datasets and variables as possible, as the strongest patterns aren’t always immediately obvious, and anecdotal evidence, even when it comes from experts, could limit and bias your investigation.