Using four years of data and statistical techniques, we discovered that after the U.S. Supreme Court invalidated a key provision of the Voting Rights Act of 1965, leading to the closure of more than two hundred Georgia voting locations, an estimated 50,000 to 80,000 Georgia voters didn’t vote in the 2018 gubernatorial election because of their newly increased distance to the polls.
Our work was cited by the NAACP’s Legal Defense Fund in their January 2020 document “Democracy Diminished.” It was also cited by presidential candidate Elizabeth Warren, former presidential candidate Julián Castro, former Georgia gubernatorial candidate Stacey Abrams, and Terri Sewell, D-Alabama, the House sponsor of a bill intended to restore the provisions cut from the Voting Rights Act of 1965 by the Supreme Court in 2013.
The analysis was done primarily in R, though we also used the Census geocoder. We performed and vetted a great deal of spatial statistics for the analysis, which required determining voters’ home location, voters’ precinct location, and the distance between the two. We used voter history data, demographic data from the voter rolls, shapefile data from the State of Georgia, and demographic data from the Census to create a model of how distance to the polls affects voter turnout.
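As a rough illustration of the distance step, the sketch below computes the great-circle (haversine) distance between a geocoded home and a polling place. It is written in Python rather than R, the coordinates are made up, and the choice of straight-line distance is an assumption: the text above does not say whether straight-line, road, or travel-time distance was used.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical coordinates, for illustration only (not taken from the project data).
home = (33.7490, -84.3880)
polling_place = (33.7756, -84.3963)
print(round(haversine_km(*home, *polling_place), 2), "km")
```

Straight-line distance is the simplest choice; an analysis concerned with actual travel burden might prefer road-network distance instead.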
Because voter behavior depends heavily on age, race, county, and prior voting history, we used voter history files and demographic information from the voter rolls to account for those variables.
We grouped voters by their age, race, income, county and prior voting history, calculating voter turnout in those groups for those near and far from their polling places, taking the difference between those rates to be the effect of distance on voting.
To say voting decreased because distance increased, however, is a causal claim, which is typically forbidden in the statistical sciences. We used a method for causal inference, called do-calculus, to determine which variables we needed to control for in order to properly estimate the causal effect of distance on voter turnout. We also consulted with political scientists and statisticians to verify and vet our methodology.
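For readers unfamiliar with this framework, one standard result it yields is the backdoor adjustment formula. The notation here is ours, not the project's: D is distance to the polls, Y is whether a person voted, and Z is the set of controls (for example age, race, income, county, and prior voting history). Assuming Z blocks every backdoor path between D and Y, which the text above does not state explicitly, the formula reads:

```latex
P(Y = 1 \mid \mathrm{do}(D = d)) = \sum_{z} P(Y = 1 \mid D = d, Z = z)\, P(Z = z)
```

Under that assumption, the causal effect of moving a voter from near to far is the difference between this quantity evaluated at d = far and at d = near.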
What was the hardest part of this project?
Creating a robust methodology with easily understood results was by far the hardest part of the project.
The underlying analysis behind this project would fit perfectly in a research-level social science journal. Deciding on a methodology that would be scientifically accurate but comprehensible to a lay reader, to whom our work is ultimately tailored, was not easy. It meant simultaneously meeting the standards of professional science and professional journalism, which, while overlapping, differ significantly. Originally, we tried more advanced statistical regression methods. However, we decided against them because they presented another obstacle for a reader to overcome in understanding the results and the story. Instead, we used those models as sanity checks for our simpler model.
We spent months selecting a model that was both accurate and understandable, and we consulted with academic political scientists and statisticians to ensure model validity. Then we spent more months translating our scientific results into a story, supplementing the analysis with the stories of real people who had difficulty voting because of the changes we identified. We ended up with one story that fused two worlds, not two stories crammed into one.
Creating research-level statistical analyses aimed at lay people is a difficult task, but it’s one that a growing subset of data journalists are, correctly, taking on more and more often. Selecting our work for the Sigma Awards would help investigative stories that originate in the statistical analysis of data become more accepted in newsrooms around the world.
What can others learn from this project?
First, that the Supreme Court’s 2013 Shelby County v. Holder decision invalidating a key provision of the Voting Rights Act had the effect of dissuading voters from engaging in democracy, something the Supreme Court had given assurances would not happen. Our project made clear that the Supreme Court’s decision had a negative impact on voter turnout in Georgia, an important battleground state in the 2020 U.S. presidential election.
From a news-making perspective, others can learn that there is space for inference in a newsroom. While machine learning and exploratory statistics have been rightly accepted as useful tools for finding stories worth telling, inferential statistics largely has not. Partly because statistical inference is scientifically difficult, and partly because journalists are uncomfortable publishing stories that cannot be entirely verified as correct (“all models are wrong, but some are useful”), statistical inference is seldom used as the basis for news-heavy stories. However, news organizations already use inference in a variety of ways. Polls released in conjunction with polling agencies, the “needle” at the New York Times, and electoral probabilities at FiveThirtyEight are all forms of inferential statistics. Investigative projects whose discovery of neglect and abuse comes from statistical inference are as useful as scientific results that stem from statistical inference. Our project shows that this work is viable in a newsroom.
Admittedly, inference is difficult. Still, others can learn one proper way to do spatial statistics from our project. The methodology behind our work is hosted on a GitHub page in Jupyter Notebook format, so that other newsrooms with interest in this kind of work can see what we did and how we did it. While their projects won’t be identical, useful similarities will remain.