We created a novel election-forecasting model that combined public polling, past election results and demographic data to predict the 2020 US House, Senate and presidential elections. Our approach drew on recent innovations in machine learning, Bayesian statistics, political science and polling to provide a reasonable estimate of the range of uncertainty in the election and to see through the noise of constant campaign coverage. We presented the forecast with sophisticated interactive data visualisations on a custom-built website. Although this year's polls again showed a pro-Democratic bias, our presidential model nevertheless finished first among media outlets for predicting state-level vote shares.
The project was a massive undertaking, completed in a very short period of time by a relatively small team of five people (three visual journalists and two data journalists). Cumulatively, the 2020 forecast pages became the most trafficked Economist.com page ever, with 20 million views over five months. The forecasting models also represented The Economist’s most sophisticated statistical modelling effort to date.
Our team made several contributions to the public conversation about polling, including identifying the ultimate cause of the year’s polling error long before it occurred. Because of our data team’s constant dialogue with the survey-research, political-science and academic statistical communities, we were able to include an explicit modelling adjustment for polls that had too many Democratic or Republican voters in them, which helped our model perform better in states with biased samples. No other outlet so openly acknowledged polling’s vulnerability to so-called “partisan non-response bias.”
In developing our forecasting models, we also contributed to academic forecasting efforts and open science. One of our data journalists co-authored several academic journal articles about the forecast, which was built in a newly developed and very sophisticated programming language for fully Bayesian statistics. This allowed us to formally acknowledge more sources of uncertainty in the polling and non-polling data, which is both a valuable methodological step and a way to give readers a more honest representation of our very complex and often unpredictable world.
In a first for any media outlet, we also published the full data and code behind our presidential election model on GitHub. This allowed researchers to download the model and run it themselves, sending any feedback directly to our data team. The GitHub repo has over 1,200 “stars,” 167 “forks,” and dozens of issues where our team has engaged with the
Our modelling efforts used several state-of-the-art machine learning and Bayesian statistical algorithms to predict the election results. To produce election predictions from the so-called “fundamentals” (economic and political indicators that have a record of predicting election results), we used several R packages to train regularised regression models and run cross-validation, both to tune out noisy indicators and to fully explore the potential error of predictions made for the future. To integrate the polls, we trained a fully Bayesian statistical model with Markov chain Monte Carlo algorithms in the Stan programming language. This allowed us to make adjustments and add uncertainty for virtually every variable and every step in our modelling design: from our “fundamentals” forecast, to whether polls were conducted online or via telephone, to the demographic similarities between states when simulating possible outcomes.
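As a rough illustration of the “fundamentals” step, the sketch below fits a regularised regression and picks the penalty by leave-one-out cross-validation. It is written in Python rather than R, uses ridge regression as a stand-in (the specific form of regularisation is an assumption), and runs on synthetic data; every name and number in it is hypothetical and it is not the code behind the forecast.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients; the intercept (column 0) is not penalised."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def loo_cv_error(X, y, lam):
    """Mean squared leave-one-out prediction error for a given penalty."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # hold out election i
        beta = ridge_fit(X[keep], y[keep], lam)
        errors.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errors))

# Synthetic stand-in data (assumption): 18 past elections, 6 indicators,
# of which only the first two actually drive the vote share.
rng = np.random.default_rng(0)
n_elections, n_indicators = 18, 6
indicators = rng.normal(size=(n_elections, n_indicators))
vote_share = (50 + 2.0 * indicators[:, 0] - 1.5 * indicators[:, 1]
              + rng.normal(scale=2.0, size=n_elections))
X = np.column_stack([np.ones(n_elections), indicators])

# Try a grid of penalties and keep the one with the lowest held-out error.
penalties = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: loo_cv_error(X, vote_share, lam) for lam in penalties}
best_penalty = min(cv_errors, key=cv_errors.get)
best_beta = ridge_fit(X, vote_share, best_penalty)

# The held-out error doubles as an estimate of out-of-sample uncertainty.
cv_rmse = cv_errors[best_penalty] ** 0.5
```

The held-out error serves two purposes here: it selects how strongly to shrink the coefficients of noisy indicators, and it gives an estimate of how far predictions for an unseen election might miss.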
The text of the site was updated through Google Docs with a custom version of ArchieML that ran on the Google Apps Script platform. This added an option to the Google Docs menu bar to upload new copy directly to the CDN, without having to run command-line tools. We also added support for some Economist typographical conventions, such as small caps and paragraph flourishes.
A cron job on an AWS instance ran the models at 7:00 AM, noon and 6:00 PM, fetching the latest polls and generating new data. The resulting CSV files were synced to the CDN and made available to the frontend minutes later.
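A schedule like this is commonly written as the cron expression `0 7,12,18 * * *`, though the actual crontab entry and the server's time zone are assumptions here. The short Python sketch below, with hypothetical names, shows the same schedule as a function that returns the next update time, which is one way a frontend could tell readers when fresh data is due.

```python
from datetime import datetime, time, timedelta

# Daily run times taken from the description above (time zone is an assumption).
RUN_TIMES = [time(7, 0), time(12, 0), time(18, 0)]

def next_run(now):
    """Return the first scheduled run strictly after `now`."""
    for run_time in RUN_TIMES:
        candidate = datetime.combine(now.date(), run_time)
        if candidate > now:
            return candidate
    # All of today's runs have passed, so roll over to tomorrow's first run.
    return datetime.combine(now.date() + timedelta(days=1), RUN_TIMES[0])
```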
What was the hardest part of this project?
Our modelling efforts took a lot of time to get right, and even after hundreds of hours of collaboration, design, training and testing, the models were still imperfect. This reflects a broader weakness of election forecasts when they are removed from contextual journalism: what do we really learn from simply predicting the outcome of the election? We recognised this issue early in our conversations about strategy and accordingly worked hand-in-hand with our US political correspondents to embrace and explain the models to readers, but we may still have fallen short in fully explaining what we were predicting, and why.
Getting the message of uncertainty across was also a very challenging part of the project. We experimented with different charts and displays to make the concept of the prediction more approachable, avoiding too many numbers in the copy and showing the electoral votes with 60% and 95% confidence intervals by default instead of the raw prediction. The main histogram has a “dartboard mode” that animates and displays the results of thousands of simulations at once, another innovation by our interactive visualisers. Additionally, we tried to show the inner workings of the model with a map of the correlation between states, that is, how their vote intentions move together. Packaging all this information in a way that the average reader could understand was challenging, but nevertheless worthwhile.
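To make the idea of simulations and intervals concrete, here is a minimal Python sketch under stated assumptions. It is not the published model: the five states, their electoral votes, their mean vote shares and both error sizes are invented for illustration. A shared national error makes the states move together, which is a simplified form of the correlation the map displays, and the 60% and 95% intervals are read off the simulated electoral-vote totals.

```python
import random

# Hypothetical inputs: state -> (electoral votes, mean Democratic two-party share).
STATES = {
    "A": (29, 0.51),
    "B": (20, 0.52),
    "C": (16, 0.49),
    "D": (10, 0.53),
    "E": (38, 0.47),
}
NATIONAL_SD = 0.02  # assumed error shared by every state (creates correlation)
STATE_SD = 0.02     # assumed error specific to each state

def simulate_electoral_votes(n_sims, seed=2020):
    """Simulate Democratic electoral-vote totals across the hypothetical states."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        national_error = rng.gauss(0.0, NATIONAL_SD)
        total = 0
        for votes, mean_share in STATES.values():
            share = mean_share + national_error + rng.gauss(0.0, STATE_SD)
            if share > 0.5:
                total += votes
        totals.append(total)
    return totals

def central_interval(draws, level):
    """Central interval covering roughly `level` of the simulated draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper

totals = simulate_electoral_votes(10_000)
interval_60 = central_interval(totals, 0.60)
interval_95 = central_interval(totals, 0.95)
```

Showing `interval_60` and `interval_95` rather than a single number reflects the design choice described above: the reader sees a range of plausible outcomes first, and the individual simulated totals are the raw material for a histogram or a “dartboard” style display.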
In the end, we believe that we developed a suite of data-journalism products that (i) clearly explained how polls work, especially how uncertain they inherently are; (ii) allowed us to focus only on real movements in public opinion, rather than fluctuations from outlier polls or other narratives; and (iii) fostered transparency and academic collaboration in a series of groundbreaking firsts for a data-journalism department in a large mainstream publication.
What can others learn from this project?
First and foremost, we would recommend that other data journalists partner with academics and statisticians to explore opportunities for collaboration. Our partnership with a renowned statistics professor and a graduate student at Columbia University proved very fruitful when creating our advanced models and exploring new ways to push the boundaries of election prediction and data journalism. We are confident that partnerships in other subjects would improve coverage and benefit readers.
Teams considering building election-forecasting models of their own should know that our efforts allowed us to meet several important goals in polling aggregation, election reporting and data journalism:
First, given the performance of the polls in 2016, we wanted to create a forecasting model that explored a larger range of uncertainty in pre-election polls than is popularly acknowledged. We also wanted to incorporate other sources of information that could help us predict the outcome. Because the polling industry is undergoing such massive technological and methodological changes, these steps are necessary for explaining how polls work, and how accurate we can (or can’t) expect them to be.
Second, we found that our model’s ability to identify key states long before the election itself was a valuable tool in allocating reporting resources and avoiding the constant noise and distractions of the news cycle. In this way, newsrooms should view forecasts not only as public-facing investments, but also as internal ones.
Finally, The Economist has strived to lead the industry in embracing standards of transparency and open science. A lack of such transparency was a clear weakness of election forecasting in 2016, plausibly contributing to the public’s misunderstanding of election forecasts and polling. Our public discussions with a community of coders on GitHub have improved our journalism, and we recommend that all outlets follow suit in releasing their data and code wherever possible.