This investigation set out to answer the question: which is cheaper, rail or air travel?
We scraped more than 22,000 ticket prices to find out, and discovered, to our surprise, that on domestic routes the average price of travelling by train was cheaper than flying nine times out of ten. Even more surprisingly, flight prices rose more sharply than train prices when booking was left until the last minute.
Even when we made flights as cheap as possible – removing hand luggage and comparing only the single cheapest departure each day (hello, Ryanair) – trains still beat planes seven times out of ten.
Our investigation became a big story when published in the travel-heavy week leading up to Christmas. It was picked up by national news agencies and all major outlets.
Within Sveriges Radio, alongside the national news stories and social media output we produced, our reporters appeared on air to discuss the findings, and we shared the data with local radio stations, which used it to produce local follow-ups.
The results of our investigation proved controversial, drawing attention from environmentalists and proponents of flying alike. This quickly became obvious from the many – many! – readers and listeners who got in touch, and our story was hotly debated that week, with social media discussions and emails both praising and questioning our methods.
The first article we published about this was the most read of any I published last year – and by far the story with the most audience feedback.
We think this was down to the unexpected outcome. As the Royal Institute of Technology expert we spoke with put it, the perception in Sweden is that trains are expensive, so many people intuitively assume flying is cheaper. Our analysis, backed by a rigorous method, allowed us to correct this misconception.
To scrape the ticket prices, we built scrapers using Python and Selenium. These looped through all desired travel routes and dates – over 1,000 different combinations in total! To speed up such a large scrape, we used Python’s multiprocessing module to run several requests concurrently.
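The shape of that scraping loop can be sketched as below. The route and date lists are hypothetical, the Selenium work is stubbed out, and – so the sketch runs self-contained – it uses the thread-based `ThreadPool` from the same `multiprocessing` module; the process-based `multiprocessing.Pool` the article describes has an identical API.

```python
from itertools import product
from multiprocessing.pool import ThreadPool  # same API as multiprocessing.Pool

# Hypothetical routes and departure dates; the real scrape covered
# over 1,000 route/date combinations.
ROUTES = [("Stockholm", "Gothenburg"), ("Stockholm", "Malmo")]
DATES = ["2023-12-18", "2023-12-19"]

def fetch_price(task):
    (origin, destination), date = task
    # In the real scraper, this step drove a Selenium webdriver to the
    # booking site, searched the route, and read off the listed fares.
    price = 0.0  # placeholder
    return {"origin": origin, "destination": destination,
            "date": date, "price": price}

# Every route/date combination, scraped with several workers in parallel.
tasks = list(product(ROUTES, DATES))
with ThreadPool(processes=4) as pool:
    rows = pool.map(fetch_price, tasks)
```

Each worker handles one route/date combination independently, so the pool size can be tuned to however many browser sessions the machine (and the target site) will tolerate.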
Once the ticket price data had been collected and saved in CSV format, the analysis itself was done in R, largely using tidyverse packages. Grouping the data by travel type, route and date, we calculated the average and lowest ticket price, and then counted how often air or rail travel was cheaper. We additionally grouped the data by the number of weeks between the travel date and the scrape date, which let us see whether air or rail travel increased more in price when booked last minute.
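The team's analysis was done in R with the tidyverse; as a rough illustration of the grouping logic, here is an equivalent sketch in Python with pandas, using a tiny made-up dataset (the column names and prices are invented for the example).

```python
import pandas as pd

# Toy data standing in for the scraped CSV (made-up prices, in SEK).
df = pd.DataFrame({
    "mode":  ["train", "air", "train", "air"],
    "route": ["STO-GOT", "STO-GOT", "STO-MAL", "STO-MAL"],
    "date":  ["2023-12-18"] * 4,
    "price": [395.0, 612.0, 449.0, 380.0],
})

# Average and lowest price per travel type, route and date – the pandas
# equivalent of tidyverse group_by() + summarise().
summary = (df.groupby(["mode", "route", "date"])["price"]
             .agg(avg_price="mean", min_price="min")
             .reset_index())

# Put train and air side by side, then count how often the train's
# average price undercuts the plane's on the same route and date.
wide = summary.pivot(index=["route", "date"], columns="mode",
                     values="avg_price")
train_wins = int((wide["train"] < wide["air"]).sum())

# The same group-summarise-count pattern, keyed instead on the number of
# weeks between scrape date and travel date, shows how each mode's price
# changes as departure approaches.
```

On this toy data the train is cheaper on one of the two routes; over the real 22,000-ticket dataset the same count is what produced the nine-times-out-of-ten result.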
Before publication, a colleague on the data team checked the scrapers and reproduced the results of the data analysis in Python. This is a standard part of our team’s process: we always verify one another’s results before publication, ideally reproducing them with a different method.
Local reports to share data with Sveriges Radio’s local radio stations were produced using R Markdown. The output of our investigation was mainly radio news stories, but it was also published on the Sveriges Radio website and shared on our national and local social media channels. For this, we created visualisations using R’s ggplot2 library, as well as a package our team has built to create graphics in Sveriges Radio’s in-house style.
Context about the project:
Building the scrapers was technically complex and quite time-consuming, but the most difficult part of the project was not necessarily the technical side – as always with data journalism, it was ensuring we were measuring the right thing. Were we getting an accurate picture of people’s travel experiences? What was the fairest way to compare trains and planes?
For instance, we spent a long time debating how to look at the ticket prices: was the average price more reflective of most people’s trips, or the single cheapest ticket each day? Both have their upsides and downsides, and neither paints a complete picture of reality. In the end, after discussing internally with other data journalists and with experts at the Royal Institute of Technology, we focused on the average price, as it gave the more complete picture, while also reporting the single cheapest ticket each day as a secondary measure.
Similar discussions were had over the fairness of the time period we looked at, the type of ticket collected, and so on. Going through our methods with external experts was, as usual, an invaluable step in the process.
What can other journalists learn from this project?
There are a number of concrete things to take away from this project. Hopefully it serves as a good example of how to use Python for a scraping project.
The graphics we created also illustrate how to use R’s ggplot2 package for data visualisation, and how to build a custom package for publication-ready visualisations in an in-house style.
The national story was broken down locally for Sveriges Radio’s local stations around Sweden, and the data was shared by programmatically creating local reports using R Markdown, something which may be of interest to other news organisations with both national and local stations.
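One common way to generate such per-station reports programmatically is to render a parameterised R Markdown document once per station. As a hedged sketch of how that could be driven from a script, the helper below builds the `Rscript` call for each station (the `.Rmd` filename, the `station` parameter, and the station names are all hypothetical; `rmarkdown::render()` with a `params` list is the standard parameterised-report interface).

```python
def render_cmd(station, rmd="local_report.Rmd"):
    """Build the Rscript command that renders one station's report.
    File name and parameter name are hypothetical examples."""
    r_expr = (
        f'rmarkdown::render("{rmd}", '
        f'params = list(station = "{station}"), '
        f'output_file = "report_{station}.html")'
    )
    return ["Rscript", "-e", r_expr]

# Hypothetical station list; the real project covered Sveriges Radio's
# local stations around Sweden.
stations = ["Stockholm", "Malmo"]
cmds = [render_cmd(s) for s in stations]
# Each command could then be executed with subprocess.run(cmd, check=True).
```

Keeping the report template in one `.Rmd` file and varying only the parameter means every station gets the same analysis on its own slice of the data.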