Over the past 70 years, Germany’s federal parliament, the Bundestag, has met in 4,216 sessions. The stenographers working in parliament transcribed more than 200 million words over those years. Only tiny excerpts of these speeches make it into the news; the rest is recorded and then disappears into the archives. Yet the transcripts are a valuable source and dataset. We have therefore made all speeches since 1949 analysable in an interactive tool. Readers can explore for themselves which issues were debated when, and how the language relating to those issues has changed over time.
Just in time for the 70th anniversary of the Federal Republic of Germany, our tool makes it possible to look at the history of the country as if through a magnifying glass, to understand its twists and turns, its development. The curves show which debates were big and verbose, which were small. What was frequently discussed and when, what was rarely or never discussed in all these years, although it might have been important?
We discovered, for example, that in 1983 there were long debates in the Bundestag about whether fibre optic cables should be laid in large cities in order to transmit larger amounts of data. But neither the Federal Post Office nor the Federal Government succeeded in pushing through this innovation. It took more than 30 years to realise how important fibre optic technology is. As a result, Germany is known today for its very slow Internet.
Based on the news app, we created a text series called Bundeswörter (https://www.zeit.de/serie/bundeswoerter). Other media also reported on the project. We received requests from students and researchers, including from Oxford University and the UN Security Council, and we provided the researchers with our data. The project has also inspired endeavours in other countries such as Austria, where a similar project was implemented based on our idea. Furthermore, a book project with a major publishing house is in progress.
The project was also a huge success on social media. Our hashtag #bundeswoerter was number one in Twitter's trending topics for a day, and the article was shared thousands of times. Readers shared not only the article itself but thousands of different graphs, since we included a sharing option for every state of the interactive graphic.
The front end was mainly developed using React (with libraries such as react-select) and D3. For the data work, we used Python: we scraped the data from the website of the German Bundestag using the Beautiful Soup and Requests libraries and parsed it using regular expressions.
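The parsing step could look roughly like the following. This is a minimal sketch, not our production parser: it assumes a hypothetical plain-text layout in which every speech starts with a line such as "Dr. Schmidt (SPD):", and the names SPEAKER_LINE and split_speeches are made up for illustration. The real transcripts are considerably messier, and the download step with Requests and Beautiful Soup is left out here.

```python
import re

# Assumed layout (hypothetical): a line like "Dr. Schmidt (SPD):" opens each speech.
SPEAKER_LINE = re.compile(
    r"^(?P<name>[^\n():]+?)[ \t]*\((?P<party>[^()\n]+)\):[ \t]*$",
    re.MULTILINE,
)


def split_speeches(transcript):
    """Split a transcript into one dict per speech (speaker, party, text)."""
    matches = list(SPEAKER_LINE.finditer(transcript))
    speeches = []
    for i, match in enumerate(matches):
        # A speech runs until the next speaker line or the end of the transcript.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        speeches.append({
            "speaker": match.group("name").strip(),
            "party": match.group("party").strip(),
            "text": transcript[match.end():end].strip(),
        })
    return speeches


# Invented sample text, only for illustration.
sample = """Dr. Schmidt (SPD):
Meine Damen und Herren, wir brauchen Glasfaser.

Frau Meier (CDU):
Das sehe ich anders.
"""
speeches = split_speeches(sample)
```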
In the next step, we broke this body of text down into individual words and created a separate document for each word, containing its absolute frequency as well as its frequency per year. We standardised for document length, since the number of speeches in the Bundestag has varied considerably over the years and the counts would otherwise not have been comparable.
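The idea of one document per word, with yearly counts adjusted for the amount of text in each year, can be sketched as follows. The tokenising pattern, the lowercasing and the scaling to occurrences per 100,000 words are assumptions chosen for illustration, and build_word_documents is a hypothetical name.

```python
import re
from collections import Counter, defaultdict

# Simplified tokeniser (assumption): runs of letters, including German umlauts.
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def build_word_documents(speeches, per=100_000):
    """Build one document per word from (year, text) pairs.

    Each document holds the absolute total and, for every year in which
    the word occurs, its frequency per `per` words spoken in that year.
    """
    counts_by_year = defaultdict(Counter)
    for year, text in speeches:
        counts_by_year[year].update(w.lower() for w in WORD.findall(text))

    docs = {}
    for year, counts in counts_by_year.items():
        words_in_year = sum(counts.values())
        for word, n in counts.items():
            doc = docs.setdefault(word, {"word": word, "total": 0, "per_year": {}})
            doc["total"] += n
            doc["per_year"][year] = n * per / words_in_year
    return docs


# Invented mini corpus: 1983 contains twice as much text as 2019.
docs = build_word_documents([
    (1983, "Glasfaser Glasfaser Post Kabel"),
    (2019, "Glasfaser Internet"),
])
```

In this toy corpus the word occurs twice in 1983 and once in 2019, but because 1983 contains twice as many words overall, the scaled frequency is the same in both years.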
We indexed the documents using the Lucene-based open source search technology Elasticsearch. Elasticsearch not only offers fast search options for texts, but also has several features for decomposing and analysing these texts. For example, we implemented our word suggestions during input with the so-called Completion Suggester. This automatic dropdown also allowed us to forgo a spell check. We ran the database in a Docker container, so that we could scale performance to serve a large number of readers efficiently. We have blogged in detail about the back end technology: https://blog.zeit.de/dev/reden-im-bundestag-auf-knopfdruck-skalierbar/ (German)
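For readers unfamiliar with the Completion Suggester, the request bodies involved look roughly like this. The sketch only builds the JSON payloads as Python dictionaries and does not contact a cluster. The field names "word", "total" and "suggest", the suggestion name "word-suggest" and the use of the total count as a ranking weight are assumptions for illustration, and the mapping follows the typeless format of Elasticsearch 7 and later, which may differ from the version we ran.

```python
import json

# Index mapping with a dedicated field of type "completion" (field names assumed).
INDEX_BODY = {
    "mappings": {
        "properties": {
            "word": {"type": "keyword"},
            "total": {"type": "integer"},
            "suggest": {"type": "completion"},
        }
    }
}


def word_document(word, total):
    """Document for one word; the weight ranks frequent words first (assumption)."""
    return {
        "word": word,
        "total": total,
        "suggest": {"input": word, "weight": total},
    }


def suggest_body(prefix, size=5):
    """Search body asking the Completion Suggester for words starting with `prefix`."""
    return {
        "suggest": {
            "word-suggest": {
                "prefix": prefix,
                "completion": {"field": "suggest", "size": size},
            }
        }
    }


body = suggest_body("glasf")
payload = json.dumps(body)  # what would be sent to the index's _search endpoint
```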
What was the hardest part of this project?
The biggest challenge with our news app was to provide a perfect interface. It needed to be absolutely focused and immediately understandable and usable by everyone, but also able to display special features of the language in an uncomplicated way. We consciously decided against using more sophisticated natural language processing methods and chose an approach that all of our readers could understand: counting words.
During the course of the project, we repeatedly discussed how much we wanted to preprocess the text data. For instance, we tried lemmatisation to group related word forms such as conjugated verbs. However, initial tests and discussions with linguists showed that lemmatisation and stemming are still very error-prone, especially in German. We therefore decided on a simpler variant: we offer the possibility of adding up the counts for several words.
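Adding up the counts for several words is simple enough to sketch in a few lines. The function name combine_counts and the numbers below are invented for illustration; the point is only that the reader, not a lemmatiser, decides which word forms belong together.

```python
from collections import Counter


def combine_counts(per_year_counts, words):
    """Sum the yearly counts of several words into a single series.

    per_year_counts maps a word to a {year: count} dict; unknown words
    are skipped silently.
    """
    combined = Counter()
    for word in words:
        combined.update(per_year_counts.get(word, {}))
    return dict(sorted(combined.items()))


# Invented counts for three forms of the verb "gehen".
per_year_counts = {
    "gehen": {1990: 4, 1991: 2},
    "ging": {1990: 1},
    "gegangen": {1991: 3},
}
series = combine_counts(per_year_counts, ["gehen", "ging", "gegangen"])
```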
Another challenge was the sheer quantity of data: After parsing the 4,216 text documents since September 7, 1949, the date of the founding of the German Bundestag, we had about 200 million words. To allow a large number of readers to access the entire database at the same time, we had to provide a robust technical infrastructure.
What can others learn from this project?
Analysing large data sets in a newsroom often results in a single article or data visualisation. Oftentimes, these stand-alone projects do not justify the amount of work that goes into the process of cleaning and analysing the data. More importantly, many stories that can be found in the data go untold. This is why we started a series of smaller articles using the tool as a starting point for investigations.
We built the news app with a focus on social media sharing options, including a back end solution for personalised sharing images. It always pays off for us if we can break down information for our readers to the point where they can find out exactly what something means to them. That’s why we think of news applications as shareable elements that can be personalised and shared via social media. In this application, our readers could share a personalised graphic and a parameterised URL leading to their own findings. Thousands did so on Facebook and Twitter.
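A parameterised URL of this kind can be sketched with Python's standard library. The base URL and the parameter names "w", "from" and "to" are hypothetical and not the ones used in our application; the sketch only shows how the state of a graphic (selected words and year range) can be written into a link and read back from it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://example.org/bundestag-woerter"  # hypothetical address


def share_url(words, start_year, end_year):
    """Encode the selected words and year range as query parameters."""
    query = urlencode({"w": words, "from": start_year, "to": end_year}, doseq=True)
    return f"{BASE_URL}?{query}"


def read_state(url):
    """Recover the graphic's state from a shared URL."""
    params = parse_qs(urlsplit(url).query)
    return {
        "words": params.get("w", []),
        "from": int(params["from"][0]),
        "to": int(params["to"][0]),
    }


url = share_url(["Glasfaser", "Datenübertragung"], 1949, 2019)
state = read_state(url)
```

Because the whole state lives in the URL, the same parameters could also be handed to a back end that renders the matching sharing image.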
It should also be noted that there were no differences in hierarchy within the interdisciplinary team of 14 people. Back end developers, investigative journalists and designers were all equally authors.
Google Translate: translate.google.com/translate?hl=de&sl=de&tl=en&u=https%3A%2F%2Fwww.zeit.de%2Fpolitik%2Fdeutschland%2F2019-09%2Fbundestag-jubilaeum-70-jahre-parlament-reden-woerter-sprache-wandel