The Pentagon’s elite language school happens to be located in our town of Monterey, California. Known as the Defense Language Institute, it is an important but opaque part of our local community. I filed a data request for the number of students enrolled in each language at the institute going back to the school’s founding in the 1960s. The data eventually arrived—but in an arcane format. I cleaned, analyzed, and visualized the data, mapping out how the curriculum and enrollment levels changed over time in response to the shifting priorities of U.S. foreign policy.
The Defense Language Institute looms large in our town, but the community knows little about it. I was completing my story just as the U.S. assassinated top Iranian commander Qassem Soleimani. With the help of the data and sources like former U.S. Secretary of Defense Leon Panetta, I showed that alumni of the institute had possibly been involved in the operation.
The story became the talk of the town, and is still referred to by readers even a year later. We are always being asked to apply the same data journalism techniques from that story to new topics.
I filed a request with the U.S. Army for data on the enrollment levels for each language taught at the Defense Language Institute going back to its founding. After a few months, I received hundreds of pages compiled into one PDF. The pages were computer printouts that had been scanned and digitized.
They showed tables of numbers, but since they were manually scanned, the tables were all slightly crooked or warped. I tried using various OCR and extraction tools to scrape the tables and convert them into CSV files, but nothing worked well enough. Eventually, I tried Abbyy FineReader, and that finally did the job.
Once I had a master CSV file, I used Google Sheets and Excel to clean the data. I sorted, filtered, and created pivot tables to study the data. I quickly obtained facts that even the language institute didn’t and couldn’t know, such as the top language for each year, the top language overall and even just the list of all languages ever taught at the school.
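For readers who prefer code to spreadsheets, the same three questions (top language per year, top language overall, list of all languages) could be answered with a pivot table in Python using pandas. This is only a minimal sketch, not the workflow used for the story, which relied on spreadsheet software. The column names `year`, `language` and `enrolled` are assumptions, and the numbers are placeholders for illustration, not real enrollment figures.

```python
import pandas as pd

# Hypothetical long-format records: one row per (year, language) pair.
# Placeholder values for illustration only, NOT real enrollment data.
records = pd.DataFrame(
    {
        "year": [2001, 2001, 2001, 2002, 2002, 2002],
        "language": ["Language A", "Language B", "Language C"] * 2,
        "enrolled": [120, 300, 80, 150, 90, 400],
    }
)

# Pivot table: one row per year, one column per language.
by_year = records.pivot_table(
    index="year",
    columns="language",
    values="enrolled",
    aggfunc="sum",
    fill_value=0,
)

# Top language for each year: the column holding each row's maximum.
top_per_year = by_year.idxmax(axis=1)

# Top language overall: the column with the largest total across all years.
top_overall = by_year.sum().idxmax()

# Every language that appears anywhere in the data.
all_languages = sorted(records["language"].unique())

print(top_per_year)
print("Top overall:", top_overall)
print("Languages:", all_languages)
```

With real data, the `records` frame would be loaded from the master CSV with `pd.read_csv` instead of being typed in by hand.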
From there, I used Flourish to visualize the data. My visualizations were interactive charts, among them a bar chart race. I also created a searchable database with information on each language taught. The story also includes a link allowing the public to download the data and use it as they see fit.
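Chart tools usually want the data in a particular shape. A bar chart race typically needs a "wide" table, with one row per category and one column per time period, so a long list of (year, language, enrollment) rows has to be reshaped first. The sketch below shows one way that reshaping could be done with pandas. The column names and numbers are placeholders for illustration, and the wide layout is an assumption about what the chart template expects, so check the tool's own documentation before uploading.

```python
import pandas as pd

# Hypothetical long-format records with placeholder values,
# NOT real enrollment data.
records = pd.DataFrame(
    {
        "year": [2001, 2001, 2001, 2002, 2002, 2002],
        "language": ["Language A", "Language B", "Language C"] * 2,
        "enrolled": [120, 300, 80, 150, 90, 400],
    }
)

# Reshape long -> wide: one row per language, one column per year.
wide = (
    records.pivot(index="language", columns="year", values="enrolled")
    .fillna(0)      # a language not taught in a given year counts as 0
    .astype(int)
)

# Use plain string column headers and turn the index back into a column.
wide.columns = [str(year) for year in wide.columns]
wide = wide.reset_index()

# Serialize to CSV text; pass a file path to to_csv() to write a file
# for upload instead.
csv_text = wide.to_csv(index=False)
print(csv_text)
```

The same wide table can also serve as the public download, since it opens cleanly in any spreadsheet program.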
What was the hardest part of this project?
The hardest part of this project was working on it alone. I am part of a very small newsroom with no history of data journalism and no one else with data skills. I had to pitch the idea and convince my editor it was worth the effort. Then I had to navigate the U.S. Army FOIA process, which includes writing to lovely addresses such as “email@example.com.” I also had to do the data analysis and visualization on my own, figuring out which skills and tools I was missing and acquiring them as I went.
What can others learn from this project?
One of the most important takeaways is probably the potential to blur the divide between local stories and national or global ones. In virtually every community, there are institutions or people who play a part in some international saga. Local readers want to know not just about their own communities’ affairs but also about how their community fits into the larger world. And because data reporting is still in its infancy in local journalism, there are a million unique data sets like the one I found waiting to be requested.