In response to mass shootings, some schools installed devices with algorithms that purport to identify aggressive voices before violence erupts. Our data analysis found this technology unreliable.
Our reverse engineering found that while the device tended to infer aggression from strained voices, it had troubling blind spots. The device raised false alarms for benign sounds like laughing, coughing and cheering, yet high-pitched screaming often failed to trigger an alarm.
This confirmed our reporting, which found that the algorithm often raised false alarms yet failed to trigger during a dangerous situation at a New Jersey hospital.
The story raised questions about the effectiveness of electronic surveillance devices installed in the name of safety. It also examined the rise of uninterpretable and unappealable black-box algorithms that pass judgment on people without their consent or knowledge.
After our story, the ACLU, citing our investigation, urged school districts and state legislatures to ban surveillance technologies such as facial and voice surveillance, as well as social media monitoring in schools.
We purchased the device to reverse-engineer and test it. We rewired its programming so we could feed it any sound clip of our choosing, then played gigabytes of sound files for the algorithm and recorded its prediction for each.
The result was a database of public voices and sounds. Each row consisted of a snippet of sound, its spectrum or “fingerprint” (generated using our data pipeline) and the algorithm’s prediction. We then used the millions of rows in this database to reverse-engineer the algorithm and analyze where it could be flawed.
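To make the idea of a spectrum “fingerprint” concrete, here is a minimal sketch of how one could be computed for a single snippet with numpy. It is an illustration under stated assumptions, not our actual pipeline: the function name, the Hann window, the 16 kHz sample rate, the synthetic 440 Hz tone and the row layout are all hypothetical.

```python
import numpy as np


def spectrum_fingerprint(samples, sample_rate):
    """Return (frequencies in Hz, magnitudes) for one mono sound snippet."""
    samples = np.asarray(samples, dtype=float)
    # A Hann window tapers the snippet's edges to reduce spectral leakage.
    windowed = samples * np.hanning(len(samples))
    # Real-input FFT: magnitude of each frequency component.
    magnitudes = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs, magnitudes


# Synthetic stand-in for a recorded snippet: a 440 Hz tone, 0.5 s at 16 kHz.
sample_rate = 16_000
t = np.arange(int(0.5 * sample_rate)) / sample_rate
snippet = np.sin(2 * np.pi * 440.0 * t)

freqs, mags = spectrum_fingerprint(snippet, sample_rate)
peak_hz = float(freqs[np.argmax(mags)])

# Hypothetical database row; "prediction" would be filled in from the device.
row = {"peak_hz": peak_hz, "spectrum": mags, "prediction": None}
print(f"strongest frequency: {peak_hz:.1f} Hz")
```

In practice a snippet of speech has energy spread across many frequencies, so features would be derived from the whole spectrum rather than from a single peak.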
After this preliminary testing, we ran several real-world experiments to confirm our suspicions. We recorded the voices of high school students in real-world situations, collected the algorithm’s predictions and analyzed them.
All of the programming for the data analysis and machine learning was done in Python, in Jupyter Notebooks. We used scientific computing packages such as scipy, numpy and pandas for data transformation and analysis; scikit-learn, statsmodels and xgboost for machine learning and reverse engineering; librosa for sound analysis; and matplotlib and seaborn for data visualization.
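One common way to reverse-engineer a black box with these tools is to fit a simple, readable “surrogate” model to its recorded predictions and inspect the rules it learns. The sketch below shows that general technique with scikit-learn; it is not a description of our specific models. The feature names, the synthetic data and the invented alarm rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical per-snippet features; real ones would be derived from each spectrum.
n = 2000
pitch_hz = rng.uniform(80, 1000, n)
roughness = rng.uniform(0, 1, n)
X = np.column_stack([pitch_hz, roughness])

# Stand-in for the black box's recorded outputs (an invented rule, not the device's).
black_box_alarm = ((roughness > 0.6) & (pitch_hz < 700)).astype(int)

# Fit a shallow tree on most rows and hold out the rest to check agreement.
X_train, y_train = X[:1500], black_box_alarm[:1500]
X_test, y_test = X[1500:], black_box_alarm[1500:]
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, y_train)

# Fidelity: how often the surrogate matches the black box on unseen rows.
fidelity = surrogate.score(X_test, y_test)
print(f"surrogate agrees with recorded predictions on {fidelity:.1%} of held-out rows")
print(export_text(surrogate, feature_names=["pitch_hz", "roughness"]))
```

A shallow tree is used here because its printed rules are easy to read. A high fidelity score only means the surrogate mimics the recorded outputs well, not that it reproduces the device's internal logic.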
What was the hardest part of this project?
We overcame numerous technical challenges in order to peer inside a black-box algorithm and understand how it passes incorrect judgments about ordinary people. Previous ProPublica investigations analyzed machine-learning algorithms that make predictions from structured, tabular data, but we had not yet seen any investigative reporting that analyzed machine learning on unstructured data, such as video or audio.
We had to be innovative in how we collected and recorded the algorithmic output of the device. We modified the programming of the device so that we could test it with our own custom-written software. The result was a multi-million-row database of sound clips and their purported “aggression” scores, which allowed us to reverse-engineer the algorithm.
Raw sound data is also difficult to analyze, so we used techniques from signal analysis and processing that are rarely used in data journalism, such as Fourier transforms, to derive a spectrum (essentially a fingerprint) for individual snippets of sound. We studied academic papers in audio analysis and interviewed researchers in the field to arrive at a set of data features that we could derive and analyze from a sound spectrum.
Finally, we overcame a number of technical challenges in our field testing. We reproduced an in-school setup as closely as possible in our experiments. We recorded our sound using microphones bought from the company and relied on company manuals to set their location, distance and height.
When we brought our findings to the company, it disagreed with our conclusions, citing an audio engineering phenomenon known as “clipping,” in which a microphone becomes overwhelmed by too much noise, distorting the sound and potentially throwing off the algorithm’s readings. We found clipping in only a small subset of our data, but to be sure, we re-recorded the sound, controlling for clipping. Our results remained the same.
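A simple way to screen recordings for clipping is to measure how many samples sit at or near the maximum amplitude the recording can represent. The sketch below illustrates that idea with numpy; it is not the exact check we ran. The function name, the tolerance and the synthetic signals are assumptions, and samples are assumed to be scaled so that full scale is 1.0.

```python
import numpy as np


def clipped_fraction(samples, full_scale=1.0, tolerance=1e-3):
    """Fraction of samples at or near the recording's maximum amplitude."""
    samples = np.asarray(samples, dtype=float)
    near_limit = np.abs(samples) >= full_scale * (1.0 - tolerance)
    return float(np.mean(near_limit))


# Synthetic one-second test signals at 16 kHz (hypothetical stand-ins for recordings).
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)

# Simulate an overloaded microphone: amplify, then cut off at full scale.
overloaded = np.clip(4.0 * clean, -1.0, 1.0)

print(f"clean: {clipped_fraction(clean):.1%} of samples near full scale")
print(f"overloaded: {clipped_fraction(overloaded):.1%} of samples near full scale")
```

A clean recording should show almost no samples at the limit, while an overloaded one shows long flat runs there. Recordings above some chosen threshold could then be flagged for re-recording.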
What can others learn from this project?
Our project showed that with the right tools and expertise, extensive reporting and advice from experts, it is possible to peer inside the black box of a machine-learning algorithm. Journalists are often the first to report on misuses of flawed technology, and with machine-learning algorithms becoming more widely used, it’s essential to help the public understand the impacts of machine-assisted decision making at scale.
We wrote an extensive methodology to help other reporters and researchers do similar investigations in the future. We also put a lot of thought into how best to present a complex, technical topic such as machine learning to our readers. The integration of audio and video into our story helped give readers a sense of the device’s flaws that could not be conveyed by words alone.