Peering Into The Black Box: Investigation Into Social Sentinel
Entry type: Single project
Country/area: United States
Publishing organisation: The Dallas Morning News, the Investigative Reporting Program at UC Berkeley’s Graduate School of Journalism, The Pulitzer Center
Organisation size: Big
Publication date: 2022-09-20
Authors: Arijit D. Sen and Derêka K. Bennett
Arijit (Ari) D. Sen is a computational journalist on the investigative team at The Dallas Morning News. Sen has a master’s degree in journalism and a certificate in applied data science from UC Berkeley and a bachelor’s degree in journalism from UNC-Chapel Hill. He previously wrote for NBC News and the Asheville Citizen-Times.
Derêka K. Bennett is a reporter at the Investigative Reporting Program at UC Berkeley’s Graduate School of Journalism and an independent documentary filmmaker. She has a master’s degree from UC Berkeley in video journalism and a bachelor’s degree from the University of Michigan in screen studies and journalism.
Our investigation reveals more than three dozen colleges have purchased an AI surveillance technology called Social Sentinel to monitor students’ social media.
Despite publicly claiming its service was not a surveillance tool, in emails, Social Sentinel repeatedly promoted its ability to “forestall” protests, offered a feature allowing schools to enter keywords related to contentious events and even authored a whitepaper about how it was effective for monitoring demonstrations. Social Sentinel also claimed its AI could help schools prevent suicides and shootings. But in our investigation, we found no evidence that a student’s life had been saved because of the service.
Shortly after we published the first major story, a North Carolina legislator launched an inquiry into state schools’ use of surveillance technology and UNC ended its contract with Social Sentinel. The story also received attention from local, state and national media organizations. Most hearteningly, at least 15 college newspapers wrote about our story, and it was taught in several university courses across the country, fulfilling our goal of raising awareness about school surveillance and inspiring students to pursue investigative reporting on their campuses.
We conducted a computational analysis of nearly 4,200 posts flagged by the service, which we obtained either from the documents themselves or through links to PDFs located inside the documents.
We relied primarily on the Python programming language for this task. To parse the PDF files, we first ran optical character recognition with the command-line tool ocrmypdf, called from inside a Jupyter Notebook. We then wrote code to loop through each document, extract the text of the PDF and separate out each alert, using the pdfplumber library and Python list comprehensions. Once these alert chunks were created, we parsed each chunk with regular expressions to obtain as much data as we could, including the text of the post. These elements were stored as a list of dictionaries, which was later transformed into a pandas DataFrame, cleaned and ultimately stored as a CSV file and a Google Sheet. We then read this spreadsheet into another Jupyter Notebook for further analysis.
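The chunk-and-parse step described above can be sketched roughly as follows. This is a minimal illustration, not our original code: the alert layout ("Alert #", "Date:", "Source:", "Post:") and the names split_alerts, parse_alert and to_csv are hypothetical, the sample text stands in for text already extracted from an OCR'd PDF, and the standard-library csv module is swapped in for pandas so the example runs without extra dependencies.

```python
import csv
import io
import re

# Hypothetical alert layout, invented for illustration only.
# In practice this text would come from the OCR'd PDFs.
SAMPLE_TEXT = """Alert #1
Date: 2019-03-04
Source: Twitter
Post: example post text one
Alert #2
Date: 2019-03-05
Source: Twitter
Post: example post text two
"""

# Assumed markers: each alert starts with "Alert #<n>" and has labeled fields.
ALERT_START = re.compile(r"^Alert #\d+", flags=re.MULTILINE)
FIELD = re.compile(r"^(Date|Source|Post):[ \t]*(.*)$", flags=re.MULTILINE)


def split_alerts(text):
    """Split extracted text into one chunk per alert."""
    starts = [m.start() for m in ALERT_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


def parse_alert(chunk):
    """Pull the labeled fields out of one alert chunk into a dictionary."""
    return {key.lower(): value.strip() for key, value in FIELD.findall(chunk)}


def to_csv(records):
    """Serialize the list of dictionaries as CSV text (stand-in for pandas)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["date", "source", "post"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# One dictionary per alert, built with a list comprehension.
records = [parse_alert(chunk) for chunk in split_alerts(SAMPLE_TEXT)]
print(to_csv(records))
```

Keeping the splitting and the field parsing in separate functions makes it easier to adjust the regular expressions when a school's export uses a slightly different layout.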
To conduct our analysis, we used the Natural Language Toolkit (nltk) library to remove all punctuation and so-called “stop” words (“the”, “but”, “and”, “or”, etc.) from the posts. We then used nltk’s word_tokenize function to get a list of the posts’ remaining words and counted the number of times each word appeared across all alerts. We visualized these counts as a bar chart, first within the Notebook using the Plotly library, then for the final story in Datawrapper. We also experimented with topic modeling with the language model BERT, specifically the BERTopic package and Hugging Face’s pre-trained transformers, but did not ultimately present this analysis due to package deprecation and difficulties explaining it to non-technical readers.
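The word-frequency step can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in rather than our actual notebook code: a short hand-written stop-word set replaces nltk's much longer English list, a regular expression replaces nltk's word_tokenize, and the example posts and the name count_words are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; nltk's English list is far longer.
STOP_WORDS = {"the", "but", "and", "or", "a", "to", "is", "in", "of"}


def count_words(posts):
    """Count how often each non-stop word appears across all posts."""
    counts = Counter()
    for post in posts:
        # Lowercase, then keep runs of letters and apostrophes.
        # Everything else, including punctuation, is dropped.
        tokens = re.findall(r"[a-z']+", post.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts


# Invented example posts, not real alerts.
posts = ["The protest is on the quad.", "Protest and rally, in the quad!"]
counts = count_words(posts)

# The most frequent words are what a bar chart would display.
for word, n in counts.most_common(3):
    print(word, n)
```

A Counter keeps the tallying logic short, and its most_common method returns the ranked list that feeds directly into a bar chart.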
Context about the project:
This investigation originated when one of the reporters, Ari Sen, was a student at UNC-Chapel Hill. At the time, the school was dealing with protests over a Confederate statue on campus, and Sen wanted to know what campus police were doing behind the scenes. He filed a records request and obtained thousands of pages of documents. Buried within them was a contract for Social Sentinel. After writing a story on UNC’s use of the service for NBC News, Sen suspected that other colleges were using the service to monitor campus protests as well, and applied to graduate school with the intention of making that topic his thesis. After being accepted to Berkeley, he soon developed a pitch for the story and was paired up with the other reporter on the story, Derêka Bennett. Together they spent more than a year reporting the story for The Dallas Morning News until the project’s publication in late 2022. All told, it took more than three years from conception to publication.
Up until the moment of publication, we faced significant delays in getting the records we requested, astronomical cost estimates for information (one school quoted us $40,000) and university lawyers who raised obscure and ridiculous objections to providing the material. Some schools also claimed that material did not exist when it was eminently clear that it did. For example, several schools failed to provide the alerts from the service even though they were included in the scope of every records request we filed. We also had issues with universities providing information as non-machine-readable PDF files, which did not allow us to click links to obtain more information. We were able to resolve some of these issues by obtaining more funding from the Pulitzer Center and by frequently working with public records officers to craft requests that allowed us to access the information we needed in a more timely and less expensive manner.
Despite these challenges, this project was, to our knowledge, the first and only comprehensive examination of the use of social media surveillance software on college campuses, the first to identify many of those colleges and also the first to reveal the systemic marketing and use of these tools to monitor campus demonstrations. Though we had many seasoned editors and advisors on this investigation, all of the work was completed by two twenty-somethings while both were still in graduate school, where they had to juggle several other class assignments and projects. For both, it was the first major investigation they had authored themselves. The first story in particular was enormously well-received and shocking to campus, local, regional and national media outlets and to students at affected universities, inspiring many of them to conduct investigations of their own.
What can other journalists learn from this project?
I hope that when other journalists read this project they become more aware of the spread of surveillance technology, from intelligence agencies to local police down to tiny college campuses. Our digital traces are increasingly being tracked and used against us in large and small ways. If more reporters are inspired to go after those stories, I would judge this project to be a success.
I hope this story also paves the way for more reporting about how algorithms and AI are changing our lives. I’ve been lucky enough to witness the beginnings of this field of reporting as an inaugural fellow in the Pulitzer Center’s AI Accountability Network. Whether as part of a formal cohort, with the support of a major newsroom or on your own, we need more projects like these.
Finally, I hope our success shows student journalists that they can do serious investigative reporting on their campuses, and that their stories and voices can change things. I started this project as a college student at UNC. Three years later, I got them to end their use of the surveillance tool I discovered on their campus.