For this groundbreaking investigation into Google’s flagship product, Google Search, we used novel computational techniques to expose how the company routinely boosts its own products, pushing down “organic” search results leading to other websites.
The findings of this investigation and a follow-up story informed debate in Washington and a congressional antitrust committee’s calls for wide-ranging regulation of the tech giant and its brethren.
The facts we uncovered were cited in all three antitrust lawsuits filed against Google by the Department of Justice and state attorneys general last year. The Department of Justice’s historic antitrust lawsuit alleged that Google “has pushed the organic links further and further down the results page” and referenced training documents we obtained with specific instructions to Google employees to avoid using phrases like “market share” and “dominant.”
A second suit, filed by 10 states led by the attorney general of Texas, referred to Google’s “walled garden,” while the third, filed by 38 states led by the attorneys general of Colorado, Nebraska, Iowa, and Tennessee, emphasized how Google redirects search traffic to itself.
Our findings were also cited as proof that Google had built a “walled garden” during the questioning of Google CEO Sundar Pichai at a congressional antitrust hearing with the nation’s four tech giants. The final report from the House Judiciary antitrust subcommittee’s year-long investigation into big tech also mentioned our work as evidence of Google’s monopolistic behavior.
We scraped Google Trends to source the top search queries for all available categories (business, entertainment, science and technology, sports, and top stories) every six hours for two months. To do this, we had to reverse engineer Google’s client-side API by inspecting network requests in the browser’s “Dev Tools,” copying the cURL request, and adapting it to run as a standalone API client.
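As a rough illustration of replaying a copied browser request on a schedule, here is a minimal Python sketch. It is not our production code. The endpoint `TRENDS_URL`, the `category` parameter name, and the header values are hypothetical placeholders, since the real request details are not reproduced here. The `fetch` and `sleep` arguments are injected so the loop can be exercised without touching the network.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder, NOT the real endpoint captured from the browser.
TRENDS_URL = "https://trends.example.invalid/api/top-queries"

# The five categories named in the text.
CATEGORIES = ["business", "entertainment", "science and technology",
              "sports", "top stories"]

POLL_INTERVAL_SECONDS = 6 * 60 * 60  # one collection round every six hours


def build_request(category, headers=None):
    """Rebuild the copied browser request as a plain urllib Request."""
    # The parameter name "category" is an assumption for illustration.
    query = urlencode({"category": category})
    merged = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
    merged.update(headers or {})
    return Request(f"{TRENDS_URL}?{query}", headers=merged)


def build_all_requests():
    """One request per category."""
    return [build_request(category) for category in CATEGORIES]


def collect(fetch, rounds, sleep=time.sleep):
    """Call fetch(request) for every category, `rounds` times,
    pausing six hours between rounds (but not after the last one)."""
    results = []
    for i in range(rounds):
        results.extend(fetch(request) for request in build_all_requests())
        if i < rounds - 1:
            sleep(POLL_INTERVAL_SECONDS)
    return results
```

In a real run, `fetch` could wrap `urllib.request.urlopen` and store the response body with a timestamp; here it is left to the caller.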
We then used these search queries to perform Google searches on a mobile emulator we created using Selenium. Like the search queries, we maintained this continuous data collection for two months.
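A mobile emulator of this kind could be sketched as follows. This assumes Selenium 4 driving Chrome through its `mobileEmulation` option; the browser choice, the device metrics, and the user-agent string are illustrative assumptions, not a record of our actual setup. Selenium is imported lazily so the helper functions can be used without it installed.

```python
from urllib.parse import urlencode

# Illustrative device profile (assumed values, roughly a modern phone).
MOBILE_EMULATION = {
    "deviceMetrics": {"width": 375, "height": 812, "pixelRatio": 3.0},
    "userAgent": (
        "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) "
        "Version/13.0 Mobile/15E148 Safari/604.1"
    ),
}


def search_url(query):
    """Build the search URL for one query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


def make_mobile_driver():
    """Start Chrome with mobile emulation (assumes Chrome and a driver exist)."""
    from selenium import webdriver  # lazy import: only needed for a live run

    options = webdriver.ChromeOptions()
    options.add_experimental_option("mobileEmulation", MOBILE_EMULATION)
    return webdriver.Chrome(options=options)


def save_results_page(driver, query, path):
    """Load the results page for `query` and save its HTML source to `path`."""
    driver.get(search_url(query))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(driver.page_source)
```

Saving the page source alongside a screenshot keeps the raw material available for later parsing and for checking the parsers' output by eye.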
We processed the data using BeautifulSoup, Pandas, and Selenium (more on this step in the next question).
Because the project was experimental, we built in rigorous error checking and validation. We used the drawing library p5.js to automatically annotate screenshots with bounding boxes indicating the space occupied by Google elements and our other categories (more on this categorization scheme in the next question). We used the annotation software Prodigy to record the precision and accuracy of bounding boxes determined by our parsers.
What was the hardest part of this project?
Developing parsers for Google Search required intimate domain knowledge and innovative technology.
There is no existing taxonomy of the enormous variety of results delivered in response to a Google search. We created a classification system that was robust and general enough to apply to all mobile Google search results. This process required four months of research, interviews, and sifting through troves of source code.
We then had to encode this knowledge into automated web parsers. It took months longer to build a total of 68 unique parsers spanning more than 1,000 lines of code to identify elements in our five-category classification system: Google answers, Google products, non-Google, Ads, and AMP.
Much of this process overlapped with our time spent developing our classification system, for a total of more than six months for both.
Still, we weren’t done. Our story had a particular need: to measure the placement and prominence of each of these categories of search results. To quantify this spatial data, we created a novel web parsing technique, inspired by the biology lab assay, that “stains” the area occupied by elements in each of our five categories.
We achieved this by taking the XPath of each element categorized by our parsers and re-rendering the parsed search pages in the Selenium mobile emulator. This yielded spatial metadata (coordinates and dimensions) for the categorized web elements, which allowed us to quantify Google’s preferential treatment of its own properties.
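To make the measurement step concrete, here is a minimal Python sketch of turning bounding boxes into a share of page area per category. It is a simplification under stated assumptions: the category labels are shorthand for our five categories, each box is a dict with `x`, `y`, `width`, and `height` (the same shape Selenium returns from a `WebElement`'s `rect` property), and boxes are assumed not to overlap one another.

```python
from collections import defaultdict

# Shorthand labels for the five categories described in the text.
CATEGORIES = ("google_answer", "google_product", "non_google", "ad", "amp")


def area_share(elements, page_width, page_height):
    """Return each category's fraction of the total page area.

    `elements` is an iterable of (category, rect) pairs, where rect is a
    dict with x, y, width, and height in pixels. Boxes are clipped to the
    page and are assumed not to overlap each other.
    """
    totals = defaultdict(float)
    for category, rect in elements:
        # Clip the box to the page so off-page pixels are not counted.
        left = max(rect["x"], 0)
        top = max(rect["y"], 0)
        right = min(rect["x"] + rect["width"], page_width)
        bottom = min(rect["y"] + rect["height"], page_height)
        totals[category] += max(right - left, 0) * max(bottom - top, 0)

    page_area = page_width * page_height
    return {category: totals[category] / page_area for category in CATEGORIES}


# Example: a 100 x 200 pixel page with three categorized elements.
example = [
    ("google_product", {"x": 0, "y": 0, "width": 100, "height": 50}),
    ("non_google", {"x": 0, "y": 50, "width": 100, "height": 100}),
    ("ad", {"x": 0, "y": 190, "width": 100, "height": 20}),  # half off-page
]
shares = area_share(example, page_width=100, page_height=200)
```

In a live pipeline, each rect could come from looking an element up by its stored XPath in the re-rendered page, for example `driver.find_element(By.XPATH, xpath).rect`. The same function can be restricted to the top of the page by passing a smaller `page_height`.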
What can others learn from this project?
Our project highlights three important lessons for other journalists:
Build your own datasets
Be mindful about what you’re counting
Be honest about your limitations
Accountability journalism depends on evidence. But when you write about private companies, as we do, information is particularly difficult to obtain. For instance, Google has no incentive to quantify how much space it devotes to its own products on the search results page. And most of the independent analysis we found relied on sparse or anecdotal data from search engine optimization consultants.
We firmly believe that building your own dataset is essential to service journalism. We hope that our multi-step data collection processes will inspire other newsrooms to devote the time, resources, and leadership support to build their own datasets.
However, even a made-to-order dataset is not necessarily story-ready. The most basic unit of what to count might not be clearly stated as a column in the dataset. Instead, we need to do what all journalists do: interviews and research. This applies not only to talking to people but also to talking to data, so that you fully understand its elements and architecture.
We were able to build our classification system by acquiring both intimate domain knowledge and an understanding of the structure of the search results page. This is how we were able to accurately count something that had never before been quantified.
Lastly, we hope that our Limitations section encourages other journalists to prioritize accuracy and honest conversations over big numbers and shocking statistics. We think it is of the utmost importance to disclose the shortcomings of projects and be precise about the claims that can be inferred from our findings. In this project, we achieved this in part by seeking feedback from computer scientists, statisticians, and industry professionals regarding the methods we were using.