Amazon’s Advantage

Country/area: United States

Organisation: The Markup

Organisation size: Small

Publication date: 14/10/2021

Credit: Leon Yin, Adrianne Jeffries, Evelyn Larrubia, Ben Tanen, Joel Eastwood, Gabriel Hongsdusit, Micha Gorelick, Jeff Crouse, Ritu Ghiya


Before joining The Markup as an investigative data journalist, Leon Yin was a research scientist at NYU’s Social Media and Political Participation Lab (now the Center for Social Media and Politics), a research affiliate at Data & Society’s Media Manipulation team, and a NASA software engineer. 

Adrianne Jeffries writes stories examining the power platforms exert and exploring the consequences of automation. She started as a tech reporter a decade ago at what was then called ReadWriteWeb and has worked at The Verge, Motherboard, and The Outline. 

Project description:

An analysis by The Markup of popular product searches revealed that despite Amazon’s repeated insistence—including to Congress—that it is a neutral marketplace, the online giant was placing products from its own brands above more-highly-rated and better-selling competitors on its website. Often, Amazon’s connection to the brands was not disclosed. We interviewed sellers who said that Amazon’s actions harmed their businesses. We also built a tool to highlight Amazon brand items for online shoppers in multiple countries.

Impact reached:

Days after we published our first story, the House Judiciary antitrust subcommittee demanded answers from Amazon CEO Andy Jassy and reminded him that lying to Congress is a crime. A representative of the U.K. Competition and Markets Authority emailed The Markup the week after publication saying they’d read the investigation “with great interest,” as the findings are directly in line with its areas of inquiry.

Our browser extension is helping more than 5,000 shoppers spot Amazon brands and exclusive products in the United States, France, Canada, Spain, India, Germany, Japan, and Mexico.

Techniques/technologies used:

We tackled this algorithmic audit in multiple creative ways.

First, to create a viable, defensible sample, we focused on popular searches. After various iterations, we hit on a combination of Amazon’s lists for sellers and search suggestions from online marketplaces’ autocomplete functions.

Next, to identify Amazon brands (there is no existing list), we started with Amazon’s “our brands” filter, using the company’s definition: Amazon brand items and products from other brands that are exclusive to the site. But the filter is only available on some searches, and its results aren’t complete. So we also looked for disclosures of Amazon’s affiliation in product listings, built a list of proprietary electronics, and scoured trademark databases. 

Once we’d collected the data and categorized the products, we used machine learning to analyze the results by building a model that determined which of several factors—including ratings and a proxy for sales—was most important in Amazon’s placement of a product on top or in the second spot. The model revealed that being an Amazon brand was by far the most important. 

We also noticed that the company failed to disclose many items as being from an Amazon brand. So we commissioned a national survey of 1,000 U.S. adults to see if they could identify Amazon’s top-selling brands and to ask what attributes they presume Amazon uses to pick top search results. Only 7 percent of respondents recognized Amazon’s best-selling brands, apart from Whole Foods and Amazon Basics, and some even missed those.

Finally, we built a browser extension that is both a public service and experiential journalism. Brand Detector highlights Amazon products for shoppers in several countries. Because Amazon regularly changes its site, breaking the extension, we monitor changes to the site and update the software continually.

What was the hardest part of this project?

Beyond the typical challenges of sample selection and categorization, the hardest part of investigating Amazon’s search results was deciding how to analyze the data. We could see that Amazon had given itself a disproportionate share of the top results, but what was the appropriate amount?

After consulting with statisticians and machine learning experts, we decided to take a modeling approach to analyze the correlation between each of the factors to which we had access and the outcome. 

Determining what outcome to model was not obvious. We tried to predict the rank of a single product based on the number of reviews and stars. However, this prediction task omits competition among multiple products on the same page, which is exactly what we wanted to investigate. We needed to find a way to model the outcome for several products from the same search, and figure out which one Amazon placed higher. 

Ultimately, we overcame this obstacle by transforming the data so that each row captured a comparison between two different products from the same search.
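A minimal sketch of that pairwise transformation might look like the following. The column names and example values here are illustrative assumptions, not The Markup's actual schema:

```python
import pandas as pd
from itertools import combinations

# Hypothetical per-product data for a single search results page.
products = pd.DataFrame({
    "asin": ["A", "B", "C"],
    "rank": [1, 2, 3],              # position on the page (1 = top)
    "stars": [4.2, 4.7, 4.5],
    "reviews": [310, 5200, 880],
    "is_amazon_brand": [True, False, False],
})

def to_pairs(df):
    """Turn one page of results into pairwise comparisons.

    Each output row compares two products from the same search;
    the label records whether the first was placed above the second.
    """
    rows = []
    for (_, a), (_, b) in combinations(df.iterrows(), 2):
        rows.append({
            "stars_diff": a["stars"] - b["stars"],
            "reviews_diff": a["reviews"] - b["reviews"],
            "brand_diff": int(a["is_amazon_brand"]) - int(b["is_amazon_brand"]),
            "first_ranked_higher": a["rank"] < b["rank"],
        })
    return pd.DataFrame(rows)

pairs = to_pairs(products)
```

Framing the outcome as "which of these two products did Amazon place higher?" keeps the competition between products on the same page inside the model, which a product-by-product rank prediction would lose.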

Then we had to decide which type of supervised machine learning algorithm to use to predict outcomes. Our dataset contained mixed data types, meaning each row held both continuous values and Boolean values. This ruled out many tried-and-true models like linear regression or lasso but made a strong case for decision trees. To avoid overfitting, we used a random forest model, which uses hundreds of decision trees trained on random subsets of the data.
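With scikit-learn, such a model is a few lines. This sketch uses synthetic pairwise data in which brand status dominates placement by construction, so the coefficients, feature names, and data are all illustrative rather than The Markup's actual inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic pairwise comparisons: differences in stars, review counts,
# and Amazon-brand status between two products from the same search.
stars_diff = rng.normal(0, 0.5, n)
reviews_diff = rng.normal(0, 1000, n)
brand_diff = rng.integers(-1, 2, n)  # -1, 0, or 1

# Simulated outcome: brand status dominates placement, plus noise.
logits = 3.0 * brand_diff + 0.2 * stars_diff + 0.0002 * reviews_diff
y = (logits + rng.normal(0, 1, n)) > 0

X = np.column_stack([stars_diff, reviews_diff, brand_diff])
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

for name, imp in zip(["stars_diff", "reviews_diff", "brand_diff"],
                     model.feature_importances_):
    print(f"{name}: {imp:.2f}")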

Finally, to test the accuracy of our model, we conducted an ablation study, which involved training and evaluating the model with every possible permutation of variables. This allowed us to conclude that just knowing which product was an Amazon brand or exclusive drove the accuracy of our model.
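An ablation study of this kind can be sketched as a loop over every non-empty subset of features, retraining and scoring the model on each. Again the data here are synthetic and the feature names are assumptions for illustration:

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 2000
features = {
    "stars_diff": rng.normal(0, 0.5, n),
    "reviews_diff": rng.normal(0, 1000, n),
    "brand_diff": rng.integers(-1, 2, n).astype(float),
}
# Simulated outcome dominated by Amazon-brand status.
y = (3.0 * features["brand_diff"] + 0.2 * features["stars_diff"]
     + rng.normal(0, 1, n)) > 0

# Train and score the model on every non-empty subset of features.
accs = {}
names = list(features)
for k in range(1, len(names) + 1):
    for subset in combinations(names, k):
        X = np.column_stack([features[f] for f in subset])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        accs[subset] = cross_val_score(clf, X, y, cv=3).mean()
        print(f"{subset}: {accs[subset]:.2f}")
```

If dropping the brand feature collapses accuracy to near chance while dropping anything else barely moves it, that feature is what drives the model, which is the shape of the conclusion described above.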

What can others learn from this project?

One of the largest challenges technology journalists face in watchdogging these major platforms is an uneven balance of information. Unlike police, schools, or government bodies, these private companies are under no obligation to share data with the public.

One solution is to use technology to try to create transparency where none exists. And that’s what we did in this project. We used a creative mix of public-facing information from the Amazon site, its metadata, some information made available to sellers, trademark records, and even a survey to investigate two questions: Does Amazon place its brands first? And are buyers aware that they are being driven toward an Amazon brand?

Often, we investigate companies’ use of machine learning, typically finding harms. But in this case, we used machine learning ourselves as a journalistic tool, allowing us to draw conclusions that are vital for the public to understand how this massive company is using its power to its benefit. 

Journalists can also learn to use technology like browser extensions as a form of storytelling and a public service. This allows people to experience reporting outside of a news article during their everyday lives browsing the web, which we believe extends the reporting and is a modern iteration of the journalistic axiom “show, don’t tell.”

Project links: