Which French media websites use Google Analytics?
Entry type: Single project
Publishing organisation: Groupe mind, mind Media
Organisation size: Small
Publication date: 2022-11-25
Authors: Sara Chaouki, Paul Roy, Aymeric Marolleau
Sara Chaouki has been a data journalist at Groupe mind since September 2020. She holds a degree in science journalism from the Université de Paris and a degree in applied mathematics from ENSEEIHT.
Paul Roy has been a journalist at mind Media, covering media and adtech, since 2019.
Aymeric Marolleau has led the data journalism team at Groupe mind since early 2019. From 2015 to 2018, he was a journalist at mind Media specializing in media and adtech. He has been a business journalist for about ten years.
The current configuration of Google Analytics, the web traffic measurement tool, was officially deemed contrary to the GDPR by the French data protection authority (CNIL) in early 2022. In the fall of 2022, Groupe mind’s data journalism unit and mind Media’s editorial staff set out to measure how widely Google’s tool is still used by French media and e-commerce publishers.
To do this, we developed a crawler that automatically visits more than 450 home pages, clicks the “Accept” button of the consent management platform (CMP), and then inspects the HTTP requests to check whether the call corresponding to each measurement tool has been triggered.
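The final checking step can be illustrated with a minimal Python sketch. It assumes the crawler has already recorded the URLs requested by a page after consent was given, and it matches their hostnames against a table of known tool domains. The table `TOOL_HOSTS`, the placeholder entry "Example Tool" and the function names are hypothetical and only serve the illustration; they are not the project's actual code.

```python
from urllib.parse import urlsplit

# Hypothetical signature table: tool name -> host suffixes that identify its calls.
# Only the Google Analytics hosts come from the article; the second entry is a placeholder.
TOOL_HOSTS = {
    "Google Analytics": ("google-analytics.com", "analytics.google.com"),
    "Example Tool": ("tracker.example",),
}


def host_matches(host: str, suffix: str) -> bool:
    """True if host is the suffix itself or one of its subdomains."""
    return host == suffix or host.endswith("." + suffix)


def detect_tools(request_urls):
    """Return the set of tools whose hosts appear in the recorded request URLs."""
    found = set()
    for url in request_urls:
        host = (urlsplit(url).hostname or "").lower()
        for tool, suffixes in TOOL_HOSTS.items():
            if any(host_matches(host, s) for s in suffixes):
                found.add(tool)
    return found
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives when a tool's domain merely appears inside a query string or as part of an unrelated domain name.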
We were able to show that among the sites using Google Analytics, 45.4% use it in its “Universal Analytics” (UA) configuration, which is considered non-compliant by the CNIL. 9.6% use it only in its Google Analytics 4 version, which has not yet been approved by the CNIL. 45% combine both versions of the tool.
This work, which has been widely consulted and praised by digital professionals in France, has shed light on the market’s practices and helped raise awareness of the issues.
The presence of the “analytics.js” tag alone, however, does not mean that the tool is active. The site may have stopped using it without removing the tag from its pages. To make sure it is active, we looked for “collect” in the URLs of Google Analytics requests (google-analytics.com, analytics.google.com, region1.google-analytics.com for Google Analytics 4, etc.). When “collect?v=1” appears, the site uses Universal Analytics.
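A minimal Python sketch of this check follows. The rule that “collect?v=1” signals Universal Analytics comes from the method described above; treating “v=2” as the Google Analytics 4 marker is an assumption of this sketch, as are the function names and the category labels. It also only reads the `v` parameter from the URL, so a hit whose payload is sent in the request body would be missed.

```python
from urllib.parse import urlsplit, parse_qs

# Host suffixes named in the article for Google Analytics requests.
GA_HOSTS = ("google-analytics.com", "analytics.google.com")


def classify_ga_request(url):
    """Return "UA", "GA4" or None for a single request URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not any(host == h or host.endswith("." + h) for h in GA_HOSTS):
        return None
    # Only "collect" calls prove that data is actually being sent.
    if not parts.path.rstrip("/").endswith("collect"):
        return None
    version = parse_qs(parts.query).get("v", [""])[0]
    if version == "1":
        return "UA"   # rule stated in the article
    if version == "2":
        return "GA4"  # assumption of this sketch
    return None


def classify_site(request_urls):
    """Summarise which Google Analytics versions a site actively calls."""
    versions = {classify_ga_request(u) for u in request_urls} - {None}
    if versions == {"UA", "GA4"}:
        return "UA and GA4"
    if versions == {"UA"}:
        return "UA only"
    if versions == {"GA4"}:
        return "GA4 only"
    return "no active Google Analytics call"
```

A page that loads “analytics.js” but never issues a “collect” call ends up in the last category, which reflects the distinction between a tag that is present and a tool that is active.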
Several biases are possible. When sites opt for a server-side solution, when they use proxying (a proxy server that avoids any direct contact between the user’s terminal and the servers of the measurement tool), or when they delegate a subdomain to a third party via a redirection (“CNAME cloaking”), identifying a solution may be more complex, and it may have escaped us. In addition, some sites implement specific configurations of their tools that make their solution undetectable to our method, which is based on network requests.
Context about the project:
In opaque sectors (finance, health, digital advertising…), where our journalists sometimes find it difficult to get around the communication wall put up by the companies they are investigating, we specialize in identifying the relationships between technology companies and their clients (media publishers, banks, retailers, e-commerce sites, pharmaceutical laboratories, etc.), thanks to the digital traces that each of them leaves on the Internet (cookies, HTTP requests, SDKs, ads.txt and sellers.json files, etc.).
What can other journalists learn from this project?
This project can help other journalists to take advantage of the information available in the HTTP requests of a website.