‘The UK construction companies in breach of the Modern Slavery Act’ is a long-form investigation that explores how many UK construction companies fail to comply with the Act by not publishing a Modern Slavery Statement. The investigation includes a case study of one of the major hot spots of modern slavery in the construction sector: Indian brick kilns. The story also examines the poor quality of the statements that are published, and the failures of the Act to hold companies to account.
In response to a Freedom of Information Request, the Home Office acknowledged that their approach to organisations identified as not having published a statement was under active review and that they were also considering potential future changes to section 54 of the Modern Slavery Act 2015. The investigation was picked up by NGOs such as Anti-Slavery International, Hope for Justice UK, Slave-Free Alliance, and Sustain Worldwide, among others.
For this investigation, I gathered original data using automation and web scraping (Python), data analysis (R) and Freedom of Information Requests.
I wrote code from scratch to find and scrape the websites of these companies (over 1400 websites), to check whether they had published the statement as required by law and, if they had, to extract it and store it in a dataset.
The main problem was that I only had the names of the companies, not their websites or any other data.
1- First, I needed to find their websites.
2- Then, if found, scrape the statements from their websites and store them.
1- Automation combined with web scraping using Python (the code is on GitHub: https://github.com/Liana1708/Web_scraping_companies)
2- Data cleaning and analysis of the scraper results using R, as well as data preparation for a second scraper.
The scraper imported a list with the names of the companies. Using Selenium, the loop iterated through each one of the company names, automating the following tasks:
1- Type the name of the company in the Google search box.
2- Look for the company’s website and click on it.
3- Inside the company’s website, look for the Modern Slavery Statement and, if found, return the link. If not found, print ‘No document found’.
4- Return a list of company names, websites accessed and links to the statement, if found.
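Steps 3 and 4 can be illustrated with a minimal sketch. This is not the code from the repository: it uses Python's standard-library `html.parser` on HTML that has already been fetched, instead of Selenium, so that it runs without a browser. The keyword list, the class name `StatementLinkFinder` and the function name `find_statement` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keywords; real statements appear under many different link names.
KEYWORDS = ("modern slavery", "slavery statement", "human trafficking")


class StatementLinkFinder(HTMLParser):
    """Collects links whose visible text or href mentions a keyword."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.matches = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join(self._text).lower()
            # Treat hyphens and underscores in the URL as spaces before matching.
            target = self._href.lower().replace("-", " ").replace("_", " ")
            if any(k in label or k in target for k in KEYWORDS):
                self.matches.append(urljoin(self.base_url, self._href))
            self._href = None


def find_statement(company, website, html):
    """Return one result row: company name, website accessed, statement link."""
    finder = StatementLinkFinder(website)
    finder.feed(html)
    link = finder.matches[0] if finder.matches else "No document found"
    return {"company": company, "website": website, "statement": link}
```

Matching on both the link text and the URL is a deliberate choice here, because statements sit under links with different names. A keyword match of this kind can still miss a statement or pick the wrong page, so the results need checking by hand.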
After the data cleaning and analysis in R and running a second scraper, I built a dataset with the links obtained.
What was the hardest part of this project?
I had thousands of textual sources (modern slavery statements), distributed across different platforms (two statement repositories and over 1400 company websites), in inconsistent ways (different parts of the website, under links with different names, etc.). The automation process was not completely effective or reliable. On many occasions it failed to find the company’s website and scraped the wrong site instead; on others, the statement was not where the law requires it to be (in a prominent place on the page), and the scraper failed to extract it.
Moreover, although companies are required by law to publish the statement on their website, there are two other repositories for these statements, TISC Report and Modern Slavery Registry, and many companies publish it in one of these repositories rather than on their own website. To state categorically that a company had not published the statement, it was necessary to check that it did not appear in any of the three places: the company’s own website and the two repositories.
In addition to the technical difficulties, there were legal limitations, such as the prohibition on scraping the TISC Report website. I first asked the organisation directly for the data, without success. The only remaining option was to manually enter the name of each of the more than 1400 companies in both repositories and build a database with the results for each company. Finally, I cross-checked both datasets (the results of the web scraper and the results of the manual search of both repositories) and obtained the final list of construction companies in breach of the Act.
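The cross-check logic can be sketched as follows: a company counts as in breach only if no statement was found on its own website or in either repository. This is an illustrative sketch in Python, not the analysis actually used for the investigation. The function names, the name-normalisation rule and the example company names are all assumptions.

```python
def normalise(name):
    """Lower-case and drop a few common suffixes so names match across sources.

    The suffix list is a hypothetical simplification, not the real matching rule.
    """
    cleaned = name.lower().replace(".", "").replace(",", "")
    words = [w for w in cleaned.split() if w not in {"ltd", "limited", "plc"}]
    return " ".join(words)


def companies_in_breach(all_companies, scraper_hits, repository_hits):
    """Return companies with no statement on their own site or in any repository.

    all_companies:   every company name under investigation
    scraper_hits:    names for which the scraper found a statement link
    repository_hits: mapping of repository name -> names found by manual search
    """
    found = {normalise(n) for n in scraper_hits}
    for hits in repository_hits.values():
        found |= {normalise(n) for n in hits}
    return [c for c in all_companies if normalise(c) not in found]


# Invented example data, for illustration only.
companies = ["Alpha Build Ltd", "Beta Homes PLC", "Gamma Civils Limited"]
on_own_site = ["Alpha Build Limited"]
in_repositories = {
    "TISC Report": ["beta homes plc"],
    "Modern Slavery Registry": [],
}
print(companies_in_breach(companies, on_own_site, in_repositories))
```

Normalising names before comparing them matters because the same company can be written differently in each source. Any company left on the list would still need a manual check before being named.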
What can others learn from this project?
Although saving time is a major advantage (the scraper performed in hours a task that would have taken weeks), the success of the automation process was partial. When faced with unstructured data, technical difficulties can have a major impact on the investigation, producing inaccurate and unreliable data that needs to be confirmed and complemented by a human. Without manual searching and cross-checking, the data would have been unreliable. For some years now, there has been a debate in the industry about whether AI tools could jeopardise the future of the profession by replacing journalists with news generated entirely by AI. However, I believe that, particularly in investigative journalism, traditional reporting to obtain data as a complement to automated processes will continue to be fundamental, as will the role of the journalist as an interpreter of the information once it has been obtained.
I conducted all the interviews from the UK using Zoom, adapting to the lockdown restrictions. The interviews with the workers rescued from bonded labour in India were the most difficult to coordinate and carry out. Firstly, because of language barriers: they spoke only Hindi, so someone else had to translate from Hindi to English. Also because of physical and technological obstacles: amid the pandemic and lockdown in India, the workers lived in remote villages and did not have smartphones. Collaboration with a local NGO that works directly with survivors was key. They contacted the workers and provided them with smartphones in offices near their villages to facilitate the interviews. They also acted as translators.