The U.S. health authority, the National Institutes of Health (NIH), archives clinical trials conducted in the United States and 220 countries around the world on the ClinicalTrials.gov website. This represents approximately 370,000 studies and a wealth of information. ClinicalTrials.gov allows anyone to download this gold mine in the form of thousands of XML (Extensible Markup Language) files. This is what we did last March; we then searched for keywords that could indicate the presence of digital technologies in each trial: machine learning, real-world data, blockchain, clinical trial management system, e-CRF…
Pharmaceutical laboratories and healthtech start-ups are among the subscribers of our business-oriented publication, mind Health. COVID-19 has put them under exceptional pressure in the recent past, particularly through the need for quicker, safer and more efficient clinical trials to test new drugs and vaccines.
Our first aim was to objectively measure the growing influence of digital technology (AI, wearables, blockchain…) in clinical trials and to attach tangible data to this phenomenon. In doing so, we helped the health industry understand this trend better and learn how to cope with it. It also allowed us to:
move away from mere PR messaging, which is far too common in the field of innovation, and report on the real transformation of the pharmaceutical industry based on tangible data
establish a solid database on the use of digital technology in clinical trials in order to design indicators that can be updated and compared over the coming years
This three-article story, illustrated with a dozen charts, was mind Health’s most read in 2021. Each of the three parts was among our 20 most-read pieces.
Since February 2000, the U.S. health authority, the National Institutes of Health (NIH), and the U.S. National Library of Medicine (NLM) have been archiving clinical trials conducted in the United States and in 220 countries around the world on the ClinicalTrials.gov website. This represents approximately 370,000 studies, the oldest of which date back to 1931. For each study, there is a wealth of information: the subject, the sponsors, the countries where the study was conducted, the therapeutic areas concerned, etc. The information is provided and updated throughout the study by its sponsor or principal investigator.
ClinicalTrials.gov allows anyone to download this gold mine in the form of thousands of XML (Extensible Markup Language) files. We started by selecting only interventional clinical trials (observational studies were excluded from our analysis) and then searched all the information associated with each trial (title of the study, abstract, etc.) for defined keywords. For example, for trial management technologies: CTMS (clinical trial management system), e-CRF, eCOA and econsent. For telemedicine technologies: telehealth, teleconsultation, telecare, homecare and remote site monitoring. These keywords may not always be as comprehensive as we would like, and some terms may also appear in clinical trials that do not involve digital technology.
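The keyword search described above could be sketched roughly as follows. This is a minimal illustration, not our production code: the two keyword groups reuse only the examples quoted in the text, and the names `KEYWORDS`, `compile_group` and `tag_trial` are hypothetical.

```python
import re

# Keyword groups copied from the examples in the text.
# The group labels and the dictionary layout are assumptions for illustration.
KEYWORDS = {
    "trial management": [
        "CTMS", "clinical trial management system", "e-CRF", "eCOA", "econsent",
    ],
    "telemedicine": [
        "telehealth", "teleconsultation", "telecare", "homecare",
        "remote site monitoring",
    ],
}


def compile_group(terms):
    """Build one case-insensitive pattern matching any term as a whole word."""
    alternatives = "|".join(re.escape(term) for term in terms)
    # Lookarounds act as word boundaries and also cope with terms like "e-CRF".
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


PATTERNS = {group: compile_group(terms) for group, terms in KEYWORDS.items()}


def tag_trial(text):
    """Return the sorted list of technology groups mentioned in a trial's text."""
    return sorted(group for group, pattern in PATTERNS.items() if pattern.search(text))


# Hypothetical trial description, invented for the example.
tags = tag_trial("Use of an e-CRF and teleconsultation in diabetes care")
```

Matching whole words only reduces, but does not remove, the false positives mentioned above: a term can still appear in a trial that has nothing digital about it.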
We used Python to analyse the data, and Datawrapper and Flourish to visualise it.
What was the hardest part of this project?
When we decided to launch the project, the first step was to select a list of keywords that would indicate the presence of digital technologies in each of the 360,000 clinical trials we downloaded.
Afterwards, the challenge was to find a solution that would allow us to transform thousands of XML files into a single database. In Python, we created a function that transforms each XML file into a pandas row; the rows are then assembled into a single dataframe. After this, it was easier for us to select the clinical trials that matched our criteria and the fields that we wanted to analyze.
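For readers who want to try this, here is a minimal sketch of that kind of function. It is not our original code. The tag paths in `FIELDS` are assumptions about the layout of the downloaded XML and should be checked against real files. The sample record and the names `trial_to_row` and `build_dataframe` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Column name -> path of the XML element holding the value.
# These paths are assumptions; adjust them to the actual files.
FIELDS = {
    "nct_id": "id_info/nct_id",
    "title": "brief_title",
    "study_type": "study_type",
    "summary": "brief_summary/textblock",
    "sponsor": "sponsors/lead_sponsor/agency",
}


def trial_to_row(xml_text):
    """Flatten one trial's XML into a dict that can serve as one dataframe row."""
    root = ET.fromstring(xml_text)
    row = {}
    for column, path in FIELDS.items():
        value = root.findtext(path)  # None when the element is missing
        row[column] = value.strip() if value else None
    return row


def build_dataframe(xml_texts):
    """Assemble one row per trial into a single pandas DataFrame."""
    # Imported here so that the parsing helper above works without pandas.
    import pandas as pd

    return pd.DataFrame([trial_to_row(text) for text in xml_texts])


# Invented sample record, not a real trial.
SAMPLE = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Hypothetical telehealth follow-up study</brief_title>
  <study_type>Interventional</study_type>
  <brief_summary><textblock>
    Patients are followed by teleconsultation.
  </textblock></brief_summary>
</clinical_study>"""

row = trial_to_row(SAMPLE)
```

Calling `build_dataframe` on the text of every downloaded file would give one row per trial. Missing fields become empty cells rather than errors, which matters because sponsors do not always fill in every field. Filtering on a column such as `study_type` then keeps only interventional trials.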
The clinical trial data is filled in by the sponsor, so it was sometimes incomplete or entered in inconsistent ways. We had to select, clean and standardize it in order to get the best insights from the analysis.
Finally, we had to find out which clinical trials were worth highlighting for each digital technology. Part of the project was dedicated to background research in order to select the most innovative and impactful clinical trials of the last few years.
What can others learn from this project?
Stories about clinical trial results can be exciting for journalists and audiences. However, looking at the big picture allows us to track the development and evolution of the pharma industry.
The ClinicalTrials.gov database contains huge amounts of data on all areas of clinical research that are worth exploring. The fact that they are stored in nearly 400,000 XML files should not be a barrier, since there are tools and methods to facilitate their analysis on a meta-level. Other journalists could build on this project to learn how to work with a trove of XML files and analyse the data with Python.