We created an online tool that lets people enter pre-existing conditions and see the estimated risk of hospitalisation and death for people with those conditions, should they get covid-19. Each time someone uses the tool, a machine learning model running in the cloud generates a prediction. The tool, and the data used to construct it, were also the basis of an investigation into covid-19 illness.
Beyond the related publications and the hundreds of thousands of people who have used the online tool, the project is currently used by academic researchers at Brown, Harvard, New York and other universities, as well as by the American Centers for Disease Control and Prevention (CDC). Researchers at NYU Langone are working with one of our journalists, using the tool to explore how to optimise the allocation of booster doses of vaccine. They have also checked it against their internal data, with good results. These researchers turn to us because, as far as they know, no better tool is available.
The tools used were:
1. The Covid-19 Research Database, an online repository of medical information created by American healthcare companies
2. Python implementations of xgboost, numpy, and other libraries
3. Cloud services, as well as graphics and code to interactively present the information
What was the hardest part of this project?
The hardest parts were, first, getting access to the data, which were very tightly secured, and second, presenting them fairly and in a way that would not be misinterpreted.
To give an example of the former: copying anything, including text, onto the server containing the data was impossible. That meant manually typing in all code, character by character. The only exceptions were Python itself and the xgboost, pandas, and numpy libraries, which fortunately had been uploaded to the server in advance. Exporting models was a weeks-long process.
With regard to presentation, we wanted to be careful. We therefore asked a range of academics how best to use and present the data, and they provided tips and guidelines. Their advice was very useful, but getting it all right required meticulous work.
Designing the interface was challenging. We had to be mindful of numerous implementation details (for example, if a query includes hyperlipidemia, we also include metabolic conditions) and present risk in a way that was clear to the general public.
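To illustrate the kind of implementation detail mentioned above, here is a minimal sketch of how a query could be expanded so that a specific condition also switches on the broader category it belongs to. Only the hyperlipidemia and metabolic conditions pairing comes from our description; the names IMPLIED_CATEGORIES and expand_conditions are hypothetical, and the real tool may handle this differently.

```python
# Hypothetical lookup: a specific condition -> broader categories it implies.
# Only the hyperlipidemia -> metabolic conditions pair is taken from the text.
IMPLIED_CATEGORIES = {
    "hyperlipidemia": {"metabolic conditions"},
}


def expand_conditions(selected):
    """Return the selected conditions plus any broader categories they imply."""
    expanded = set(selected)
    for condition in selected:
        # Conditions without an entry imply nothing extra.
        expanded |= IMPLIED_CATEGORIES.get(condition, set())
    return expanded


query = expand_conditions(["hyperlipidemia"])
print(sorted(query))
```

Keeping the rule in a lookup table rather than in the interface code would make it easier to review each implied category with outside experts.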
What can others learn from this project?
We think other journalists may learn that it is possible to run ambitious machine learning models on demand in the cloud, and that even personal medical data can be used journalistically, provided appropriate steps are taken and the data are handled with care. Setting up the infrastructure to generate predictions for billions of different combinations of conditions, age, and gender was hard, but we have created what we think is a valuable journalistic product on a par with the best research.
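As a rough illustration of how the number of possible queries reaches the billions, the sketch below counts combinations when each condition is either present or absent. The figures used (24 conditions, 100 ages, 2 genders) are assumptions chosen for the arithmetic, not the tool's actual inputs, and count_queries is a hypothetical helper.

```python
def count_queries(n_conditions, n_ages, n_genders):
    """Count distinct queries when each condition is either present or absent."""
    return (2 ** n_conditions) * n_ages * n_genders


# Illustrative numbers only; the real tool's inputs may differ.
total = count_queries(n_conditions=24, n_ages=100, n_genders=2)
print(f"{total:,}")  # prints 3,355,443,200
```

Because every extra yes-or-no condition doubles the total, the count climbs quickly, which is one reason to generate predictions when they are requested.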