COVID-19: Doing Good with Data + AI


By David Engler | March 26, 2020

During times of severe societal strain, individuals have historically shown an inclination to offer aid and assistance. Often these sacrifices have come at great cost to life or livelihood. In other cases, the efforts have seemed more mundane but were nevertheless essential. The efforts of the more than 10,000 women code breakers of World War II are one such example. From 1941 to 1945, these women, recruited for their math, science and foreign language abilities, worked tirelessly to break down and understand constantly mutating code systems. On any given day, a single individual’s efforts likely seemed minor. But in the collective, the results were substantial. At the conclusion of the war, Major General Chamberlin noted that these efforts “saved us many thousands of lives” and “shortened the war by no less than two years.” As data scientists, we currently have the ability, each in our own small way, to contribute significantly to a contemporary battle: understanding and preventing the spread of COVID-19.

It seems clear that our most productive work on this topic will be done in coordination with healthcare facilities and researchers. Just as the work of the WWII code breakers was collaborative and coordinated, so too should our efforts be coordinated with those on the medical front line. That said, there are a growing number of opportunities for interested data scientists. These include:

Moreover, there is a growing number of open-source data sets available for those willing to contribute to the effort. In our own efforts, for example, we have made use of the following data:

  • There is the popular 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE, which contains confirmed, recovered and deceased cases of COVID-19 around the world. For the USA, it also provides some of this information at the state and county level. The same information can also be retrieved from the following source in different formats. Bear in mind that the data sets are not perfect. They contain inaccuracies and duplicated entries, but they should provide a good basis for getting a reasonable understanding of how the virus spreads around the globe.
  • The following website has information regarding total beds and ICU units from multiple hospitals across the USA. It also estimates their current capacity. Similar information can also be retrieved from the following online spreadsheet.
  • The COVID Tracking Project has information regarding COVID-19 tests for multiple states in the USA, along with a breakdown of whether they were positive or negative.
  • For comparing the average length of hospital stay for COVID-19 against that of other diseases, the OECD website contains very useful information for multiple diseases and for many countries.
  • Hospital admission rates for the USA can be retrieved from here. For state level hospital admission rates, there is a breakdown here.
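Because these data sets contain duplicated entries and occasional inaccuracies, some light cleaning is usually needed before modeling. The following is a minimal sketch in plain Python, not the cleaning procedure used by any of the sources above. The field names (`region`, `date`, `confirmed`) are hypothetical, and keeping the largest count among duplicates is an assumption that suits cumulative totals.

```python
def deduplicate_cases(rows):
    """Keep one record per (region, date).

    Assumption: when duplicates disagree, the largest cumulative
    'confirmed' value is the most recent correction, so it wins.
    Field names are hypothetical, not those of any specific source.
    """
    best = {}
    for row in rows:
        # Normalize the region name so "Utah" and "utah " collide.
        key = (row["region"].strip().lower(), row["date"])
        if key not in best or row["confirmed"] > best[key]["confirmed"]:
            best[key] = row
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return [best[key] for key in sorted(best)]


def enforce_monotonic(counts):
    """Cumulative counts should never decrease; carry the running maximum forward."""
    cleaned, running = [], 0
    for count in counts:
        running = max(running, count)
        cleaned.append(running)
    return cleaned
```

Whether a dip in a cumulative series is a reporting error or a genuine correction cannot be decided from the numbers alone, so a step like `enforce_monotonic` is best treated as a stopgap.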

Using these (and other such) data, it is possible to construct time series models that predict future cases of COVID-19 for different geographic regions, forecast hospital admissions, and assess when maximum capacity will be reached for a given region.
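As a point of reference for such time series models, one simple baseline is a log-linear fit that assumes cases grow exponentially over a short window. The sketch below is only that baseline, written in plain Python. It is not the model behind the dashboard described in this post, and the function names are our own.

```python
import math


def fit_exponential(counts):
    """Least-squares fit of log(count) = a + b * day; returns (a, b)."""
    points = [(day, math.log(c)) for day, c in enumerate(counts) if c > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive observations")
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_t
    return a, b


def forecast(counts, days_ahead):
    """Extrapolate the fitted curve for the days following the observed series."""
    a, b = fit_exponential(counts)
    start = len(counts)
    return [math.exp(a + b * (start + k)) for k in range(days_ahead)]


def doubling_time(counts):
    """Days for the fitted curve to double; infinite if growth is not positive."""
    _, b = fit_exponential(counts)
    return math.log(2) / b if b > 0 else math.inf
```

Exponential extrapolation overshoots quickly once growth slows, so a baseline like this is only reasonable a few days ahead. Compartmental models such as SEIR are one way to capture the slowdown.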

For example, consider the following SEIR (Susceptible-Exposed-Infected-Resistant) dashboarding application developed with H2O Q and H2O Driverless AI, which is automatically updated as new daily data is made available.

The application first takes as input (in addition to the available data) selected hospital and demographic input for a given hospital system. Then, using the selected parameters, new cases can be forecast for a given region with daily updates:

[Image: dashboard screenshot]
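For readers unfamiliar with SEIR models, the following is a minimal sketch of the standard SEIR equations integrated with a forward-Euler step. It illustrates the general technique only and is not the code behind the application above. The parameter names (`beta` for the transmission rate, `sigma` for the rate of leaving the exposed state, `gamma` for the rate of leaving the infected state) follow common textbook notation, and any values passed in are the user's assumptions.

```python
def simulate_seir(population, initial_infected, beta, sigma, gamma,
                  days, steps_per_day=10):
    """Forward-Euler integration of the standard SEIR equations.

    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    dt = 1.0 / steps_per_day
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # Flows between compartments during one small time step.
            new_exposed = beta * s * i / population * dt
            new_infected = sigma * e * dt
            new_resistant = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infected
            i += new_infected - new_resistant
            r += new_resistant
        history.append((s, e, i, r))
    return history


# Illustrative values only, not fitted estimates for any region.
trajectory = simulate_seir(population=1_000_000, initial_infected=10,
                           beta=0.5, sigma=1 / 5, gamma=1 / 7, days=120)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][2])
```

In practice the rates would be estimated from the daily data for each region, and a dedicated ODE solver would usually replace the fixed-step loop.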

Second, using publicly available hospital bed data for a given region, capacity can be assessed for both overall hospital bed usage and ICU bed usage:

[Images: dashboard screenshots]
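A capacity assessment of this kind can be sketched as a comparison between projected bed demand and the number of available beds. The example below is a simplified illustration under stated assumptions, not the dashboard's logic. It assumes that a fixed fraction of currently infected people need a hospital bed (`hospitalization_rate`) and that a fixed share of those need an ICU bed (`icu_share`). Both parameters are hypothetical inputs.

```python
def bed_demand(infected_per_day, hospitalization_rate, icu_share):
    """Split projected infections into (hospital beds, ICU beds) needed per day.

    Assumes fixed fractions; real demand also depends on length of stay
    and on the age mix of the population.
    """
    beds = [n * hospitalization_rate for n in infected_per_day]
    icu = [b * icu_share for b in beds]
    return beds, icu


def first_day_over_capacity(daily_demand, capacity):
    """Return the index of the first day demand exceeds capacity, or None."""
    for day, demand in enumerate(daily_demand):
        if demand > capacity:
            return day
    return None


# Illustrative numbers only.
beds, icu = bed_demand([100, 400, 1600, 6400],
                       hospitalization_rate=0.1, icu_share=0.25)
overflow_day = first_day_over_capacity(beds, capacity=150)
```

Returning `None` when capacity is never exceeded keeps the "no overflow in the forecast window" case distinct from an overflow on day 0.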

Then, based on the latest data, flags and warnings can be designed and implemented.

[Image: dashboard screenshot]
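One simple way to implement such flags is a threshold rule on bed occupancy. The sketch below is illustrative only. The 80% and 95% thresholds and the label names are hypothetical defaults, not values taken from the application, and should be set together with the hospital system.

```python
def occupancy_flag(beds_in_use, total_beds, warn_at=0.80, critical_at=0.95):
    """Map bed occupancy to a status label using hypothetical thresholds."""
    if total_beds <= 0:
        raise ValueError("total_beds must be positive")
    occupancy = beds_in_use / total_beds
    if occupancy >= critical_at:
        return "CRITICAL"
    if occupancy >= warn_at:
        return "WARNING"
    return "OK"


# Example: check hospital and ICU beds separately for one region.
status = {
    "hospital": occupancy_flag(beds_in_use=850, total_beds=1000),
    "icu": occupancy_flag(beds_in_use=40, total_beds=100),
}
```

Applying the same rule to forecast occupancy, not only to current occupancy, would turn the flag into an early warning.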

Other simple but useful applications are also possible. In some areas, substantial progress has already been made. Image processing, for example, has been found to be useful in the effective diagnosis of COVID-19. Likewise, using EHR (electronic health record) data, it is possible to identify variables associated with severe complications. Currently, a number of pharmaceutical research firms are using AI for COVID-19 drug development. Further applications might include assessing the impact of the virus on economic indicators and/or understanding the impact of weather on the spread of COVID-19.

In the end, it seems fruitful to explore areas of application where data science can contribute to the efforts to understand and combat COVID-19. Our hope is that, by joining forces, data scientists and medical practitioners can make effective and significant progress in these efforts.



David Engler

David Engler is a Senior Data Scientist and the Director of Customer Success at H2O. He has 15 years of experience leading data science teams in healthcare research and analytics and has over 20 publications in medical analytics as a primary author. He most recently built and led the analytics team for healthcare strategy at the University of Utah hospitals and clinics. David obtained his PhD in Biostatistics from Harvard University.


Marios Michailidis

Marios Michailidis is a competitive data scientist and a Kaggle Grandmaster (ex World #1 out of 500,000 members). He holds a BSc in Accounting and Finance from the University of Macedonia in Greece and an MSc in Risk Management from the University of Southampton. He obtained his PhD in machine learning at University College London (UCL) with a focus on ensemble modelling. He has worked in both the marketing and credit sectors in the UK market and has led many analytics projects with various themes, including acquisition, retention, recommenders, uplift, fraud detection, portfolio optimisation and more. He is the creator of KazAnova, a project made in Java for quick credit scoring, and of the StackNet Meta-Modelling Framework. Marios’ LinkedIn profile can be found here with more information about what he is working on now or past projects.