Caution Is Required When Clinically Implementing AI Models: What the COVID-19 Pandemic Taught Us About Regulation and Validation

By Keerthi B. Harish and Yindalon Aphinyanaphongs



Harish K, Aphinyanaphongs Y. Caution is required when clinically implementing AI models: what the COVID-19 pandemic taught us about regulation and validation. HPHR. 2021;45.




The novelty of COVID-19 ushered in an expansion of artificial intelligence models designed to close clinical knowledge gaps, especially with regard to diagnosis and prognostication. These models emerged within a unique regulatory context that largely defers governance of clinical decision support tools. As such, we raise three concerns about the implementation of clinical artificial intelligence models, using COVID-19 as an important case study. First, flawed data underlying model development leads to flawed clinical resources. Second, models developed within one slice of geographic space and time generalize poorly across clinical environments. Third, failure to implement ongoing local monitoring leads to diminishing utility as diseases and affected populations inevitably change. Experience with this pandemic informs our assertion that machine learning models should be robustly vetted by facilities using local data to ensure that emerging technology does patients more good than harm.

Over the course of 2020, artificial intelligence researchers published at least 117 peer-reviewed papers.1 Each showcased a new clinically oriented machine learning model performing tasks such as estimating the pre-test probability of SARS-CoV-2 infection, reading chest imaging, or predicting prognosis. As enticing as these new technologies seem, facilities should take a cautious approach to using these models in clinical practice, just as with other areas of innovation adoption in medicine. Models should be tested in each individual environment before use.


Drugs and devices undergo strict vetting processes before hospitals can utilize them. The FDA considers concrete bodies of evidence, including laboratory studies and clinical trials, to ensure safety and efficacy before permitting market entrance. In contrast, the FDA does not evaluate the majority of the clinically applicable models developed for COVID-19. They fall under a designation, clinical decision support, that exempts them from regulatory oversight.2 Clinical decision support systems serve information to a provider, who then aggregates it with other data to make a clinical decision regarding patient care. As such, the use of these models occurs in a caveat emptor environment.


This exemption does not equate to suitability for use. Specifically, given the lack of formal validation, poorly constructed machine learning tools may lead decision-making astray. While models can fail for a number of reasons, the COVID-19 pandemic poses three unique challenges.


First, models are only as good as the data used to train them. The novel nature of COVID-19, particularly for models developed earlier in the pandemic, limited the size of datasets used to train algorithms. In extreme examples, models marketed to help COVID-19 patients were trained without using COVID-19 patients at all.3 One such extreme example, the Epic Deterioration Index model, is easily accessible in our EHR and a simple addition to a user’s screen. While Epic has advertised this model for COVID-19 risk stratification, it has not made data or performance results public.4 Generalization studies show conflicting results, generating confusion as to whether, for instance, the model overestimates or underestimates risks of poor outcomes.5 The harms of suboptimal data utilization have been documented with respect to a variety of outcomes, such as the potential exacerbation of healthcare inequities.6,7


The second unique issue concerns generalizability. A model’s usefulness derives not from how well it performs on the retrospective data used to train it but rather from how well it performs on prospective data it has not seen before. If the patient data used for training look significantly different from the patient data on which the model will be used, the tool will perform worse than expected. These considerations become particularly pertinent for a pandemic that, over its course thus far, has shifted among geographic epicenters with different demographics. A single snapshot at one time point may fail to inform a subsequent stage of the pandemic. These concerns have been borne out in practice in the external validation of published models.8
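A local "silent" evaluation of this kind can be sketched in a few lines: before trusting an externally developed model clinically, score a local prospective cohort and measure discrimination against observed outcomes. The cohort, scores, and model below are simulated and hypothetical; a real check would use an institution's own data.

```python
# Hedged sketch of local pre-deployment validation. The cohort and the
# external model's risk scores are simulated, not real clinical data.
import random

def auroc(scores, outcomes):
    """Probability that a random positive case outranks a random negative one."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Simulated local prospective cohort: 1 = poor outcome observed.
random.seed(0)
outcomes = [random.randint(0, 1) for _ in range(400)]
# Simulated external-model risk scores, loosely related to outcomes.
scores = [min(1.0, 0.3 * y + 0.7 * random.random()) for y in outcomes]

print(f"Local AUROC: {auroc(scores, outcomes):.2f}")
```

An AUROC near 0.5 on local data, whatever the published figure, would argue against clinical use in that environment.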


Third, applications should include robust performance-monitoring infrastructure. Models should be interrogated continually against new incoming data to ensure they perform as expected. This process protects against model drift, the expected and well-documented decline in model quality that results from inevitable shifts in disease processes, treatments, or affected populations. Perhaps the most pressing concern in the context of COVID-19 is the emergence of novel variants. The efficacy of models should be re-evaluated in the same manner that the efficacy of vaccines and therapeutics is re-evaluated. For example, we have programming code that runs daily to validate a favorable outcome model deployed for an ongoing randomized controlled trial.9,10 Models designed for parsimonious use by physicians often do not include monitoring tools, and shifts in performance cannot easily be detected.
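The shape of such a daily monitoring job can be sketched as follows. The baseline value, alert threshold, and data feed are illustrative assumptions for exposition, not the deployed system described above.

```python
# Hedged sketch of automated drift monitoring: compare recent performance
# on newly labeled cases against a baseline fixed at deployment.
# BASELINE_AUROC and ALERT_DROP are assumed values, not real parameters.

def auroc(pairs):
    """AUROC from (risk score, observed outcome) pairs; None if one class is absent."""
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    if not pos or not neg:
        return None  # cannot evaluate without both outcome classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

BASELINE_AUROC = 0.85  # performance measured at deployment (assumed)
ALERT_DROP = 0.05      # degradation tolerated before manual review

def check_drift(recent_pairs):
    """Run daily against the newest labeled cases; flag meaningful decline."""
    current = auroc(recent_pairs)
    if current is not None and BASELINE_AUROC - current > ALERT_DROP:
        return f"ALERT: AUROC fell to {current:.2f}"
    return "OK"
```

A production version would also track calibration and subgroup performance, but even this minimal loop surfaces the drift that a static, unmonitored calculator silently accumulates.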


Our concerns are compounded by the accessibility of machine learning tools. Developers publishing clinically relevant models often reduce the number of features used and present their tools in clinician-friendly forms. Models can be found as online calculators, whether on developer-created websites or at centralized resources such as MDCalc. These forms simplify use but also remove barriers to using models that have not been vetted.


Facilities internationally have seen relatively unpredictable ebbs and surges in patient volumes.11 Should hospitals become overwhelmed, medical teams may reach for tools to help sick patients. The use of artificial intelligence to meet unmet needs in the clinical setting will only grow from this point on. The particularities of this market’s “buyer beware” environment highlight the need for hospitals to test, on their own data, any models they seek to apply clinically. Artificial intelligence holds great promise to assist patients and physicians, but only when applied carefully and thoughtfully.


  1. Raza K. Artificial Intelligence Against COVID-19: A Meta-analysis of Current Research. In: Hassanien A-E, Dey N, Elghamrawy S, eds. Big Data Analytics and Artificial Intelligence Against COVID-19: Innovation Vision and Approach. Cham: Springer International Publishing, 2020: 165–76.
  2. Clinical Decision Support Software – Draft Guidance for Industry and Food and Drug Administration Staff. US Food and Drug Administration.
  3. DeCaprio D, Gartner J, McCall CJ, et al. Building a COVID-19 vulnerability index. J Med Artif Intell 2020; 3. DOI:10.21037/jmai-20-47.
  4. Epic AI Helps Clinicians Predict When COVID-19 Patients Might Need Intensive Care. (accessed Nov 18, 2020).
  5. Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 2020; 369: m1328.
  6. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. Published online October 25, 2019.
  7. Leslie D, Mazumder A, Peppin A, Wolters MK, Hagerty A. Does “AI” stand for augmenting inequality in the era of covid-19 healthcare? BMJ. 2021;372:n304.
  8. Harish K, Zhang B, Stella P, et al. Validation of parsimonious prognostic models for patients infected with COVID-19. BMJ Health Care Inform 2021; 28: e100267.
  9. Razavian N, Major VJ, Sudarshan M, et al. A validated, real-time prediction model for favorable outcomes in hospitalized COVID-19 patients. npj Digit Med 2020; 3: 1–13.
  10. NYU Langone Health. Predicting Favorable Outcomes in Hospitalized Covid-19 Patients., 2020 (accessed Nov 17, 2020).
  11. Daily Testing Trends in the US – Johns Hopkins. Johns Hopkins Coronavirus Resour. Cent. (accessed Oct 28, 2020).

About the Authors

Keerthi B Harish, BA

Keerthi Harish is a medical student at the NYU Grossman School of Medicine whose research interests include the effective operational implementation of machine learning technologies in healthcare. He completed undergraduate education in public health at the Johns Hopkins University.

Yindalon Aphinyanaphongs, MD, PhD

Dr. Yin Aphinyanaphongs is an assistant professor in the Department of Population Health at the NYU Grossman School of Medicine whose research interests include the development and operational implementation of machine learning technologies in healthcare. He completed his medical and scientific training at Vanderbilt University.