Hospitals don’t always test AI for accuracy and bias, study says


New research published in Health Affairs indicates that hospitals often don’t assess AI models on their own patient data before deploying them, potentially putting patients at risk.

So-called local evaluation, in which a hospital checks how well a predictive model performs in its own unique setting rather than on a generic dataset, is considered essential for validating safety and effectiveness.

An analysis of 2023 American Hospital Association survey data found that 65% of nonfederal hospitals used predictive models integrated into their electronic health records (EHRs). However, only 44% performed in-house evaluations of their models for bias, while 61% conducted local evaluations for accuracy.

What’s more, hospitals with greater resources were disproportionately more likely to do testing.

For journalists, the study highlights the importance of examining how carefully individual hospitals deploy AI and whether the industry takes steps to avert a divide between rich and poor hospitals.

“We don’t want to see a situation where hospitals with lots of resources can protect their patients and make sure tools aren’t biased and are operating well, while the hospitals without those resources are struggling to keep their patients safe,” lead author Paige Nong, Ph.D., an assistant professor at the University of Minnesota School of Public Health, said in an interview. 

A halt to regulation

The Biden administration urged industry adherence to “FAVES” principles—promoting outcomes that are Fair, Appropriate, Valid, Effective, and Safe. A final rule issued last year by HHS required certified health IT developers to provide users with information to enable them to evaluate AI tools. 

However, President Donald Trump rescinded the Biden administration’s executive order regulating artificial intelligence, Bloomberg Law reported. The move halted the implementation of key safety and transparency requirements for AI developers. 

Nong and her team noted that some hospitals might forgo robust evaluation due to cost, underestimation of potential harms, lack of technical capacity, absence of financial incentives, or a perception that it’s not required.

They found:

  • Local evaluation was more common among hospitals that had developed their own predictive models, had high operating margins, and belonged to a multi-hospital health system.  
  • Critical access hospitals, other rural hospitals, and hospitals serving areas high on the Social Deprivation Index were less likely to use AI.
  • Most hospitals reported that predictive tools came from their EHR vendor (79%), followed by third parties (59%) and self-development (54%).

The industry urgently needs guidelines for how models are developed and deployed, as well as policies providing education and technical support for evaluation, to “ensure that all hospitals can use predictive models safely,” they wrote.

Nong pointed to the Office of the National Coordinator for Health Information Technology’s establishment of regional extension centers to help medical facilities implement EHRs as a potential model for how technical assistance might be delivered.

AI is already widely deployed

Hospitals have already deployed AI for a wide range of clinical and administrative tasks, the data show. (See chart.) 

Hospitals were less likely to evaluate models used for outpatient follow-up and billing than for clinical care, possibly reflecting a misperception that administrative applications present fewer risks.

Chart: Hospitals’ reported uses of machine learning and other predictive models. Source: Nong et al. (2025)

  • Predict health trajectories or risks for inpatients: 92%
  • Identify high-risk outpatients for follow-up care: 79%
  • Facilitate scheduling: 51%
  • Recommend treatments: 44%
  • Simplify or automate billing procedures: 36%
  • Monitor health: 34%
  • Other clinical use: 34%
  • Other operational process optimization: 25%

In fact, predictive tools for billing and scheduling “can end up exacerbating disparities or building barriers to care,” said Nong, whose team recommended adoption of a set of best practices for AI that does not directly guide diagnosis or treatment.

They pointed to the example of clinicians at the University of California, San Francisco Health System, who determined that an EHR tool to optimize scheduling might harm their patients by using sensitive personal characteristics such as ethnicity, financial class and body mass index to predict the probability that a patient would miss an appointment.

The software double-booked likely no-shows, which meant that if both the originally scheduled patient and the overbooked patient showed up, each would have less time with a clinician. Instead of double-booking, the health system opted to address likely no-shows with “patient positive” interventions such as flexible appointment times, telehealth visits, and assistance with transportation or childcare.

The AHA annual survey, conducted from March to August of 2023, was the first by the trade group to include questions about predictive AI. Of all U.S. hospitals, 58%, or 2,547, responded. Only nonfederal hospitals, numbering 2,425, were included in the analysis. 

Nong said her team is in the process of evaluating 2024 survey data, which debuted questions about large language model applications such as ambient scribes.

In addition to learning about the prevalence of AI for transcribing physician-patient conversations, Nong said, “We want to see if that divide between the well-resourced and under-resourced hospitals is growing or narrowing.”


Mary Chris Jaklevic

Mary Chris Jaklevic is an independent journalist based in Chicago. She served on AHCJ’s board for two terms and was formerly AHCJ’s health beat leader for patient safety.