When an AI scribe transcribes a doctor's conversation with a patient, everyone in the room seems happier. The physician can maintain eye contact instead of scribbling notes. The paperwork disappears. Early studies confirm what clinicians report anecdotally: ambient AI tools reduce burnout. But here is the question that researchers say the industry is failing to ask: What happens to the patient?
That gap, between what healthcare AI measures and what it should measure, has become the central tension in a growing body of academic criticism. Jenna Wiens, a computer scientist at the University of Michigan, and Anna Goldenberg of the University of Toronto laid out the case in a paper published this week in Nature Medicine. Their argument is direct: healthcare providers have begun rapidly deploying AI tools without rigorously assessing whether those tools improve patient outcomes. The technology may work as designed. The evidence that it helps patients is largely missing.
The distinction matters. A tool that accelerates X-ray interpretation may produce accurate results in isolation. But Wiens points to a chain of downstream questions that remain unanswered. How heavily will a radiologist lean on the AI's analysis? Does faster interpretation change treatment recommendations? Does it alter the doctor-patient interaction in ways that matter for adherence or follow-up care? "We just don't know," Wiens told MIT Technology Review.
This is not a narrow technical complaint. It reflects a structural problem in how healthcare AI is being validated and adopted. Regulatory pathways like FDA clearance emphasize safety and technical performance, not longitudinal health impact. Health systems deploying these tools measure what is easy to measure: time saved, notes completed, clinician satisfaction scores. Patient outcomes—recovery rates, diagnostic accuracy over time, hospital readmissions—require years of data collection and are harder to attribute to any single intervention.
The efficiency metrics are real. AI scribes genuinely reduce administrative burden, and clinician burnout is a documented crisis in American healthcare. But researchers like Wiens and Goldenberg argue that optimizing for provider experience without tracking downstream patient effects is a form of metrics theater—one that feels like progress without delivering it. The healthcare AI market is projected to reach billions of dollars annually on the strength of adoption rates and satisfaction surveys, not clinical trial outcomes.
Some pushback is warranted. Measuring patient outcomes in real-world clinical settings is genuinely difficult, and randomized controlled trials for AI tools face unique challenges, including rapid iteration cycles that outpace traditional study timelines. Critics might also argue that withholding promising technologies from patients while outcome data accumulates carries its own costs. These are not trivial objections.
But the asymmetry matters. If a hospital deploys an AI tool that saves clinicians two hours per day but does not measurably improve patient outcomes, the technology has succeeded on its own terms while failing on the terms that justify its existence. Wiens spent the first decade of her career pitching AI to clinicians. Over the past few years, she says, adoption accelerated dramatically, without the corresponding evaluation infrastructure. The question shifted from "will this work?" to "it's already everywhere," and nobody built the mechanisms to answer whether it should be everywhere.
The Nature Medicine paper stops short of recommending specific moratoria. Instead, it calls for systematic outcome assessment as a standard practice alongside technical validation. That is a modest ask (essentially, measure what you claim to care about), yet it represents a significant departure from current industry norms. Until the healthcare AI industry accepts that accountability, every efficiency gain will remain shadowed by an open question: is the tool helpful, or just busy?