(CTN News) – ChatGPT remains controversial in the medical community, even as it makes headway on the complex clinical scenarios doctors face every day.
In the big picture, doctors are grappling with questions about what constitutes an acceptable success rate for AI-supported diagnosis and whether AI’s reliability will hold up in the real world.
In a new study, Mass General Brigham researchers tested ChatGPT’s performance on textbook-drawn case studies and found it was 72% accurate in its clinical decisions, which ranged from identifying possible conditions to making final decisions.
As Americans live longer and the population ages, AI could ultimately improve both efficiency and accuracy of diagnosis.
The U.S. has some of the best hospitals and physicians in the world, but it spent nearly twice as much on health care as the average advanced economy in 2021.
Mass General Brigham’s study is the first to assess the capacity of large language models across a broad spectrum of clinical care, rather than focusing on one particular area.
The study “comprehensively evaluates decision support via ChatGPT from the very beginning of working with a patient through the entire care scenario” including post-diagnosis care management, according to co-author Marc Succi, executive director of Mass General Brigham’s innovation incubator.
ChatGPT got the final diagnosis right 77% of the time. In cases where a “differential diagnosis” was required, the bot’s success rate dropped to 60%.
A second study across 171 hospitals in the U.S. and the Netherlands found that a machine learning model called ELDER-ICU succeeded in determining the severity of illness in older adults admitted to intensive care units, thus assisting clinicians in identifying geriatric ICU patients who require more or earlier care.
Be smart: Although AI has outperformed medical professionals on certain tasks, like cancer detection from medical imaging, many studies of AI’s medical uses haven’t translated into real-world practice, and some critics say AI studies don’t take real clinical needs into account.
It’s worth noting that AI tested in a research setting carries no malpractice risk, unlike physicians working in real clinical settings.
While encouraged by the Mass General Brigham study, Succi told Axios there’s still more to do to “bridge the gap from a useful machine learning model to actual clinical use.”
AI helps doctors “when little presenting information (is available) and a list of possible diagnoses is needed,” Succi said.
Large language models need to improve at differential diagnosis before they’re ready for prime time, Succi said, adding that researchers should also examine ways to apply AI to hospital tasks that don’t require a final diagnosis, like emergency room triage.
ChatGPT is starting to show the skills of a newly graduated doctor, Succi said. It’s hard to judge whether AI adds value to a doctor’s work since there aren’t any real benchmarks for success rates.
Getting ChatGPT or comparable AI models into hospitals will require more benchmark research and regulatory guidance, plus diagnostic success rates between 80% and 90%.