Conference Coverage

AI chatbot ‘hallucinates’ faulty medical intelligence

FROM IDWEEK 2023

Artificial intelligence (AI) models are typically a year out of date and have this “charming problem of hallucinating made-up data and saying it with all the certainty of an attending on rounds,” Isaac Kohane, MD, PhD, Harvard Medical School, Boston, told a packed plenary audience at an annual scientific meeting on infectious diseases.

Dr. Kohane, chair of the department of biomedical informatics, said the future intersection between AI and health care is “muddy.”

Echoing questions about the accuracy of new AI tools, researchers at the meeting presented the results of their new test of ChatGPT.

The AI chatbot is designed for language processing – not scientific accuracy – and does not guarantee that responses to medical queries are fully factual.

To test the accuracy of ChatGPT’s version 3.5, the researchers asked it if there are any boxed warnings on the U.S. Food and Drug Administration’s label for common antibiotics, and if so, what they are.

ChatGPT provided correct answers about FDA boxed warnings for only 12 of the 41 antibiotics queried – a matching rate of just 29%.

For the other 29 antibiotics, ChatGPT either “incorrectly reported that there was an FDA boxed warning when there was not, or inaccurately or incorrectly reported the boxed warning,” Rebecca Linfield, MD, infectious diseases fellow, Stanford (Calif.) University, said in an interview.

Uncritical AI use risky

Nine of the 41 antibiotics included in the query have boxed warnings. ChatGPT correctly identified that all nine carry a boxed warning, but it named the matching adverse event for only three of them (33%). For the 32 antibiotics without an FDA boxed warning, ChatGPT correctly reported the absence of a warning for only 9 (28%).
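The reported percentages follow directly from those counts. The minimal sketch below (written for this article, not taken from the study, and using only the tallies stated here) shows how the 33%, 28%, and overall 29% figures are derived.

```python
# Tallies as reported in the article; per-antibiotic data are not shown here.
total_queried = 41
with_warning = 9            # antibiotics that truly carry an FDA boxed warning
correct_warning_match = 3   # ChatGPT named the matching adverse event
without_warning = 32        # antibiotics with no FDA boxed warning
correct_no_warning = 9      # ChatGPT correctly reported no boxed warning

overall_correct = correct_warning_match + correct_no_warning  # 12

print(f"Matching adverse event:    {correct_warning_match / with_warning:.0%}")  # ~33%
print(f"Correctly no boxed warning: {correct_no_warning / without_warning:.0%}")  # ~28%
print(f"Overall matching rate:      {overall_correct / total_queried:.0%}")       # ~29%
```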

For example, ChatGPT stated that the antibiotic fidaxomicin has a boxed warning for increased risk for Clostridioides difficile, “but it is the first-line antibiotic used to treat C. difficile,” Dr. Linfield pointed out.

ChatGPT also reported that cefepime increased the risk for death in those with pneumonia and fabricated a study supporting that assertion. “However, cefepime is a first-line drug for those with hospital-acquired pneumonia,” Dr. Linfield explained.

“I can imagine a worried family member finding this through ChatGPT, and needing to have extensive reassurances from the patient’s physicians about why this antibiotic was chosen,” she said.

ChatGPT also incorrectly stated that aztreonam has a boxed warning for increased mortality.

“The risk is that both physicians and the public uncritically use ChatGPT as an easily accessible, readable source of clinically validated information, when these large language models are meant to generate fluid text, and not necessarily accurate information,” Dr. Linfield told this news organization.

Dr. Linfield said that the next step is to compare the ChatGPT 3.5 used in this analysis with ChatGPT 4, as well as with Google’s Med-PaLM 2 after it is released to the public.

Advancing fast

During the plenary session, Dr. Kohane pointed out that AI is a quick learner and that improvements in these tools are coming fast.

As an example, just 3 years ago, the best AI tool could score about as well as the worst student taking the medical boards, he told the audience. “Three years later, the leading large language models are scoring better than 90% of all the candidates. What’s it going to be doing next year?” he asked.

“I don’t know,” Dr. Kohane said, “but it will be better than this year.” AI will “transform health care.”

A version of this article first appeared on Medscape.com.
