Is ChatGPT smarter than a PCP?

AT RCGP 2023

ChatGPT failed to pass the U.K.’s National Primary Care examinations in a new study, highlighting how artificial intelligence (AI) does not necessarily match human perceptions of medical complexity.

ChatGPT also provided novel explanations – it frequently “hallucinates” – by describing inaccurate information as if it were fact, according to Shathar Mahmood, BA, a fifth-year medical student at the University of Cambridge, England, who presented the findings at the annual meeting of the Royal College of General Practitioners. The study was published in JMIR Medical Education earlier this year.

“Artificial intelligence has generated impressive results across medicine, and with the release of ChatGPT there is now discussion about these large language models taking over clinicians’ jobs,” Arun James Thirunavukarasu, MB BChir, of the University of Oxford, England, and Oxford University Hospitals NHS Foundation Trust, who is the lead author of the study, told this news organization.

Performance of AI on medical school examinations has prompted much of this discussion, often because performance does not reflect real-world clinical practice, he said. “We used the Applied Knowledge Test instead, and this allowed us to explore the potential and pitfalls of deploying large language models in primary care and to explore what further development of medical large language model applications is required.”

The researchers investigated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test. The computer-based, multiple-choice assessment is part of the U.K.’s specialty training to become a general practitioner (GP). It tests knowledge behind general practice within the context of the United Kingdom’s National Health Service.

The researchers entered a series of 674 questions into ChatGPT on two occasions, or “runs.” “By putting the questions into two separate dialogues, we hoped to avoid the influence of one dialogue on the other,” Ms. Mahmood said. To validate that the answers were correct, the ChatGPT responses were compared with the answers provided by the GP self-test and past articles.

Docs 1, AI 0

Overall performance of the algorithm was good across both runs (59.94% and 60.39%); 83.23% of questions produced the same answer on both runs.

But 17% of the answers didn’t match, Ms. Mahmood reported, a statistically significant difference. “And the overall performance of ChatGPT was 10% lower than the average RCGP pass mark in the last few years, which informs one of our conclusions about it not being very precise at expert level recall and decision-making,” she said.

Also, a small percentage of questions (1.48% and 2.25% in the first and second runs, respectively) produced either an uncertain answer or no answer at all.

Say what?

ChatGPT generated novel explanations when a question prompted it to provide an extended answer, Ms. Mahmood said. When the accuracy of the extended answers was checked against the correct answers, no correlation was found. “ChatGPT can hallucinate answers, and there’s no way a nonexpert reading this could know it is incorrect,” she said.

Regarding the application of ChatGPT and similar algorithms to clinical practice, Ms. Mahmood was clear. “As they stand, [AI systems] will not be able to replace the health care professional workforce, in primary care at least,” she said. “I think larger and more medically specific datasets are required to improve their outputs in this field.”

Sandip Pramanik, MBChB, a GP in Watford, Hertfordshire, England, said the study “clearly showed ChatGPT’s struggle to deal with the complexity of the exam questions that is based on the primary care system. In essence, this is indicative of the human factors involved in decision-making in primary care.”

The Applied Knowledge Test is designed to test the knowledge required to be a generalist in the primary care setting, and as such, the questions contain many nuances that reflect this, Dr. Pramanik said.

“ChatGPT may look at these in a more black and white way, whereas the generalist needs to be reflective of the complexities involved and the different possibilities that can present rather than take a binary ‘yes’ or ‘no’ stance,” he said. “In fact, this highlights a lot about the nature of general practice in managing uncertainty, and this is reflected in the questions asked in the exam,” he remarked. He noted, “Being a generalist is about factoring in human emotion and human perception as well as knowledge.”

Ms. Mahmood, Dr. Thirunavukarasu, and Dr. Pramanik have disclosed no relevant financial relationships.

A version of this article first appeared on Medscape.com.