TY - JOUR
T1 - A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis
AU - Cohen, Natalie D.
AU - Ho, Milan
AU - McIntire, Donald
AU - Smith, Katherine
AU - Kho, Kimberly A.
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2025/2
Y1 - 2025/2
N2 - Introduction: The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably begin using large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients, and to determine whether there is variability between them. Objective: This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and to determine the level of variability between them. Study Design: Three LLMs, ChatGPT-4 (OpenAI), Claude (Anthropic), and Bard (Google), were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared with current guidelines and expert opinion on endometriosis and rated by nine gynecologists on the following scale: (1) completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged across the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the reviewers' strength of agreement in ranking the LLMs' responses for each item. Results: Average scores for the 10 answers for Bard, ChatGPT, and Claude were 3.69, 4.24, and 3.70, respectively. Two questions showed significant disagreement among the nine reviewers. No question was answered both comprehensively and correctly by the models across all reviewers. The model most associated with comprehensive and correct responses was ChatGPT. The chatbots answered questions about symptoms and pathophysiology more accurately than questions about treatment and risk of recurrence. Conclusion: The analysis of the LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, a thorough, ongoing evaluation process for their outputs is crucial to providing patients with the most comprehensive and accurate information. Further research into this technology and its role in patient education and treatment is essential as generative AI becomes more embedded in the medical field.
KW - chatbots
KW - endometriosis education
KW - health information technology
KW - large language models
KW - patient education
KW - patient information
UR - http://www.scopus.com/inward/record.url?scp=85212614501&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85212614501&partnerID=8YFLogxK
U2 - 10.1016/j.xagr.2024.100405
DO - 10.1016/j.xagr.2024.100405
M3 - Article
C2 - 39810943
AN - SCOPUS:85212614501
SN - 2666-5778
VL - 5
JO - AJOG Global Reports
JF - AJOG Global Reports
IS - 1
M1 - 100405
ER -