Abstract: SA-PO001
Nephrology Tools: Using Chatbots for Image Interpretation and Answering Questions
Session Information
- Augmented Intelligence, Large Language Models, and Digital Health
October 26, 2024 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: 300 Augmented Intelligence, Digital Health, and Data Science
Authors
- Garcia Valencia, Oscar Alejandro, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Thongprayoon, Charat, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Krisanapan, Pajaree, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Suppadungsuk, Supawadee, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Cheungpasitporn, Wisit, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Miao, Jing, Mayo Clinic Minnesota, Rochester, Minnesota, United States
Background
Effective medical diagnosis and treatment planning rely heavily on clinical imaging. As of September 2023, ChatGPT-4 had broadened its functionality to allow image interpretation, providing comprehensive explanations and problem-solving insights. However, the effectiveness of chatbots in image interpretation remains understudied. We aimed to evaluate the performance of leading chatbots (i.e., GPT-4, Bard AI, and Bing Chat) in interpreting nephrology images and answering related test questions.
Methods
We assessed 57 nephrology test questions with their associated images from the Nephrology Self-Assessment Program and Kidney Self-Assessment Program tests. This set included 19 kidney histopathological images, 28 radiological images, and 10 images from miscellaneous categories. We omitted the image descriptions to focus solely on the chatbots' interpretative abilities. Each question was analyzed twice through GPT-4, Bing Chat, and Bard AI. We then calculated and compared the accuracy and concordance rates for both image interpretation and question answering across these AI models.
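As a rough illustration of the metrics described above (the abstract includes no code; the data layout, grading labels, and function names here are assumptions), a minimal Python sketch of how total accuracy and run-to-run concordance could be computed:

```python
# Hypothetical sketch of the accuracy/concordance calculation described above.
# Each question was run twice per chatbot; run1 and run2 hold graded responses
# ("correct"/"incorrect"). These names and labels are illustrative assumptions.

def accuracy(run1, run2):
    """Total accuracy: fraction of correct answers across both runs."""
    answers = run1 + run2
    return sum(a == "correct" for a in answers) / len(answers)

def concordance(run1, run2):
    """Concordance: fraction of questions where the two runs agree."""
    agree = sum(a == b for a, b in zip(run1, run2))
    return agree / len(run1)

# Toy data for 5 questions (not the study's actual results):
run1 = ["correct", "incorrect", "correct", "correct", "incorrect"]
run2 = ["correct", "correct", "correct", "correct", "incorrect"]
print(f"accuracy:    {accuracy(run1, run2):.2f}")    # 0.70
print(f"concordance: {concordance(run1, run2):.2f}") # 0.80
```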
Results
Regarding image interpretation, GPT-4 achieved 79% total accuracy, outperforming Bard AI’s 51% and Bing Chat’s 35% (p<0.001). Each chatbot performed similarly across image types (p=0.57, 0.39, and 0.38 for GPT-4, Bard AI, and Bing Chat, respectively). On image-related questions, GPT-4, Bard AI, and Bing Chat yielded comparable results, with total accuracy rates of 60%, 53%, and 61% (p>0.05) and concordance rates of 75%, 68%, and 74% (p>0.05), respectively. Notably, GPT-4 and Bard AI were more likely to answer correctly when their image interpretation was accurate (correct responses 60% vs 37%, p=0.01 for GPT-4; 60% vs 27%, p<0.001 for Bard AI, for accurate vs inaccurate image interpretation, respectively), whereas Bing Chat’s question-answering accuracy did not differ by the accuracy of its image interpretation (58% vs 51%, p=0.50).
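The abstract does not name the statistical test behind these p-values; a chi-square test of independence on a 2x2 contingency table is a common choice for proportion comparisons of this kind. A minimal sketch under that assumption, with made-up counts for illustration only:

```python
# Illustrative sketch only: the abstract does not specify the exact test, and
# the counts below are invented, not the study's data.
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: correct/incorrect answers for a chatbot when its
# image interpretation was accurate vs inaccurate.
table = [
    [27, 18],  # interpretation accurate:   27 correct, 18 incorrect answers
    [ 4,  8],  # interpretation inaccurate:  4 correct,  8 incorrect answers
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```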
Conclusion
These results highlight the challenges of developing AI-driven chatbots for medical image interpretation in nephrology, with GPT-4 showing remarkable potential. The study provides valuable insights for further refinement of AI models, emphasizing the importance of accuracy, especially in scenarios where medical decisions rely on precise image interpretations.