Abstract: SA-PO003
Enhancing Large Language Models (LLM) Performance in Nephrology through Prompt Engineering: A Comparative Analysis of ChatGPT-4 Responses in Answering AKI and Critical Care Nephrology Questions
Session Information
- Augmented Intelligence, Large Language Models, and Digital Health
October 26, 2024 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: 300 Augmented Intelligence, Digital Health, and Data Science
Authors
- Sheikh, M. Salman, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Thongprayoon, Charat, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Qureshi, Fawad, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Abdelgadir, Yasir, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Craici, Iasmina, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Kashani, Kianoush, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Cheungpasitporn, Wisit, Mayo Clinic Minnesota, Rochester, Minnesota, United States
Background
Large Language Models (LLMs) have significantly advanced the field of artificial intelligence (AI). Their effectiveness is substantially influenced by the structure and formulation of input queries, a process known as prompt engineering. Techniques such as the chain-of-thought approach, in which the model is asked to reason through a problem step by step before answering, have shown improved accuracy compared with standard prompts. This study investigates the impact of chain-of-thought prompting on the accuracy of ChatGPT-4 in answering acute kidney injury (AKI) and critical care nephrology questions.
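For illustration, a minimal sketch of the two prompting styles is shown below. The abstract does not report the exact prompt wording used in the study, so the templates, the `ask_gpt4` helper, and the `gpt-4` model identifier are assumptions for illustration only, not the authors' protocol.

```python
# Illustrative sketch only: the study's exact prompts are not reported in the abstract.
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Regular prompting: the question is posed directly.
REGULAR_TEMPLATE = (
    "Answer the following multiple-choice nephrology question. "
    "Reply with the single best answer choice.\n\n{question}"
)

# Chain-of-thought prompting: the model is asked to reason step by step before answering.
COT_TEMPLATE = (
    "Answer the following multiple-choice nephrology question. "
    "Think through the problem step by step, explaining your clinical reasoning, "
    "and then state the single best answer choice.\n\n{question}"
)

def ask_gpt4(question: str, template: str) -> str:
    """Send one question to ChatGPT-4 with the given prompt template (hypothetical helper)."""
    response = client.chat.completions.create(
        model="gpt-4",  # model identifier is an assumption; the study used ChatGPT-4
        messages=[{"role": "user", "content": template.format(question=question)}],
    )
    return response.choices[0].message.content
```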
Methods
We presented ChatGPT-4 with 101 questions from the Kidney Self-Assessment Program (KSAP) and the Nephrology Self-Assessment Program (NephSAP). Each question was posed with two prompting methods: the original question alone (regular prompting) and a chain-of-thought prompt. The McNemar test was used to assess differences in accuracy, and Cohen's kappa was used to evaluate agreement between the two prompting methods.
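A minimal sketch of this paired analysis in Python is given below. The per-question correctness indicators are simulated placeholders (the study's gradings are not included in the abstract), and the use of statsmodels and scikit-learn is an assumption about tooling, not a description of the authors' code.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 101

# Placeholder correctness indicators (1 = correct, 0 = incorrect); replace with graded responses.
regular_correct = rng.integers(0, 2, size=n_questions)
cot_correct = rng.integers(0, 2, size=n_questions)

# 2x2 paired contingency table: rows = regular prompting, columns = chain-of-thought prompting.
table = np.array([
    [np.sum((regular_correct == 1) & (cot_correct == 1)),
     np.sum((regular_correct == 1) & (cot_correct == 0))],
    [np.sum((regular_correct == 0) & (cot_correct == 1)),
     np.sum((regular_correct == 0) & (cot_correct == 0))],
])

# McNemar test for a difference in paired proportions (accuracy of the two prompts).
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.3f}")

# Cohen's kappa for agreement between the two prompting methods.
kappa = cohen_kappa_score(regular_correct, cot_correct)
print(f"Cohen's kappa: {kappa:.2f}")

# Raw percent agreement between the two prompts.
agreement = np.mean(regular_correct == cot_correct)
print(f"Percent agreement: {agreement:.1%}")
```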
Results
ChatGPT-4 demonstrated an accuracy of 87.1% with chain-of-thought prompting, compared with 81.2% with regular prompting (P=0.15). The kappa statistic for the responses given under the two prompts was 0.80. The two methods gave consistent responses on 84.2% of the questions, with 78.2% answered correctly by both. Chain-of-thought prompting correctly answered nine questions that were missed under regular prompting. Of the thirteen questions missed under chain-of-thought prompting, 76.9% were also answered incorrectly under regular prompting. Only three questions answered incorrectly with chain-of-thought prompting were answered correctly under regular prompting.
Conclusion
The chain-of-thought approach yielded higher accuracy than regular prompting when ChatGPT-4 addressed nephrology-related questions, although the difference was not statistically significant. These findings emphasize the importance of developing effective prompting strategies to optimize the use of LLMs in clinical decision support. Future research should assess whether these findings generalize across medical specialties to maximize the benefits of LLMs in clinical decision-making.