Kidney Week

Abstract: SA-PO003

Enhancing Large Language Models (LLM) Performance in Nephrology through Prompt Engineering: A Comparative Analysis of ChatGPT-4 Responses in Answering AKI and Critical Care Nephrology Questions

Session Information

Category: Augmented Intelligence, Digital Health, and Data Science

  • 300 Augmented Intelligence, Digital Health, and Data Science

Authors

  • Sheikh, M. Salman, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Thongprayoon, Charat, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Qureshi, Fawad, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Abdelgadir, Yasir, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Craici, Iasmina, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Kashani, Kianoush, Mayo Clinic Minnesota, Rochester, Minnesota, United States
  • Cheungpasitporn, Wisit, Mayo Clinic Minnesota, Rochester, Minnesota, United States
Background

Large Language Models (LLMs) have significantly advanced the field of artificial intelligence (AI). The effectiveness of LLMs is substantially influenced by the structure and formulation of input queries, a process known as prompt engineering. Prompt engineering techniques such as the chain of thought approach, which prompts the model to reason through a problem step by step before answering, have shown improved accuracy compared with regular prompts. This study investigates the impact of the chain of thought approach on the accuracy of ChatGPT-4 in addressing acute kidney injury (AKI) and critical care nephrology questions.
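The distinction between the two prompting styles can be illustrated with a minimal sketch. The question text and prompt wording below are hypothetical placeholders, not actual KSAP/NephSAP items or the study's exact prompts:

```python
# Illustrative sketch of regular vs. chain of thought prompting.
# The question and wording are hypothetical, not study materials.

QUESTION = (
    "A 62-year-old patient develops oliguria 48 hours after cardiac "
    "surgery. Which diagnostic step is most appropriate?\n"
    "A) ... B) ... C) ... D) ..."
)

def regular_prompt(question: str) -> str:
    """Regular prompting: pass the original question through unchanged."""
    return question

def chain_of_thought_prompt(question: str) -> str:
    """Chain of thought prompting: instruct the model to reason step by
    step through the clinical findings before committing to an answer."""
    return (
        f"{question}\n\n"
        "Think through this problem step by step: summarize the clinical "
        "findings, consider each answer choice in turn, and only then "
        "state the single best answer."
    )
```

The only difference between the two conditions is the appended reasoning instruction; the underlying question is identical, which is what makes a paired (question-by-question) comparison possible.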

Methods

We presented ChatGPT-4 with 101 questions from the Kidney Self-Assessment Program (KSAP) and Nephrology Self-Assessment Program (NephSAP). We employed two prompting methods: one presenting the original question unchanged (regular prompting) and the other applying the chain of thought approach. The McNemar test was used to assess the difference in accuracy between the paired responses, and Cohen's kappa was used to evaluate agreement between the two prompting methods.
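The paired statistical comparison described above can be sketched in plain Python: McNemar's test (with continuity correction) operates only on the discordant pairs, while Cohen's kappa compares observed agreement with the agreement expected by chance from each method's marginal accuracy. The counts passed in below are hypothetical, for illustration only, not the study's data:

```python
import math

def mcnemar_and_kappa(both_correct: int, cot_only: int,
                      reg_only: int, both_wrong: int):
    """Compare two prompting methods on paired binary outcomes.

    cot_only = questions correct under chain of thought but not regular;
    reg_only = questions correct under regular but not chain of thought.
    Returns (mcnemar_statistic, mcnemar_p, cohens_kappa).
    """
    n = both_correct + cot_only + reg_only + both_wrong

    # McNemar's test with continuity correction uses only the discordant
    # pairs; under H0 the statistic follows a chi-square with 1 df.
    stat = (abs(cot_only - reg_only) - 1) ** 2 / (cot_only + reg_only)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square(1) tail prob.

    # Cohen's kappa: observed agreement vs. chance agreement implied by
    # the marginal accuracy of each prompting method.
    p_obs = (both_correct + both_wrong) / n
    p_cot = (both_correct + cot_only) / n
    p_reg = (both_correct + reg_only) / n
    p_exp = p_cot * p_reg + (1 - p_cot) * (1 - p_reg)
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return stat, p_value, kappa

# Hypothetical counts for 101 paired questions (illustrative only):
stat, p, kappa = mcnemar_and_kappa(both_correct=79, cot_only=9,
                                   reg_only=3, both_wrong=10)
```

Because McNemar's test ignores the concordant pairs, a large improvement on a handful of discordant questions can still fail to reach significance when the discordant count is small, which is the scenario this study's sample size makes likely.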

Results

ChatGPT-4 demonstrated an accuracy of 87.1% with chain of thought prompting, compared with 81.2% under regular prompting (P=0.15). The kappa statistic for agreement between the two prompting methods was 0.80. The two methods gave consistent responses on 84.2% of the questions, with 78.2% answered correctly by both. Chain of thought prompting correctly answered nine questions that were missed under regular prompting. Of the thirteen questions missed under chain of thought prompting, 76.9% were errors repeated from regular prompting; only three questions answered incorrectly with chain of thought prompting were correct under regular prompting.

Conclusion

The chain of thought approach improved ChatGPT-4's accuracy on nephrology questions compared with regular prompting, although the difference was not statistically significant. These findings emphasize the importance of developing effective prompting strategies to optimize the application of LLMs in clinical decision support. Future research should evaluate whether these findings generalize across medical specialties to maximize the benefits of LLMs in clinical decision-making.