Abstract: SA-PO005
Data Preprocessing: A Key Factor in Large Language Models' Performance in Critical Care Nephrology
Session Information
- Augmented Intelligence, Large Language Models, and Digital Health
October 26, 2024 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: Augmented Intelligence, Digital Health, and Data Science
Authors
- Sheikh, M. Salman, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Thongprayoon, Charat, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Qureshi, Fawad, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Miao, Jing, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Craici, Iasmina, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Kashani, Kianoush, Mayo Clinic Minnesota, Rochester, Minnesota, United States
- Cheungpasitporn, Wisit, Mayo Clinic Minnesota, Rochester, Minnesota, United States
Background
In clinical evaluations, data is often encountered in the form of tables from outside sources. These tables can be processed as images or reformatted into text before being input into multimodal large language models (LLMs). The use of LLMs in medical education and decision-making is gaining traction. However, their accuracy in interpreting complex clinical data, particularly when presented as images rather than reformatted text, remains a concern. This study evaluates the impact of data formatting on the performance of ChatGPT-4 and Claude 3 Opus in answering critical care nephrology questions from the Kidney Self-Assessment Program (KSAP).
Methods
Fifty-six AKI and critical care nephrology questions from KSAP were reviewed, focusing on the 46 that included tables with pertinent information such as laboratory values and diagnostic results. The tables were first entered as screenshots (image-encoded format), and the models' responses were recorded. The tables were then reformatted into plain text, and the models were reassessed on the same questions. The McNemar test assessed the statistical significance of the change in accuracy, and Cohen's kappa quantified the agreement between each model's pre-formatting and post-formatting answers.
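The paired analysis described above can be sketched with the Python standard library. This is a minimal illustration of the two statistics, not the study's analysis code; the 2x2 counts used in the example are hypothetical placeholders, since the abstract does not report item-level data.

```python
import math

def mcnemar_p(b, c):
    """Continuity-corrected McNemar chi-square test (1 df) on the two
    discordant cells of a paired 2x2 table:
    b = correct only pre-formatting, c = correct only post-formatting."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-square survival function with 1 df: P(X >= chi2) = erfc(sqrt(chi2/2))
    return math.erfc(math.sqrt(chi2 / 2))

def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table:
    a = correct/correct, b = correct/incorrect,
    c = incorrect/correct, d = incorrect/incorrect
    (rows = pre-formatting, columns = post-formatting)."""
    n = a + b + c + d
    po = (a + d) / n  # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical split of the 46 questions, for illustration only
a, b, c, d = 20, 3, 14, 9
print(f"McNemar p = {mcnemar_p(b, c):.4f}")
print(f"Cohen's kappa = {cohens_kappa(a, b, c, d):.3f}")
```

Because both answer sets come from the same 46 questions, the McNemar test uses only the discordant cells (questions answered correctly in one format but not the other), which is why pairing the pre- and post-formatting responses matters.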
Results
With tables in image-encoded format, ChatGPT-4 and Claude 3 Opus achieved accuracies of 50.0% and 43.5%, respectively. After the tables were reformatted into plain text, accuracies improved significantly, to 73.9% for ChatGPT-4 and 60.9% for Claude 3 Opus (p<0.001 for both). Cohen's kappa for agreement between pre-formatting and post-formatting answers, computed on the same question set, was 0.141 for ChatGPT-4 and 0.350 for Claude 3 Opus.
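As a consistency check, the reported percentages correspond to whole-number counts of correct answers out of the 46 table-containing questions. The counts below are inferred from the percentages, not stated in the abstract:

```python
# Inferred correct-answer counts out of n = 46 questions
n = 46
results = [("ChatGPT-4", 23, 34), ("Claude 3 Opus", 20, 28)]
for model, pre, post in results:
    print(f"{model}: {pre/n:.1%} (image) -> {post/n:.1%} (text)")
```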
Conclusion
Data formatting substantially affects the ability of LLMs to interpret complex clinical data in critical care nephrology: reformatting tables from image-encoded to plain-text format significantly improved the accuracy of both ChatGPT-4 and Claude 3 Opus on KSAP questions. This underscores the importance of data preprocessing when optimizing LLM performance for clinical decision support.