Abstract: SA-PO009
Harnessing Artificial Intelligence (AI): Evaluating Models for Social Discourse Analysis among Patients with Kidney Disease in Online Communities
Session Information
- Augmented Intelligence, Large Language Models, and Digital Health
October 26, 2024 | Location: Exhibit Hall, Convention Center
Abstract Time: 10:00 AM - 12:00 PM
Category: 300 Augmented Intelligence, Digital Health, and Data Science
Authors
- Anumolu, Rajesh, University of Massachusetts Chan Medical School, Worcester, Massachusetts, United States
- Delory, Aaron, IO Design Team LLC, Danvers, Massachusetts, United States
- Chandraker, Anil K., University of Massachusetts Chan Medical School, Worcester, Massachusetts, United States
Background
The rapid exchange of information through technology has transformed healthcare communication. Online patient communities generate data faster than traditional methods can analyze, offering opportunities for sentiment analysis to guide research and inform pharmaceutical development. While many researchers default to OpenAI's ChatGPT, the variability and nuances of different large language models (LLMs) remain under-assessed, and understanding these differences is crucial for selecting the appropriate model for a given application. This study compares four LLMs, alongside a rule-based baseline, to evaluate their strengths and weaknesses in sentiment analysis.
Methods
The study analyzed 39,637 Reddit posts and 283,326 comments posted from 2011 to 2022 across subreddits related to dialysis, kidney disease, kidney stones, and transplantation. The following models were used:
- VADER: Rule-based model designed for social media sentiment analysis
- RoBERTa: Transformer-based model fine-tuned for sentiment analysis tasks
- GPT-4o: OpenAI's multimodal GPT-4-class model
- LLaMA 3 8B-Instruct and Mistral 283k: Transformer-based models
Each model analyzed the text for sentiment (positive/negative/neutral) and provided confidence scores. The results were compared to assess agreement and model-specific biases.
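As a concrete illustration of this labeling step, the minimal sketch below scores one post with VADER and a RoBERTa sentiment model and returns a (label, confidence) pair from each. This is a sketch under stated assumptions rather than the study's actual pipeline: the RoBERTa checkpoint name and the example post are illustrative choices, and VADER's conventional compound-score thresholds (±0.05) are used to map scores to the three classes.

```python
# Minimal sketch, not the authors' pipeline: label one post with VADER and a
# RoBERTa sentiment model. Requires the vaderSentiment and transformers packages.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

vader = SentimentIntensityAnalyzer()
# Illustrative checkpoint choice; the study's exact RoBERTa variant is not specified.
roberta = pipeline("sentiment-analysis",
                   model="cardiffnlp/twitter-roberta-base-sentiment-latest")

def vader_label(text: str) -> tuple[str, float]:
    # VADER's documented convention: compound >= 0.05 is positive,
    # compound <= -0.05 is negative, anything in between is neutral.
    compound = vader.polarity_scores(text)["compound"]
    label = ("positive" if compound >= 0.05
             else "negative" if compound <= -0.05
             else "neutral")
    return label, abs(compound)

def roberta_label(text: str) -> tuple[str, float]:
    # The pipeline returns a list like [{'label': 'positive', 'score': 0.98}].
    result = roberta(text, truncation=True)[0]
    return result["label"], result["score"]

# Hypothetical example post, not taken from the dataset.
post = "Got my transplant six months ago and finally feel like myself again."
print(vader_label(post))
print(roberta_label(post))
```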
Results
Results were visualized graphically; model-specific biases are summarized below.
Models' Biases:
- VADER: Tends to overestimate positive sentiment due to its sensitivity to positive keywords.
- RoBERTa: Classifies more posts as neutral, possibly underrepresenting emotional extremes.
- GPT-4o: Balanced sentiment with a slight positive bias, capturing nuanced positivity.
- LLaMA 3: Frequently predicts neutral sentiment, yielding a conservative interpretation.
- Mistral: Produces a higher count of negative predictions, emphasizing detection of negative expressions.
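To make the cross-model comparison concrete, the sketch below tabulates each model's label distribution (which surfaces the biases listed above) and computes pairwise Cohen's kappa as one plausible agreement measure. The abstract does not name the agreement statistic used, and the label lists here are toy placeholders, not study data.

```python
# Illustrative sketch of the cross-model comparison; not the study's code.
from collections import Counter
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Toy per-post labels for three of the models; placeholders, not study data.
labels = {
    "VADER":   ["positive", "positive", "neutral", "negative", "positive"],
    "RoBERTa": ["neutral",  "positive", "neutral", "negative", "neutral"],
    "GPT-4o":  ["positive", "positive", "neutral", "neutral",  "positive"],
}

# Label distributions expose systematic tendencies, e.g. a positive skew
# for VADER or a neutral skew for RoBERTa.
for model, preds in labels.items():
    print(model, dict(Counter(preds)))

# Pairwise Cohen's kappa gives chance-corrected agreement between models.
for a, b in combinations(labels, 2):
    print(f"{a} vs {b}: kappa = {cohen_kappa_score(labels[a], labels[b]):.2f}")
```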
Conclusion
Recognizing the variability among LLMs is crucial for selecting the right tool: even this comparison of four LLMs and a rule-based baseline revealed distinct emphases and nuances. Under the best of circumstances, these models can provide insight into patient experiences, highlight health disparities, guide equitable interventions, and amplify marginalized voices; without proper evaluation, however, we may miss the mark. Future research should explore more diverse datasets to deepen our understanding of these platforms.