The Role of LLMs in Creating and Judging Care Plans for Uveitis Management
Presenter:
Taylor Crook, MD
Authors:
Taylor Crook, MD1; Pooya Khosravi1,2; Jordan Tang1; Paul Zhou, MD1; Olivia Lee, MD1
1 Department of Ophthalmology, School of Medicine, University of California, Irvine, CA 92697, USA
2 Department of Computer Science, Donald Bren School of Information and Computer Sciences, University of California, Irvine, CA 92697, USA
Purpose: The incorporation of artificial intelligence into patient care has grown at an unprecedented pace, allowing physicians to manage their workload more efficiently. This study evaluated whether an open-source large language model (LLM), Llama 3.2, given the clinical notes of patients with uveitis, could generate an accurate workup, with a second LLM serving as the judge.
Method: The LLM was prompted to act as an ophthalmologist and shown the clinically relevant data from a given patient's previous visits. It was then prompted with the clinical exam data from the current visit and asked to generate a workup for that patient. DeepSeek-R1 compared each prediction with the ground truth, scoring relevance, appropriateness, specificity, and clinical context on a scale of 1-5. Data were acquired with UCI IRB approval, and all models operated within a secure environment.
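The judging step described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the criterion names come from the abstract, but the prompt wording, function names, and expected response format are assumptions.

```python
import re

# Rubric criteria named in the abstract; prompt text below is illustrative only.
CRITERIA = ["relevance", "appropriateness", "specificity", "clinical context"]

def build_judge_prompt(predicted_workup: str, ground_truth: str) -> str:
    """Assemble a judge prompt asking for a 1-5 score per criterion."""
    rubric = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "Compare the predicted workup to the ground-truth workup.\n\n"
        f"Predicted:\n{predicted_workup}\n\n"
        f"Ground truth:\n{ground_truth}\n\n"
        "Score each criterion from 1 (poor) to 5 (excellent):\n"
        f"{rubric}\n"
        "Answer as 'criterion: score' lines."
    )

def parse_scores(judge_response: str) -> dict[str, int]:
    """Extract 'criterion: score' pairs from the judge model's reply."""
    scores = {}
    for c in CRITERIA:
        m = re.search(rf"{re.escape(c)}\s*:\s*([1-5])", judge_response, re.IGNORECASE)
        if m:
            scores[c] = int(m.group(1))
    return scores
```

In practice, `build_judge_prompt` would be sent to the judge model (here, DeepSeek-R1) and its free-text reply passed to `parse_scores`; the strict 1-5 capture group rejects out-of-range values.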
Results: The LLM achieved a mean relevance score of 4.23 ± 1.07, indicating high accuracy in identifying clinically relevant information. Its specificity score of 3.73 ± 1.26 suggests it captured critical aspects of each patient's condition, while its appropriateness and clinical context scores were lower, at 3.67 ± 1.09 and 3.30 ± 1.53, respectively. These results suggest that the LLM can generate accurate workups but may lack completeness and contextualization. DeepSeek-R1's feedback indicates that improving the LLM's ability to integrate comprehensive information from varied patient data sources would enhance its overall performance.
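The per-criterion summaries reported above (mean ± standard deviation) can be computed from raw judge scores with a small helper. The scores in the usage example are hypothetical, not the study's data:

```python
from statistics import mean, stdev

def summarize(criterion_scores: dict[str, list[int]]) -> dict[str, tuple[float, float]]:
    """Mean and sample standard deviation of 1-5 judge scores, per criterion."""
    return {
        criterion: (round(mean(scores), 2), round(stdev(scores), 2))
        for criterion, scores in criterion_scores.items()
    }
```

For example, `summarize({"relevance": [5, 4, 3]})` returns `{"relevance": (4.0, 1.0)}`.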
Conclusions: This study demonstrates the potential of LLMs to support the care of patients with uveitis by generating accurate assessments and plans from clinical notes. While the LLM achieved promising results, further refinement is needed to improve its overall performance. These findings have implications for the integration of AI-powered tools into clinical workflows and underscore the need for careful handling of contextual information in decision support. Future studies should focus on fine-tuning the LLM, improving its generalizability across diverse patient populations, and evaluating its utility as a tool for clinicians in uveitis care.