Presentation Information

[O-3-02]Comparing the Psychiatric Diagnostic Capabilities of ChatGPT and other Generalist and Simulated Reasoning-based Large Language Models

*Karthik V Sarma, Kaitlin E Hanss, Anne L Glowinski, Andrew Krystal (University of California, San Francisco (United States of America))

Keywords:

artificial intelligence, diagnosis, large language models, data science, informatics

Background: The last two years have seen the advent of large language models (LLMs), a form of artificial intelligence (AI) that has shown promise in natural language understanding. Recent studies have demonstrated that the majority (78.4%) of patients are willing to use ChatGPT for self-diagnosis. Here, we evaluate the capabilities of these models to make psychiatric diagnoses, with attention to the comparison between generalist models and simulated reasoning (SR) models.

Methods: Twenty-eight full-text case vignettes and their associated diagnoses were retrieved from the DSM-5-TR Clinical Cases book. Five generalist models were selected: OpenAI’s gpt-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, Mistral’s Large 2, and Meta’s Llama 3.1 405B. Two SR models were selected: OpenAI’s o1 and Anthropic’s Claude 3.7 Sonnet. For each model, the positive predictive value and sensitivity were calculated for every vignette based on the predicted diagnoses and then averaged across vignettes for a final result.
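The per-vignette metric computation described above can be sketched as follows. This is an illustrative reconstruction, not the study's code: the function names and the example diagnosis labels are hypothetical, and diagnoses are treated as sets of labels compared against the author-designated reference set.

```python
def vignette_metrics(predicted, reference):
    """Compute per-vignette sensitivity and PPV.

    Sensitivity: fraction of reference diagnoses correctly predicted.
    PPV: fraction of predicted diagnoses that are correct.
    (Illustrative helper; not from the published study.)
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, ppv


def macro_average(per_vignette):
    """Average per-vignette (sensitivity, PPV) pairs into a final result."""
    sens = [s for s, _ in per_vignette]
    ppvs = [p for _, p in per_vignette]
    return sum(sens) / len(sens), sum(ppvs) / len(ppvs)


# Example with hypothetical labels: the model predicts two diagnoses,
# one of which matches the single reference diagnosis.
s, p = vignette_metrics({"MDD", "GAD"}, {"MDD"})
print(s, p)  # 1.0 0.5
```

A per-vignette macro-average (rather than pooling counts across all vignettes) weights each case equally, which matches the "calculated for every vignette ... then averaged" description above.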

Results: The generalist LLMs exhibited a mean sensitivity (i.e., the proportion of author-designated diagnoses that were correctly predicted) between 71% and 75% and a mean positive predictive value (PPV, i.e., the proportion of predicted diagnoses that were correct) between 50% and 65%. No significant differences were found among the generalist models (ANOVA, p=0.48). The SR LLMs had a mean sensitivity of 72%-80% and a mean PPV of 70%-72%.

Conclusion: Both types of LLMs exhibited impressive out-of-the-box diagnostic performance. However, the generalist models exhibited significant overdiagnosis, producing an average of 0.5-1 incorrect diagnoses per correct diagnosis. SR models had improved PPV, with approximately 50% fewer incorrect diagnoses per correct diagnosis. Our findings raise concern that patients using these models for self-diagnosis may be presented with excessive pathologization of their concerns.