Evaluation of large language model-generated medical information on idiopathic pulmonary fibrosis

  • Iván Cherrez-Ojeda
  • Björn Christian Frye
  • Andreas Hoheisel
  • Arturo Cortes-Telles
  • Karla Robles-Velasco
  • Heidegger N. Mateos-Toledo
  • Ricardo G. Figueiredo
  • Christopher J. Ryerson
  • Gabriela Rodas-Valero
  • Juan Carlos Calderón

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Background: The quality, reliability, readability, and guideline concordance of information on idiopathic pulmonary fibrosis (IPF) generated by AI-powered large language models (LLMs) such as ChatGPT-4 and Gemini 1.5 Pro have not been evaluated.

Research question: What are the quality, reliability, readability, and concordance with clinical guidelines of LLM-generated medical and clinical content on IPF?

Study design and methods: Responses from ChatGPT-4 and Gemini 1.5 Pro to 23 questions drawn from the ATS/ERS/JRS/ALAT IPF guidelines were compared. Six independent raters evaluated the responses for quality (DISCERN), reliability (JAMA Benchmark Criteria), readability (Flesch–Kincaid), and guideline concordance (scored 0–4). Descriptive analysis, the intraclass correlation coefficient, the Wilcoxon signed-rank test, and effect sizes (r) were calculated. Statistical significance was set at p < 0.05.

Results: According to the JAMA Benchmark Criteria, both ChatGPT-4 and Gemini 1.5 Pro provided partially reliable responses; readability evaluations showed that the output of both models was difficult to understand. Gemini 1.5 Pro provided significantly better treatment information (DISCERN score: 56 versus 43, p < 0.001) and had significantly higher concordance with international IPF guidelines than ChatGPT-4 (median 3.0 [3.0–3.5] vs. 3.0 [2.5–3.0], p = 0.0029).

Interpretation: Both models gave useful medical insights, but their reliability is limited. Gemini 1.5 Pro provided higher-quality information than ChatGPT-4 and was more concordant with international IPF guidelines. Readability analyses found that AI-generated medical information was difficult to understand, underscoring the need to refine it.

What is already known on this topic: Recent advances in AI, especially large language models (LLMs) powered by natural language processing (NLP), have revolutionized how medical information is retrieved and used.

What this study adds: This study highlights the potential and limitations of ChatGPT-4 and Gemini 1.5 Pro in generating medical information on IPF. Both models provided partially reliable information; however, Gemini 1.5 Pro demonstrated superior quality in treatment-related content and greater concordance with clinical guidelines. Nevertheless, neither model provided answers in full concordance with established clinical guidelines, and readability remained a major challenge for both.

How this study might affect research, practice or policy: These findings highlight the need to refine AI models as LLMs evolve into healthcare reference tools that help doctors and patients make evidence-based decisions.

Original language: English
Article number: 1618378
Journal: Frontiers in Artificial Intelligence
Volume: 8
Status: Published - 2025
