Multicenter Clinical Validation of an Artificial Intelligence Diagnostic Classification Model for Laryngoscopy Images

Sampieri, Claudio; Mora, Francesco; Peretti, Giorgio; Larrosa, Marc; Vilaseca, Isabel; Avilés‐Jurado, Francesc X.; Ioppi, Alessandro; Bellini, Elisa; Alegre, Berta; Ruiz‐Sevilla, Laura; Srivastava, Rakesh; Sakellaridis, Athanasios C.; Razou, Andriana; Kotsis, Georgios P.; Moccia, Sara; Mattos, Leonardo S.; Baldini, Chiara

doi:10.1002/ohn.70153

Objective: To develop and externally validate a computer-aided diagnosis (CADx) model using artificial intelligence (AI) for classifying laryngeal lesions from laryngoscopy images into high-risk (HR) and low-risk (LR) categories. Study design: Retrospective multicenter development of a CADx model and external validation on independent cohorts. Setting: Multicenter tertiary referral hospitals (Italy, India, China, Greece, and Spain). Methods: More than 20,000 images derived from laryngoscopic examinations were retrieved. Images were annotated based on histopathology or expert consensus. A deep learning model was trained using an internal dataset and evaluated on 2 external datasets to assess generalizability. The CADx model classifies only images containing visible lesions, discriminating between LR and HR categories. Diagnostic performance was measured using standard metrics, including accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). Model performance was compared with that of physicians of varying expertise and ChatGPT-4o. Results: The CADx model achieved similar performance across internal and external datasets in distinguishing HR from LR lesions, with accuracy/AUC of 0.90/0.89 internally, 0.85/0.85 on the Greek dataset, and 0.88/0.88 on the Spanish dataset. The model's accuracy was statistically noninferior to that of otolaryngologists and expert laryngologists, and superior to that of general practitioners and ChatGPT-4o. Conclusion: This is a large multicenter clinical validation of a CADx model for laryngeal endoscopy, demonstrating generalizability and performance comparable to that of clinicians in discriminating between LR and HR lesions. The model's success supports its potential role in augmenting diagnostic capabilities, especially in resource-limited settings. A prospective multicenter clinical trial is underway to assess real-world clinical implementation.
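
The metrics named in the Methods (accuracy, precision, recall, F1-score, and AUC) can be illustrated with a minimal Python sketch for a binary HR-versus-LR task. It is not the authors' evaluation code. The labels, scores, the 0.5 decision threshold, and the function names `binary_metrics` and `roc_auc` are all hypothetical and chosen only for illustration, with HR coded as 1 and LR as 0.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with HR (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random HR image outscores a random LR image.

    Ties count as half a win (the Mann-Whitney U formulation of AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one HR and one LR example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 4 HR and 4 LR images.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]  # model's predicted P(HR)
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 threshold

print(binary_metrics(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two are commonly reported side by side.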