Clinical Validation of Generative AI for Chest Radiograph Reporting

Key Takeaways
- AI-generated reports were similar to radiologist-written reports under the standard acceptability criterion, but were less likely to meet the revision-free standard.
- For referable abnormalities, AI-generated reports were more sensitive and less specific than radiologist-written reports.
- Most surveyed thoracic radiologists did not consider the AI-generated reports reliable enough to replace radiologist-written reports.
The study included chest radiographs from 1539 individuals (median age, 55 years): 656 male patients, 483 female patients, and 400 patients of unknown sex. The radiographs came from four contexts: an ICU, an emergency department, health checkups, and outpatient care. Collection ran retrospectively from January 2020 through December 2022, and the outpatient radiographs were sourced from a public dataset. Seven thoracic radiologists assessed the clinical acceptability of each report. Under the standard criterion, a report counted as acceptable if it needed no revision or only minor revision, whereas the stringent criterion required acceptability without revision. Direct comparison with radiologist-written reports was limited to the three non-ICU contexts. These settings and definitions framed both the overall acceptability analysis and the separate comparison of referable abnormality detection.
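The two acceptability criteria differ only in how a "minor revision" rating is counted, which a short sketch can make concrete. The rating names, the catch-all `NOT_ACCEPTABLE` category, and the toy ratings below are assumptions for illustration; the study's actual rating scale and data are not given here.

```python
from enum import Enum


class Rating(Enum):
    # Hypothetical rating labels; the study's exact scale is not stated here.
    ACCEPTABLE_NO_REVISION = 1
    ACCEPTABLE_MINOR_REVISION = 2
    NOT_ACCEPTABLE = 3  # assumed catch-all for any lower rating


def meets_standard(rating: Rating) -> bool:
    """Standard criterion: acceptable without revision or with minor revision."""
    return rating in (Rating.ACCEPTABLE_NO_REVISION,
                      Rating.ACCEPTABLE_MINOR_REVISION)


def meets_stringent(rating: Rating) -> bool:
    """Stringent criterion: acceptable without revision only."""
    return rating is Rating.ACCEPTABLE_NO_REVISION


def acceptability_rates(ratings):
    """Return (standard_rate, stringent_rate) for a non-empty list of ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    n = len(ratings)
    standard = sum(1 for r in ratings if meets_standard(r)) / n
    stringent = sum(1 for r in ratings if meets_stringent(r)) / n
    return standard, stringent


# Toy ratings, invented for illustration only.
toy = [
    Rating.ACCEPTABLE_NO_REVISION,
    Rating.ACCEPTABLE_NO_REVISION,
    Rating.ACCEPTABLE_MINOR_REVISION,
    Rating.NOT_ACCEPTABLE,
]
standard_rate, stringent_rate = acceptability_rates(toy)
print(standard_rate, stringent_rate)
```

Because every report that passes the stringent criterion also passes the standard one, the stringent rate can never exceed the standard rate, and the gap between them is the share of reports rated as needing minor revision.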
Under the stringent criterion, AI-generated reports were acceptable without revision in 66.8% of cases, versus 75.7% for radiologist-written reports (P<.001). For referable abnormalities, AI-generated reports showed higher sensitivity than radiologist-written reports (81.2% versus 59.4%; P<.001) but lower specificity (81.0% versus 93.6%; P<.001). Acceptability comparisons used a generalized linear mixed model, and sensitivity and specificity comparisons used McNemar testing. A substantial proportion of AI-generated reports still needed minor revision, so how the AI compared with radiologists depended on which acceptability threshold and which outcome was measured.
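As a rough illustration of the paired comparison described above, the sketch below computes sensitivity and specificity from per-case labels and then runs an exact two-sided McNemar test on the discordant pairs. It is a minimal sketch under stated assumptions: the function names and toy data are hypothetical, and the summary does not say whether the study used the exact or the chi-square form of the test.

```python
from math import comb


def sensitivity_specificity(truth, flagged):
    """Return (sensitivity, specificity) for paired boolean lists.

    truth[i]   -- True if case i has a referable abnormality
    flagged[i] -- True if the report for case i flags one
    Assumes at least one positive and one negative case.
    """
    pairs = list(zip(truth, flagged))
    tp = sum(1 for t, f in pairs if t and f)
    fn = sum(1 for t, f in pairs if t and not f)
    tn = sum(1 for t, f in pairs if not t and not f)
    fp = sum(1 for t, f in pairs if not t and f)
    return tp / (tp + fn), tn / (tn + fp)


def mcnemar_exact_p(correct_a, correct_b):
    """Exact two-sided McNemar p-value for paired correct/incorrect calls.

    Only discordant pairs (one reader correct, the other not) carry
    information; under the null they split 50/50 (binomial test).
    """
    pairs = list(zip(correct_a, correct_b))
    only_a = sum(1 for a, b in pairs if a and not b)
    only_b = sum(1 for a, b in pairs if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data, invented for illustration only (not study data).
truth = [True, True, True, True, False, False, False, False]
ai_flagged = [True, True, True, False, True, False, False, False]
sens, spec = sensitivity_specificity(truth, ai_flagged)
print(sens, spec)

# Per-case correctness on abnormal cases only, as a sensitivity comparison.
ai_correct = [True, True, True, True, True, True]
rad_correct = [False, False, False, False, False, True]
print(mcnemar_exact_p(ai_correct, rad_correct))
```

To compare sensitivity, the test is applied to cases with a referable abnormality; to compare specificity, it is applied to cases without one. The test fits this design because both report types describe the same radiographs, so the observations are paired rather than independent.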
In the survey, most radiologists said the AI-generated reports were not yet reliable enough to replace radiologist-written reports. That view matched the mixed pattern across the comparison measures in non-ICU settings. AI output approached radiologist-written reports under the broader acceptability standard, but it fell short when reviewers required reports to stand without revision. Overall, the findings supported partial acceptability in some settings, but not report replacement in the view of the surveyed radiologists.