Clinical Validation Of Generative AI For Chest Radiograph Reporting

06/12/2026

Key Takeaways

AI-generated reports were similar to radiologist-written reports under the standard acceptability criterion, but were less likely to meet the revision-free standard.
For referable abnormalities, AI-generated reports were more sensitive and less specific than radiologist-written reports.
Most surveyed thoracic radiologists did not consider the AI-generated reports reliable enough to replace radiologist-written reports.

In a multicohort retrospective study of automated chest radiograph reporting, AI-generated reports reached 88.4% acceptability under the standard criterion, compared with 89.2% for radiologist-written reports. Thoracic radiologists reviewed reports across several care settings to judge whether each was clinically acceptable. The comparison centered on whether AI-generated text could meet routine report standards rather than simply identify findings. In direct comparisons across the non-ICU settings, the standard threshold showed no evidence of a difference in acceptability, while stricter review was less favorable to AI output. Comparable standard acceptability therefore did not extend to the stricter threshold or to replacement-level confidence.

Chest radiographs came from four settings: an ICU, an emergency department, health checkups, and an outpatient public dataset. Seven thoracic radiologists assessed the clinical acceptability of each report. Under the standard criterion, a report was acceptable if it needed no revision or only minor revision, whereas the stringent criterion required acceptability without revision. The study included radiographs from 1539 individuals with a median age of 55 years. Collection ran retrospectively from January 2020 through December 2022, with the outpatient radiographs drawn from the public dataset. The cohort comprised 656 male patients, 483 female patients, and 400 patients of unknown sex. Direct comparison with radiologist-written reports was limited to the three non-ICU settings. These settings and definitions framed both the overall acceptability analysis and the separate comparison of referable abnormality detection.
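The two acceptability criteria differ only in whether "minor revision" counts as acceptable. A minimal Python sketch of that distinction follows. The label names, the `acceptability_rates` helper, and the catch-all "not acceptable" category are assumptions made for illustration, and the ratings are made up rather than taken from the study.

```python
from collections import Counter

# Hypothetical rating labels. The article names only the first two
# categories; NOT_ACCEPTABLE is an assumed catch-all for everything else.
NO_REVISION = "acceptable_without_revision"
MINOR_REVISION = "acceptable_with_minor_revision"
NOT_ACCEPTABLE = "not_acceptable"


def acceptability_rates(ratings):
    """Return (standard_rate, stringent_rate) for a list of rating labels."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    total = len(ratings)
    # Stringent: only reports that stand without any revision count.
    stringent = counts[NO_REVISION] / total
    # Standard: reports needing minor revision also count as acceptable.
    standard = (counts[NO_REVISION] + counts[MINOR_REVISION]) / total
    return standard, stringent


# Made-up ratings for illustration only, not study data.
toy_ratings = [NO_REVISION] * 6 + [MINOR_REVISION] * 3 + [NOT_ACCEPTABLE] * 1
standard_rate, stringent_rate = acceptability_rates(toy_ratings)
print(f"standard: {standard_rate:.1%}, stringent: {stringent_rate:.1%}")
```

Because every report that passes the stringent criterion also passes the standard one, the stringent rate can never exceed the standard rate for the same set of ratings.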

Under the stringent criterion, AI-generated reports were acceptable without revision in 66.8% of cases, versus 75.7% for radiologist-written reports (P<.001). For referable abnormalities, AI-generated reports showed higher sensitivity than radiologist-written reports, 81.2% versus 59.4% (P<.001), but lower specificity, 81.0% versus 93.6% (P<.001). Acceptability comparisons used a generalized linear mixed model, and sensitivity and specificity comparisons used McNemar testing. A substantial proportion of AI-generated reports still needed minor revision. Performance therefore varied according to the threshold and outcome measured.
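For readers less familiar with these measures, the sketch below shows how sensitivity and specificity are computed from counts, and how a McNemar test compares two paired readers using only the cases on which they disagree. It uses the exact binomial form of the test, which is an assumption: the article does not say which variant the study used. The function names and all counts are hypothetical, not study data.

```python
from math import comb


def sensitivity(true_pos, false_neg):
    """Share of truly abnormal cases that were flagged as abnormal."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of truly normal cases that were reported as normal."""
    return true_neg / (true_neg + false_pos)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases only the first reader got right
    c: cases only the second reader got right
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only.
print(f"sensitivity: {sensitivity(80, 20):.1%}")
print(f"specificity: {specificity(90, 10):.1%}")
print(f"McNemar exact p: {mcnemar_exact(2, 12):.4f}")
```

The paired design matters here: both report types describe the same radiographs, so the test looks at disagreements between them rather than treating the two groups as independent samples.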

In the survey, most radiologists said the AI-generated reports were not yet reliable enough to replace radiologist-written reports. That view matched the mixed pattern across the comparison measures in non-ICU settings. AI output approached radiologist-written reports under the broader acceptability standard, but it fell short when reviewers required reports to stand without revision. Overall, the findings supported partial acceptability in some settings, but not report replacement in the view of the surveyed radiologists.
