Radiologists Miss Deepfake Radiographs in Reader Study

05/11/2026

Key Takeaways

Only 7 of 17 blinded radiologists recognized that synthetic images were present, and post-disclosure radiologist accuracy was 75% in the GPT-4o dataset and 70% in the RoentGen dataset.
On the GPT-4o task, GPT-4o and GPT-5 reached 85% and 83%, while Llama 4 Maverick and Gemini 2.5 Pro reached 59% and 56%, and none detected all synthetic radiographs.
Recurring clues included bilateral symmetry, uniform grain or noise, unnatural soft-tissue texture, and smooth bone surfaces, and the authors concluded that training physicians and LLMs to recognize synthetic images is essential to mitigate risk.

In a retrospective diagnostic accuracy study conducted between April and August 2025, neither radiologists nor the tested multimodal LLMs readily distinguished synthetic radiographs from authentic images. During blinded review, only seven of 17 radiologists (41%) recognized that AI-generated radiographs were present in the dataset. Investigators compared authentic radiographs with synthetic images generated using GPT-4o and, in a separate chest-focused phase, with images generated using RoentGen. The study focused on image authenticity rather than broader clinical performance, with readers and models reviewing mixed sets of authentic and synthetic images. Detection remained imperfect even after readers were told that some radiographs were synthetic.
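To show how much uncertainty sits behind a proportion drawn from 17 readers, the seven-of-17 figure can be recomputed with an interval around it. This is a minimal sketch, not the authors' analysis: the Wilson score method and the helper name `wilson_interval` are assumptions made for illustration, and the article reports no interval for this figure.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: the Wilson method is chosen here for illustration only;
    the source does not say how any interval in the study was computed.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the article: 7 of 17 blinded radiologists
# recognized that synthetic images were present.
recognized, readers = 7, 17
proportion = recognized / readers
low, high = wilson_interval(recognized, readers)

print(f"Recognized synthetic images: {proportion:.0%}")
print(f"Approximate 95% interval: {low:.0%} to {high:.0%}")
```

With so few readers the interval is wide, so the 41% figure is best read as a rough estimate rather than a precise rate.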

Seventeen practicing radiologists from six countries, with varying levels of experience, participated in the reader study. The same classification task was also assigned to four multimodal LLMs: GPT-4o, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick. Phase 1 was a blinded review of 154 radiographs from multiple anatomic regions: 77 GPT-4o-generated synthetic images and 77 authentic images. Phase 2 followed disclosure of the study purpose and asked radiologists to label randomly presented radiographs as GPT-4o-generated or authentic. Phase 3 analyzed 110 chest radiographs (55 RoentGen-generated synthetic images and 55 authentic images) and framed the exercise as an authenticity task rather than a diagnostic interpretation exercise.

After disclosure, overall radiologist accuracy reached 75% for the GPT-4o dataset (95% CI, 68% to 81%). Accuracy in the RoentGen dataset was 70% (95% CI, 62% to 78%), and the comparison yielded P=.07. In the multimodal LLM panel, performance on the GPT-4o-generated radiograph task was 85% for GPT-4o and 83% for GPT-5. Llama 4 Maverick and Gemini 2.5 Pro were lower at 59% and 56%, respectively, with P<.001 for those model comparisons. No tested LLM detected all synthetic radiographs in either dataset.

Recurring visual cues in synthetic radiographs included bilateral symmetry and uniform grain or noise patterns. The authors also noted subtly unnatural soft-tissue textures and bone surfaces that appeared smoother than expected. They concluded that synthetic radiographs were not easily distinguishable from authentic radiographs by radiologists or LLMs. They also said that training physicians and LLMs to recognize synthetic images is essential to mitigate risk and mentioned a curated deepfake dataset as a training resource. Authentication remained difficult across the tested readers and models.
