Radiologists Miss Deepfake Radiographs in Reader Study

Key Takeaways
- Only 7 of 17 blinded radiologists recognized that synthetic images were present, and post-disclosure radiologist accuracy was 75% in the GPT-4o dataset and 70% in the RoentGen dataset.
- On the GPT-4o task, GPT-4o and GPT-5 reached accuracies of 85% and 83%, Llama 4 Maverick and Gemini 2.5 Pro reached 59% and 56%, and no model detected all synthetic radiographs.
- Recurring clues included bilateral symmetry, uniform grain or noise, unnatural soft-tissue texture, and smooth bone surfaces, and the authors concluded that training physicians and LLMs to recognize synthetic images is essential to mitigate risk.
Seventeen practicing radiologists from six countries, with varying levels of experience, participated in the reader study. The same classification task was also given to four multimodal LLMs: GPT-4o, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick. In Phase 1, radiologists reviewed 154 radiographs from multiple anatomic regions, 77 GPT-4o-generated and 77 authentic, while blinded to the study purpose. Phase 2 followed disclosure of the study purpose and asked radiologists to label randomly presented radiographs as GPT-4o-generated or authentic. Phase 3 used 110 chest radiographs, 55 RoentGen-generated and 55 authentic, and was framed as an authenticity task rather than a diagnostic interpretation exercise.
After disclosure, overall radiologist accuracy was 75% (95% CI, 68 to 81) for the GPT-4o dataset and 70% (95% CI, 62 to 78) for the RoentGen dataset; the comparison yielded P=.07. Among the multimodal LLMs, accuracy on the GPT-4o-generated radiograph task was 85% for GPT-4o and 83% for GPT-5. Llama 4 Maverick and Gemini 2.5 Pro scored lower, at 59% and 56%, respectively, with P<.001 for those model comparisons. No tested LLM detected all synthetic radiographs in either dataset.
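For readers who want to see how an accuracy figure and its interval relate to raw counts, the sketch below computes a Wilson score interval for a single proportion. The article does not say which interval method the study used, so the Wilson method is an assumption here, and the counts (116 correct out of 154) are hypothetical values chosen for illustration, not study data. A pooled multi-reader analysis would likely also need to account for several readers scoring the same images, which this sketch does not do.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts for one reader on a 154-image set (NOT study data).
correct, total = 116, 154
accuracy = correct / total
low, high = wilson_interval(correct, total)
print(f"accuracy = {accuracy:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

The interval narrows as the number of labeled images grows, which is one reason a 154-image set yields a tighter estimate than a 110-image set at similar accuracy.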
Recurring visual cues in synthetic radiographs included bilateral symmetry and uniform grain or noise patterns. The authors also noted subtly unnatural soft-tissue textures and bone surfaces that appeared smoother than expected. They concluded that neither radiologists nor LLMs could easily distinguish synthetic radiographs from authentic ones. They added that training physicians and LLMs to recognize synthetic images is essential to mitigate risk, and pointed to a curated deepfake dataset as a training resource. Authentication remained difficult across the tested readers and models.