AI Fracture Detection on CT: Review Finds Moderate To Good Accuracy

Key Takeaways
- Across unselected cohorts, stand-alone AI showed moderate sensitivity and good specificity for CT fracture detection overall.
- Sensitivity was higher in selected cohorts and internal datasets than in external datasets, and performance varied by analysis level and anatomy.
- Commercial tools showed wider performance ranges and were described as tending to underperform pooled study results, while bias, selected populations, and limited external testing raised applicability concerns.
The systematic review and diagnostic test accuracy meta-analysis examined AI fracture detection on CT. It followed the Cochrane Handbook for DTA and PRISMA-DTA and used a modified QUADAS-2 tool for bias assessment. Searches from January 2010 onward covered Embase, MEDLINE, the Cochrane Library, Web of Science, and Google Scholar, with citation chasing and manual searches for commercial solutions. Two reviewers independently handled study selection, data extraction, and bias assessment. Of 7,683 identified articles, 44 studies entered the meta-analysis, and 14 commercial AI fracture detection solutions were identified, leaving a broad but methodologically mixed evidence base.
Selected cohorts reached a pooled sensitivity of 0.89 (95% CI, 0.80 to 0.94). Internal test datasets showed a sensitivity of 0.94 (95% CI, 0.88 to 0.97) and a specificity of 0.91 (95% CI, 0.86 to 0.94). External test datasets showed a sensitivity of 0.85 (95% CI, 0.77 to 0.91) and a specificity of 0.92 (95% CI, 0.89 to 0.95). Vertebra-wise and rib-wise analyses reached a specificity of 0.98, compared with 0.92 (95% CI, 0.89 to 0.95) for patient-wise analysis. By anatomy, sensitivity was highest for skull, rib, and pelvis fractures and lowest for spine fractures, showing that accuracy shifted across subgroups.
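For readers less familiar with these metrics, the following is a minimal sketch of how sensitivity, specificity, and a 95% confidence interval can be computed for a single test dataset. It is not the review's method: the pooled figures above come from meta-analytic models across many studies. The counts, the function names, and the choice of the Wilson score interval are all assumptions made for illustration.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def accuracy_summary(tp, fn, tn, fp):
    """Sensitivity and specificity with Wilson intervals from 2x2 counts."""
    return {
        # Sensitivity: share of true fractures that the tool flags.
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        # Specificity: share of fracture-free cases that the tool clears.
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts for one external test set; NOT data from the review.
summary = accuracy_summary(tp=85, fn=15, tn=184, fp=16)
for name, value in summary.items():
    print(name, value)
```

The unit being counted matters. The same scans can be tallied per patient, per vertebra, or per rib, and the larger number of fracture-free units in vertebra-wise or rib-wise tallies is one plausible reason specificity can look higher at those levels.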
Performance among commercial tools was more variable, with reported sensitivities ranging from 0.68 to 0.80 and specificities from 0.87 to 0.97. Those ranges reflected variation across identified solutions rather than a single commercial summary. The authors described commercially available solutions as tending to underperform pooled study results. In secondary reader comparisons, stand-alone AI slightly outperformed unaided human readers, while AI assistance produced little further improvement.
Bias, stringent patient selection, and lack of external testing also raised concerns about real-world applicability. Generalized I² statistics underscored the heterogeneity already seen across cohorts, datasets, and products. The authors called for less biased studies, stronger generalizability and robustness, and prospective trials that assess clinical outcomes. The reported accuracy estimates therefore remained difficult to translate directly to real-world settings.