External Validation of Deep Learning for Barrett's Dysplasia

06/18/2026

Key Takeaways

External testing assessed classification across nondysplastic Barrett's esophagus, low-grade dysplasia, and high-grade dysplasia on whole slide images from three academic centers.
Specificity reached 94.8% for high-grade dysplasia, while class performance varied across the three histology categories.
Slide stain characteristics were normalized with cycle-generative adversarial networks before classification by an ensemble pipeline that paired a You Only Look Once model with ResNet101. The authors described overall accuracy as substantial.

In a multisite external validation of BEDDLM, a deep learning model for Barrett's esophagus dysplasia, investigators tested the previously cross-validated model on 489 whole slide images from three external academic centers. Against expert consensus grading, specificity for high-grade dysplasia reached 94.8% in the external cohort. The assessment covered nondysplastic Barrett's esophagus, low-grade dysplasia, and high-grade dysplasia on digitized slides.

The model was designed to predict dysplasia grade on whole slide images. Slides from the three external academic centers were digitized, and a consensus read by two expert study pathologists served as the criterion standard. By consensus histopathology, the cohort included 232 nondysplastic Barrett's esophagus (NDBE) slides, 117 low-grade dysplasia (LGD) slides, and 140 high-grade dysplasia (HGD) slides. Patients had a mean age of 66.9 years, and 84.7% were men.

Before classification, slide stain characteristics were normalized with cycle-generative adversarial networks. The ensemble framework then applied a You Only Look Once model followed by a ResNet101 classifier. Together, these steps formed the preprocessing and classification pathway for each whole slide image.

Performance was reported separately for each consensus histology class. For NDBE, sensitivity was 73.3% (95% CI, 67.09%-78.85%), specificity was 93.4% (95% CI, 89.62%-96.10%), and the F1 score was 0.81. For LGD, sensitivity was 84.6% (95% CI, 76.78%-90.62%), specificity was 80.6% (95% CI, 76.26%-84.54%), and the F1 score was 0.69. For HGD, sensitivity was 80.7% (95% CI, 73.19%-86.89%), specificity was 94.8% (95% CI, 91.97%-96.91%), and the F1 score was 0.83.

Across the three classes, specificity was highest for HGD (94.8%) and also exceeded 90% for NDBE (93.4%). LGD had the highest sensitivity (84.6%) but the lowest specificity (80.6%) and the lowest F1 score (0.69) among the reported categories.
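For readers who want to see how the per-class figures relate to one another, the following is a minimal sketch in Python. It treats each class as a one-versus-rest comparison and back-calculates approximate confusion counts from the reported class sizes, sensitivities, and specificities. The counts are inferred by rounding and were not reported in the summary above; the function and variable names are illustrative only.

```python
# Back-calculate one-vs-rest confusion counts from the reported rates.
# Class sizes, sensitivity, specificity, and F1 come from the article;
# the TP/FN/FP/TN counts are inferred by rounding and are an assumption.
TOTAL_SLIDES = 489

REPORTED = {
    "NDBE": {"n": 232, "sens": 0.733, "spec": 0.934, "f1": 0.81},
    "LGD": {"n": 117, "sens": 0.846, "spec": 0.806, "f1": 0.69},
    "HGD": {"n": 140, "sens": 0.807, "spec": 0.948, "f1": 0.83},
}


def reconstruct(n_pos, sens, spec, total=TOTAL_SLIDES):
    """Return inferred counts and the F1 score they imply."""
    n_neg = total - n_pos          # slides belonging to the other two classes
    tp = round(sens * n_pos)       # correctly labelled slides of this class
    tn = round(spec * n_neg)       # other-class slides not given this label
    fn = n_pos - tp
    fp = n_neg - tn
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn, "f1": f1}


results = {
    name: reconstruct(r["n"], r["sens"], r["spec"])
    for name, r in REPORTED.items()
}

for name, res in results.items():
    print(name, res["tp"], res["fn"], res["fp"], res["tn"], round(res["f1"], 2))
```

Under these assumptions, the F1 scores implied by the inferred counts round to the reported values of 0.81, 0.69, and 0.83. For LGD, the lower F1 score reflects a larger number of inferred false positives relative to the smaller class size, rather than lower sensitivity.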
