Volumetric OCT Foundation Models (V-JEPA): From Slices to 3D Volumes

A recent paper explores using video-transformer–based foundation models on full optical coherence tomography (OCT) volumes, with the goal of learning from volumetric context rather than a single B-scan. The model, V-JEPA, is pretrained in a framework intended to capture 3D retinal structure from OCT volumes and is then fine-tuned for downstream disease detection. The authors frame the study as a head-to-head benchmark of this volume-native approach against existing slice- and image-based foundation models on retinal OCT disease-detection tasks.
The paper presents the architectural shift as a change in what context the model can access during representation learning. Slice-based strategies typically operate on a single central B-scan or treat each B-scan independently, which the authors describe as discarding information shared across adjacent slices in a scan. In contrast, the reported approach treats an OCT volume as a spatiotemporal-like sequence suitable for video models, allowing features to be learned from relationships across B-scans and across depth. In the authors’ framing, this “volumetric context” is intended to yield representations that depend less on any single slice and better reflect 3D retinal morphology, encoding structural patterns that emerge only when the full volume is modeled jointly.
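To make the “volume as video” idea concrete, here is a minimal sketch of how an OCT volume could be tokenized for a video transformer, with B-scans playing the role of frames and a strided 3D convolution cutting the volume into non-overlapping tubelet embeddings. The class name, tubelet size, and embedding dimension are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class VolumeTokenizer(nn.Module):
    """Hypothetical V-JEPA-style tokenizer for OCT volumes.

    Treats the stack of B-scans like a video clip: a Conv3d whose
    stride equals its kernel size cuts the volume into non-overlapping
    3D "tubelet" patches, each projected to a token embedding.
    """

    def __init__(self, embed_dim=768, tubelet=(4, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels=1,        # OCT B-scans are grayscale
            out_channels=embed_dim,
            kernel_size=tubelet,  # (B-scans per tubelet, height, width)
            stride=tubelet,
        )

    def forward(self, volume):
        # volume: (batch, 1, num_bscans, height, width)
        tokens = self.proj(volume)                # (B, D, d', h', w')
        return tokens.flatten(2).transpose(1, 2)  # (B, num_tokens, D)

# A 64-B-scan volume of 224x224 slices becomes one token sequence,
# so self-attention can relate patches across slices and depth.
volume = torch.randn(1, 1, 64, 224, 224)
print(VolumeTokenizer()(volume).shape)  # torch.Size([1, 3136, 768])
```

A slice-based model would instead tokenize each 224x224 B-scan on its own, so no token ever attends across neighboring slices; the volumetric tokenization is what lets attention span the third dimension.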
For benchmarking, the authors report results across multiple OCT datasets and detection tasks, summarizing performance as average AUROC with ranges across the evaluated settings. They report an average AUROC of 0.94 (range 0.80–0.99) for V-JEPA versus 0.90 (range 0.76–0.98) for the best-performing image-based foundation model, with a reported p value of <0.001 for the difference. The paper describes this evaluation as contrasting fine-tuned performance when models are trained on full OCT volumes versus on 2D, slice-based inputs. In aggregate, the authors present these findings as showing the volumetric pretraining approach matching or exceeding the compared slice- and image-based baselines under the reported conditions.
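The reporting convention (mean AUROC with a range across tasks, plus a paired significance test) is straightforward to reproduce. Below is a small sketch under stated assumptions: the per-task labels and scores are synthetic, and the Wilcoxon signed-rank test is an illustrative choice, since this summary does not say which test the paper used:

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def per_task_auroc(results):
    """results: {task: (y_true, y_score)} -> {task: AUROC}."""
    return {t: roc_auc_score(y, s) for t, (y, s) in results.items()}

# Synthetic stand-ins for per-task labels and two models' scores
# (illustrative only; not data from the paper).
labels = {t: rng.integers(0, 2, 200) for t in ["t1", "t2", "t3", "t4", "t5"]}
model_a = {t: (y, y + rng.normal(0, 0.8, y.size)) for t, y in labels.items()}
model_b = {t: (y, y + rng.normal(0, 1.2, y.size)) for t, y in labels.items()}

auroc_a = per_task_auroc(model_a)
auroc_b = per_task_auroc(model_b)

vals = np.array(list(auroc_a.values()))
print(f"mean AUROC {vals.mean():.3f} (range {vals.min():.2f}-{vals.max():.2f})")

# Paired, non-parametric comparison of the two models over the same tasks.
stat, p = wilcoxon(list(auroc_a.values()), list(auroc_b.values()))
print(f"Wilcoxon signed-rank p = {p:.4f}")
```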
Alongside benchmark metrics, the paper emphasizes that using full OCT volumes (rather than a single central B-scan) is intended to better capture volumetric retinal context for diagnostic classification tasks. The authors also note that longitudinal use is not evaluated in the reported results and describe future work that would compare OCT volumes over time.
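The input-level difference the authors emphasize is simple to state in code. A sketch, assuming typical (and purely illustrative) OCT tensor shapes:

```python
import torch

# One OCT scan: a stack of B-scans (num_bscans, height, width).
volume = torch.randn(64, 496, 512)

# Slice-based input: a 2D model sees only the central B-scan.
central_bscan = volume[volume.shape[0] // 2]  # (496, 512)

# Volume-native input: a video-style model sees every B-scan as a
# frame of one clip, shaped (batch, channels, depth, height, width).
clip = volume.unsqueeze(0).unsqueeze(0)       # (1, 1, 64, 496, 512)
```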
The study reports using five OCT datasets for fine-tuning and evaluation, including the open-access datasets CirrusOCT, Gamma, A2A OCT, and NEH-UT; it notes that the HYRD dataset may be made available for non-commercial academic use with permission from the authors. The benchmarking includes comparisons to retinal foundation models (RETFound and VisionFM) and a natural-image foundation model (DINOv2), as described in the paper’s abstract.
Key Takeaways:
- The paper reports a video-transformer–based foundation model trained on full 3D OCT volumes and benchmarked against slice- and image-based foundation models for retinal disease detection.
- Across the reported multi-dataset evaluations, the authors report that the volumetric approach matched or outperformed image-based models, with a higher average AUROC (0.94 vs 0.90 for the best image-based model) and a statistically significant difference (p < 0.001).
- The authors describe volume-native use cases, such as volumetric summaries and segmentation across B-scans, together with future longitudinal comparisons, as applications aligned with volume-level representations.