Ophthalmology foundation models for clinically significant age-related macular degeneration detection
- Joachim Behar
- Jan 16
By Benjamin A Cohen, Jonathan Fhima, Meishar Meisel, Baskin Meital, Luis Filipe Nakayama, Eran Berkowitz and Joachim A Behar
Part of the Lirot.ai project: https://www.aimlab-technion.com/lirot-ai

Do we really need “retina-only” foundation models to detect AMD?
Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss, and scaling reliable screening from routine retinal photos (digital fundus images, DFIs) is a major clinical and public-health opportunity. In our latest paper, we asked a deceptively simple question:
If we want strong AMD detection across hospitals, cameras, and countries, do we truly need ophthalmology-specific foundation models, or can general-purpose vision foundation models do just as well?

Summary Figure. A: The datasets used for this work. B: Pipeline for the creation of AMDNet. C: Results of the benchmark. D: Comparison of AMDNet with the state-of-the-art model for AMD detection, showing the impact of multi-source domain training.
The core idea: benchmark “general” vs “ophthalmic” foundation models properly
Self-supervised learning (SSL) has produced powerful Vision Transformer (ViT) backbones that transfer well across tasks. Retina foundation models (trained on huge collections of unlabeled DFIs) are often assumed to be best-in-class for retinal disease detection, but the evidence has been mixed, and direct apples-to-apples benchmarking is rare.
So we built a large, multi-dataset benchmark focused on clinically significant AMD, defined here as moderate-to-late AMD (wet/neovascular AMD, geographic atrophy, or large drusen).
Why “out-of-domain” performance matters
Most models look great when tested on a dataset similar to training data. Real clinics are messier: different patient populations, imaging protocols, field-of-view (FOV), cameras, and prevalence. That’s why we emphasized out-of-domain (OOD) evaluation—testing on datasets not seen during training—as a more realistic proxy for clinical robustness.
What we evaluated
7 datasets, ~70k expert-annotated fundus images
We curated 70,523 DFIs across seven independent datasets, spanning multiple geographies and acquisition settings.
6 SSL-pretrained ViTs (plus baselines)
We benchmarked four general SSL backbones pretrained on natural images (MAE, Mugs, iBOT, DINOv2) and two ophthalmic foundation models pretrained on large-scale DFIs (RETFound, VisionFM).
All models were standardized through the same preprocessing pipeline (crop → pad to square → resize to 518×518) and similar fine-tuning hyperparameters to keep comparisons fair.
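As a concrete illustration, here is a minimal sketch of that preprocessing step in Python. The intensity-threshold crop of the fundus region and the black-border padding are assumptions for illustration; the exact crop heuristic and interpolation used in the paper may differ.

```python
import numpy as np
from PIL import Image, ImageOps

TARGET_SIZE = 518  # ViT with 14x14 patches: 518 / 14 = 37 patches per side

def preprocess_dfi(path: str) -> Image.Image:
    """Crop the fundus region, pad to a square, and resize to 518x518."""
    img = Image.open(path).convert("RGB")

    # 1) Crop: keep the bounding box of pixels brighter than the near-black
    #    background (a simple heuristic, assumed here for illustration).
    arr = np.asarray(img)
    mask = arr.max(axis=2) > 10
    ys, xs = np.where(mask)
    if len(xs) > 0:
        img = img.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))

    # 2) Pad to a square with black borders so the retina is not distorted.
    side = max(img.size)
    img = ImageOps.pad(img, (side, side), color=(0, 0, 0))

    # 3) Resize to the ViT input resolution used in the benchmark.
    return img.resize((TARGET_SIZE, TARGET_SIZE), Image.BICUBIC)
```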
The headline result: DINOv2 matches (or beats) retina-specific foundation models
Across the benchmark, DINOv2 (pretrained on natural images) achieved performance similar to the ophthalmology-specific foundation models, and in many settings it was the best overall choice for robust OOD AMD detection. This directly challenges the assumption that in-domain pretraining is always necessary for top-tier performance on retinal tasks.
From benchmark to model: AMDNet
After identifying the strongest backbone (DINOv2), we built AMDNet using a training strategy that consistently improves cross-domain robustness:
Multi-source domain training (the “leave-one-domain-out” idea)
Instead of training on a single dataset and hoping it generalizes, AMDNet uses a multi-source domain (MSD) strategy: train on all but one dataset, test on the held-out dataset, and repeat. This setup both stresses and improves OOD performance.
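To make the setup concrete, below is a schematic leave-one-domain-out loop in PyTorch. The DINOv2 backbone is the public torch.hub release; the domain names, the `make_dataset` and `evaluate_auroc` helpers, the head design, and the training hyperparameters are placeholders for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader

# Placeholder names standing in for the benchmark's seven datasets.
DOMAINS = ["domain_A", "domain_B", "domain_C", "domain_D",
           "domain_E", "domain_F", "domain_G"]

class AMDClassifier(nn.Module):
    """DINOv2 backbone + linear head for binary AMD detection (a sketch;
    the paper's head design and fine-tuning schedule may differ)."""
    def __init__(self):
        super().__init__()
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        self.head = nn.Linear(self.backbone.embed_dim, 1)  # clinically significant AMD logit

    def forward(self, x):                   # x: (B, 3, 518, 518)
        return self.head(self.backbone(x))  # CLS embedding -> logit

def leave_one_domain_out(held_out, make_dataset, evaluate_auroc, epochs=5):
    """Train on every domain except `held_out`, evaluate on the held-out one.

    `make_dataset` and `evaluate_auroc` are hypothetical helpers that return an
    (image, label) Dataset and compute AUROC on a DataLoader, respectively.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_ds = ConcatDataset([make_dataset(d) for d in DOMAINS if d != held_out])
    train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)

    model = AMDClassifier().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    loss_fn = nn.BCEWithLogitsLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            logits = model(images.to(device)).squeeze(1)
            loss = loss_fn(logits, labels.float().to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Out-of-domain score: the held-out dataset was never seen during training.
    return evaluate_auroc(model, DataLoader(make_dataset(held_out), batch_size=16))

# Every dataset takes a turn as the unseen test domain:
# ood_auroc = {d: leave_one_domain_out(d, make_dataset, evaluate_auroc) for d in DOMAINS}
```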
How much did AMDNet improve?
Compared with DINOv2 trained only on AREDS, AMDNet improved performance by ~2.1%.
And compared with a widely used open-source AMD detection approach (DeepSeeNet), AMDNet’s AUROC was ~10.6% higher on average, a large and clinically meaningful gap.
A new contribution: BRAMD (Brazil AMD dataset), open-access
A major barrier to generalizable medical AI is geographic bias: models trained on a narrow set of populations may underperform elsewhere.
To help address this, we’re releasing BRAMD, a new open-access dataset from Brazil: 587 DFIs from 472 patients, with AMD labels confirmed via clinical exams supported by OCT.
A few BRAMD specifics:
AMD cohort: 295 images, ages 54–94, including 237 wet AMD cases
Controls: 292 images from diabetic retinopathy patients, ages 18–91
Devices: Canon CR and Nikon NF5050 (two resolutions)
What the model gets wrong (and why that’s useful)
We didn’t stop at top-line AUROC. We also looked at failure modes to understand clinical edge cases.
Different AMD subtypes behave differently
On AREDS, AMDNet was much better at detecting geographic atrophy (GA) and neovascular AMD (NVAMD) than large drusen (LD), consistent with LD being subtler and more easily confused with other macular changes.
“Lookalike” comorbidities drive false positives
On a multi-disease dataset, false positives clustered around conditions with AMD-like appearances, such as macular scars and retinal pigment epithelium changes, highlighting where clinicians might want extra caution or additional imaging.
Explainability: heatmaps show attention on the macula
Using Grad-CAM-style attention maps, we observed that AMDNet's focus tends to concentrate on the macular region, even in healthy eyes, and that it highlights lesion regions in AMD examples.
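For readers who want to produce similar visualizations, here is a hedged Grad-CAM-style sketch for a ViT classifier such as the AMDClassifier sketched above: it hooks the patch tokens of the last transformer block, weights them by their pooled gradients, and upsamples the result to image size. The attribute names (e.g., `model.backbone.blocks`) assume the DINOv2 hub backbone without register tokens and are not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def gradcam_heatmap(model, image):
    """Grad-CAM-style map from the last ViT block's patch tokens.

    `model` is assumed to be the AMDClassifier sketch above; `image` is a
    (3, H, W) tensor on the same device as the model. This is not the paper's
    exact explainability pipeline.
    """
    acts, grads = {}, {}

    def fwd_hook(_module, _inputs, output):
        acts["tokens"] = output                       # (1, 1 + N, D): CLS + patch tokens
        output.register_hook(lambda g: grads.update(tokens=g))

    handle = model.backbone.blocks[-1].register_forward_hook(fwd_hook)
    logit = model(image.unsqueeze(0))                 # (1, 1) AMD logit
    logit.sum().backward()
    handle.remove()

    a = acts["tokens"][0, 1:, :].detach()             # drop CLS -> (N, D) patch tokens
    g = grads["tokens"][0, 1:, :]                     # matching gradients
    weights = g.mean(dim=0)                           # pooled gradient per channel
    cam = F.relu((a * weights).sum(dim=1))            # (N,) patch-level importance

    side = int(cam.numel() ** 0.5)                    # 518 / 14 = 37 patches per side
    cam = cam.reshape(1, 1, side, side)
    cam = F.interpolate(cam, size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    cam = cam.squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```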
Limitations (and the next steps)
We’re excited about the results, but we’re also explicit about what remains to be solved:
Broader validation: even 7 datasets isn’t “the world.” More medical centers, camera types, and real-world distributions are needed.
Clinical interpretability: AMDNet does not explicitly segment clinical objects (drusen/GA), which may matter for trust and decision support.
Generalizability of the conclusion: while in-domain pretraining didn’t help here, more downstream ophthalmic and systemic tasks should be benchmarked to map where domain-specific pretraining does add value.
Takeaways
General vision foundation models can be excellent retinal backbones. In our benchmark, DINOv2 (natural images) matched or surpassed retina-specific foundation models for moderate-to-late AMD detection.
Training strategy matters. Multi-source domain training yielded an additional boost (AMDNet) and improved robustness across unseen datasets.
Data diversity is everything. BRAMD expands representation to a Brazilian cohort and is released open-access to support future, fairer evaluation.