Multisource domain training with retinal expert-in-the-loop for accurate diabetic retinopathy staging from fundus images
- Joachim Behar
- Feb 25
- 3 min read
Published in Machine Learning: Health https://iopscience.iop.org/article/10.1088/3049-477X/ae4984
By Renee Zacharowicz, Yevgeniy A Men, Luis Filipe Nakayama and Joachim A Behar
Research overview:

Open retinal fundus image datasets have been a gift to the diabetic retinopathy (DR) community—but they come with a catch: the labels are often inconsistent, noisy, or simply wrong. And when your goal is fine-grained DR staging (not just “disease vs. no disease”), label noise can quietly cap performance and undermine generalization across clinics, cameras, and populations.
In our latest paper, we introduce DRStageNet2, a DR staging model that improves on our prior work without changing the architecture: the gains come exclusively from a practical data refinement strategy that puts a retinal specialist directly “in the training loop.”
The core problem: you can’t fix 90k labels by hand
DR staging is typically reported on the ICDR scale (0–4). In many screening workflows, referable DR (rDR) begins at stage ≥2, while vision-threatening DR is generally associated with stages 3–4.
But across large public datasets, even small grading inconsistencies add up. Fully re-labeling entire datasets would require enormous expert time, often unrealistic for most teams.
So we asked a simple question:
What if we only ask an expert to review the most informative images—those the model strongly disagrees with?
Our approach: multisource training + selective expert reannotation
1) Train across diverse domains (and stress-test generalization)
We train using six public datasets spanning different geographies, clinical workflows, imaging devices, and acquisition conditions: EYEPACS, DDR, APTOS, BRSET, MESSIDOR2, and IDRiD.
To evaluate out-of-domain robustness, we use an all-against-one / leave-one-domain-out strategy: train on five datasets, test on the sixth, and repeat for each dataset.
This setup forces the model to face real-world heterogeneity: different cameras, different annotation styles, different populations, and different image quality profiles.
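The evaluation protocol is easy to sketch in code. Here is a minimal outline of the leave-one-domain-out loop; `train_model` and `evaluate` are hypothetical placeholders standing in for the actual training and scoring code, not functions from the paper:

```python
# Sketch of the all-against-one (leave-one-domain-out) protocol.
DOMAINS = ["EYEPACS", "DDR", "APTOS", "BRSET", "MESSIDOR2", "IDRiD"]

def leave_one_domain_out(datasets, train_model, evaluate):
    """Train on five domains, test on the held-out sixth; repeat for each.

    datasets: dict mapping domain name -> that domain's data
    train_model: callable taking a list of training sets, returning a model
    evaluate: callable taking (model, test_set), returning a score
    """
    results = {}
    for held_out in DOMAINS:
        train_sets = [datasets[d] for d in DOMAINS if d != held_out]
        model = train_model(train_sets)
        results[held_out] = evaluate(model, datasets[held_out])
    return results
```

Each domain thus serves exactly once as a fully unseen test distribution, which is what makes the reported gains a measure of generalization rather than in-domain fit.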
2) Use the model to triage likely label errors
Instead of random relabeling, we select misclassifications and send only those to a retinal specialist—blind to the original labels—then update the dataset and retrain.
We do this in three rounds, with a deliberate progression:
Round 1: prioritize large errors (e.g., predictions ≥2 ICDR stages away) to correct the biggest mistakes.
Rounds 2–3: prioritize near-boundary errors (≈1 stage away) to refine borderline cases and reduce ambiguity.
This turns “model errors” into a data-driven curriculum for expert attention.
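The selection rule itself is simple. Below is a small sketch of how the triage step could be expressed; the function name and interface are illustrative, not the paper’s actual code:

```python
import numpy as np

def triage_for_review(y_pred, y_true, min_gap):
    """Return indices of images whose predicted ICDR stage differs from the
    current label by at least `min_gap` stages.

    Round 1 would use min_gap=2 (large errors first); rounds 2-3 would use
    min_gap=1 (near-boundary cases).
    """
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    gap = np.abs(y_pred - y_true)
    return np.flatnonzero(gap >= min_gap)
```

Only the indices returned here go to the specialist, who regrades them blind to the original labels before the next training round.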
The headline result: big gains from a small amount of expert time
Across 91,984 training images, our retinal expert reviewed 3,984 images (about 4.3% of the dataset) and modified 2,592 labels (65% of reviewed images, or roughly 2.8% of all training labels).
That selective reannotation translated into measurable improvements in out-of-domain generalization:
Mean multiclass accuracy (MC-ACC): +3.6% across datasets
Quadratic weighted kappa (Q-kappa): improved from 0.89 → 0.91 on average
Gains reached up to +6.2% in one dataset
Improvements were statistically significant in aggregate
Why kappa matters: unlike plain accuracy, quadratic-weighted kappa accounts for how far off a prediction is—more aligned with clinical staging, where “one grade off” is not the same as “four grades off.”
What performance looks like across datasets
Here’s a snapshot of final DRStageNet2 performance across the six evaluation datasets:
Dataset | rDR AUC | MC-ACC | Q-kappa
--- | --- | --- | ---
EYEPACS | 0.99 | 0.87 | 0.89
BRSET | 0.99 | 0.90 | 0.92
DDR | 0.98 | 0.82 | 0.91
IDRiD | 0.99 | 0.84 | 0.91
MESSIDOR2 | 0.99 | 0.86 | 0.92
APTOS | 0.97 | 0.82 | 0.93
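As a quick check of the metric behind the Q-kappa column: two predictions with identical plain accuracy get very different quadratic-weighted kappas depending on how far off their errors are. This sketch uses scikit-learn’s `cohen_kappa_score` (an assumption about tooling; the paper’s exact evaluation code may differ):

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 3, 4]
one_off = [0, 1, 2, 3, 3]    # final prediction misses by one ICDR stage
four_off = [0, 1, 2, 3, 0]   # final prediction misses by four stages

# Plain accuracy is identical for both (4/5 correct), but quadratic
# weighting penalizes an error by its squared stage distance, so the
# four-stage miss is punished far more heavily.
qk_near = cohen_kappa_score(y_true, one_off, weights="quadratic")
qk_far = cohen_kappa_score(y_true, four_off, weights="quadratic")
```

This is exactly the property that makes quadratic-weighted kappa the more clinically aligned metric for ordinal staging.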
Why this matters beyond DR
A scalable alternative to “gold standard everything”
High-quality adjudicated grading is ideal—but full multi-expert adjudication pipelines can be out of reach for many teams. This work shows that one experienced retinal specialist, used strategically, can deliver meaningful gains while keeping expert time manageable.
A general recipe for messy real-world data
This pattern is not limited to retinal images:
Train a baseline model across heterogeneous sources
Identify where it fails the most (and where it’s uncertain)
Spend expert time only there
Iterate
In other words: use the model to tell you where the dataset is wrong.
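The four-step recipe above can be condensed into one loop. Everything here is a hypothetical placeholder (the function, its arguments, and the in-memory dataset representation), meant only to show the shape of the iteration:

```python
def expert_in_the_loop(dataset, train, expert_review, rounds=3):
    """Generic sketch of the recipe: train, triage, relabel, retrain.

    dataset: mutable list of (example, label) pairs
    train: callable returning a model (a callable example -> predicted label)
    expert_review: callable giving the expert's label, blind to the old one
    """
    model = train(dataset)
    for _ in range(rounds):
        # 1) let the current model flag likely label errors
        suspect = [i for i, (x, y) in enumerate(dataset) if model(x) != y]
        # 2) spend expert time only on the flagged examples
        for i in suspect:
            x, _old = dataset[i]
            dataset[i] = (x, expert_review(x))
        # 3) retrain on the refined labels and iterate
        model = train(dataset)
    return model
```

The model does the cheap, scalable part (finding suspicious labels); the expert does the scarce, high-value part (deciding what the label should be).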