A multi-source domain fine-tuning framework for deep generalization performance in physiological time series analysis
- Joachim A. Behar

By Eran Zvuloni, Guido Gagliardi, Antonio Horta Ribeiro, Antonio Luiz Pinho Ribeiro, Maarten De Vos and Joachim A Behar
This work is the result of an international collaboration between researchers from the Technion–Israel Institute of Technology, KU Leuven (Belgium), the University of Pisa (Italy), Uppsala University (Sweden), and the Federal University of Minas Gerais (Brazil).
Zvuloni E, Gagliardi G et al 2025 Mach. Learn.: Health https://doi.org/10.1088/3049-477X/ae2ac5
If there is one obstacle that keeps medical AI models from reliably helping patients in the real world, it is generalization. A model may achieve excellent performance on the dataset it was trained on, yet fail dramatically when faced with data from a different hospital, country, device, or patient population.
Our new study (Zvuloni et al 2025) introduces a simple but powerful idea: what if, instead of fine-tuning a model on one dataset, we fine-tune it on many independent datasets, each from a different domain? Our results show that this multi-source strategy can significantly boost out-of-distribution generalization across several major physiological time-series tasks, including ECG and EEG analysis.
The framework is evaluated on 14 ECG and EEG datasets comprising more than 2.4 million ECG recordings and 110,000 hours of EEG, across four clinically relevant tasks: sex identification, age estimation, atrial fibrillation detection, and sleep staging.
The Generalization Problem in Medical AI
Medical AI often performs extremely well on “in-distribution” data: samples collected in the same way, from the same source, and under the same conditions as the training dataset. But once deployed, these systems encounter data that differ in:
Patient demographics
Acquisition devices
Recording environments
Labeling standards
Sampling rates
Clinical protocols
Even subtle differences can break performance. This vulnerability is especially problematic in physiological monitoring (e.g., ECG, EEG), where signals are sensitive to noise, hardware, and patient characteristics.
While researchers have explored techniques like domain adaptation and transfer learning, these methods often assume access to target-domain data during training, something that is rarely feasible in practice.
A New Approach: Multi-Source Domain Fine-Tuning
The authors propose a remarkably pragmatic framework:
Start with a large pretrained model. For example, a model trained on millions of ECG recordings from the Brazilian Telehealth Network.
Fine-tune the model on many small, diverse datasets. These serve as multiple “source domains,” each introducing its own clinical style, population, and noise characteristics.
Evaluate on entirely unseen datasets. No overlap, no leakage, a true test of out-of-distribution generalization.
This approach, called Multi-Source Domain (MSD) Fine-Tuning, mirrors how humans learn: we become more robust not by seeing more examples from one source, but by experiencing variety.
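To make the intuition concrete, here is a minimal synthetic sketch of why pooling diverse source domains can suppress shortcut learning. It is not the study's pipeline (which fine-tunes deep networks on real ECG/EEG cohorts); it uses a toy logistic regression where each "domain" carries a true signal plus a dataset-specific artifact whose correlation with the label flips across domains. All names and the data-generating setup are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(shortcut_sign, n=500):
    """Toy domain: a weak 'physiological' feature plus a strong
    dataset-specific artifact whose label correlation varies by domain."""
    y = rng.integers(0, 2, n).astype(float)
    s = 2 * y - 1
    x_true = s + rng.normal(0, 1.0, n)                          # real signal
    x_short = 2.0 * shortcut_sign * s + rng.normal(0, 0.5, n)   # artifact
    return np.column_stack([x_true, x_short]), y

def fit_logreg(x, y, epochs=500, lr=0.1):
    """Plain gradient-descent logistic regression with a bias term."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    w = np.zeros(xb.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-xb @ w))
        w -= lr * xb.T @ (p - y) / len(y)
    return w

def accuracy(w, x, y):
    xb = np.hstack([x, np.ones((len(x), 1))])
    return float(np.mean((xb @ w > 0) == y))

# SSD: fine-tune on a single source whose artifact correlates with the label.
x, y = make_domain(+1)
w_ssd = fit_logreg(x, y)

# MSD: pool several sources whose artifacts point in different directions,
# so only the true signal is consistently predictive.
parts = [make_domain(s) for s in (+1, -1, +1, -1)]
x_msd = np.vstack([p[0] for p in parts])
y_msd = np.concatenate([p[1] for p in parts])
w_msd = fit_logreg(x_msd, y_msd)

# Unseen domain: the artifact now anti-correlates with the label.
x_ood, y_ood = make_domain(-1)
ssd_acc = accuracy(w_ssd, x_ood, y_ood)
msd_acc = accuracy(w_msd, x_ood, y_ood)
print(f"SSD OOD accuracy: {ssd_acc:.2f}")
print(f"MSD OOD accuracy: {msd_acc:.2f}")
```

The single-source model latches onto the artifact and collapses out of distribution, while the pooled multi-source model is forced to rely on the feature shared across domains.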

The Experiments: 14 Datasets, 4 Tasks, 320 Trained Models
The team performed one of the most extensive multi-domain evaluations to date:
ECG tasks: sex identification, age estimation, atrial fibrillation diagnosis
EEG task: sleep staging
Data: 12-lead ECG and PSG sleep studies
14 independent datasets, including PhysioNet and NSRR cohorts
Over 2.4M ECG recordings and 110k hours of EEG data
Each experiment compared performance between:
SSD – a single-source domain model
MSDₖ – the same model fine-tuned using k additional datasets
MSDₘ – a model fine-tuned using all datasets available for a task
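The three conditions above can be enumerated with a small sketch. The dataset identifiers here ("D0" through "D4") are hypothetical placeholders; the real study draws on named PhysioNet and NSRR cohorts, and MSDₘ simply coincides with MSDₖ at the largest available k.

```python
# Hypothetical dataset identifiers for one task.
base = "D0"                       # the single-source fine-tuning dataset
extra = ["D1", "D2", "D3", "D4"]  # additional source domains

conditions = {"SSD": [base]}
for k in range(1, len(extra) + 1):
    # MSD_k: the base dataset plus k additional source domains.
    conditions[f"MSD_{k}"] = [base] + extra[:k]
# MSD_m: all datasets available for the task.
conditions["MSD_m"] = [base] + extra

for name, datasets in conditions.items():
    print(name, datasets)
```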
Key Result #1: More Domains Lead to Better Generalization
Across nearly every experiment, performance improved as more fine-tuning datasets were added.
In 23 out of 24 experiments, MSD models outperformed the baseline SSD model.
Gains were largest for tasks with higher domain sensitivity, such as:
Atrial fibrillation detection
Sleep staging
Age estimation
Improvements reached up to +35% in OOD performance.

Key Result #2: Multi-Domain Training Aligns Latent Spaces
To understand why MSD improves generalization, the authors performed a latent-space analysis using:
Wasserstein distance to measure how different datasets’ feature embeddings are
t-SNE visualizations to show how latent clusters shift during fine-tuning
The findings:
Fine-tuning on more domains brings the target dataset’s latent space closer to that of the main training dataset.
Representations become more unified, less shortcut-driven, and more physiologically meaningful.
In other words, the model learns true biological features instead of dataset-specific artifacts.
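A minimal sketch of the latent-space comparison idea, under stated assumptions: embeddings are stored as (samples × dimensions) arrays, and distributional distance is estimated with the standard per-dimension empirical Wasserstein-1 estimator (for equal-size samples, the mean absolute difference of sorted values). The paper's exact estimator and embedding dimensions may differ; the Gaussian arrays below are synthetic stand-ins, not study data.

```python
import numpy as np

def w1_per_dim(a, b):
    """Empirical Wasserstein-1 distance per embedding dimension,
    averaged over dimensions. For equal-size samples, 1-D W1 equals
    the mean absolute difference of the sorted values."""
    return float(np.mean(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0))))

rng = np.random.default_rng(1)
ref = rng.normal(0.0, 1.0, size=(1000, 16))     # main training set embeddings
before = rng.normal(1.5, 1.0, size=(1000, 16))  # target embeddings, pre-MSD
after = rng.normal(0.3, 1.0, size=(1000, 16))   # target embeddings, post-MSD

w_before = w1_per_dim(ref, before)
w_after = w1_per_dim(ref, after)
print(f"W1 to reference before MSD fine-tuning: {w_before:.2f}")
print(f"W1 to reference after MSD fine-tuning:  {w_after:.2f}")
```

A shrinking distance after multi-source fine-tuning is the quantitative counterpart of the latent clusters merging in the t-SNE plots.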
Why This Matters: A Path Forward for Robust Medical AI
This study provides strong evidence that:
1. Diversity beats quantity
More varied datasets improve generalization more effectively than more samples from a single source.
2. Multi-source fine-tuning is practical
It does not require target-domain data, specialized adaptation losses, or adversarial training.
3. MSD can complement foundation models
Large self-supervised models (HuBERT-ECG, Sleep Foundation Models, wearable biosignal transformers) could integrate MSD during task-specific fine-tuning.
4. The community needs more open datasets
The authors’ results strongly argue for expanding access to diverse physiological signal repositories.
Limitations and Future Directions
The paper highlights several promising next steps:
Dataset selection algorithms: Not every dataset contributes equally. How do we pick the best subsets for fine-tuning?
Recording-level quality filters: Low-quality or mislabeled data might harm generalization.
Incremental learning: How should models be fine-tuned as new datasets become available over time?
Integration with scalable SSL foundation models: MSD could be layered on top of large pretrained architectures for even stronger robustness.
Expansion to additional modalities: PPG, EMG, wearable devices, multimodal systems.



