Why Your Deep Learning Model Isn't Learning: Diagnosing Data Problems in Medical Imaging
I built a clean, well-structured deep learning pipeline using MONAI (Medical Open Network for AI) on a public abdominal ultrasound dataset. The pipeline included:

- proper subject-grouped train/validation splits
- robust preprocessing
- carefully decoded segmentation masks
- sensible loss functions
- consistent evaluation

And the model still struggled to learn.

The interesting part isn't that the model underperformed. What mattered was the diagnosis: a series of simple checks that traced the problem back to the dataset, not the model. Those checks are useful far beyond medical imaging. They apply to almost any machine learning project.

If you're new to ML, this is a lesson worth carrying into every project: understand your data before you tune your model.

I set out to build a medical image segmentation tutorial. I ended up learning a more valuable lesson: no amount of careful engineering can rescue a model from a dataset that can't support the task.

By the end of this article, you'll understand:

- How to evaluate whether a dataset can actually support your task
- Why "the model isn't learning" is often a data problem
- How to rule out engineering bugs before blaming the data
- Practical diagnostics you can run in minutes
- Why synthetic training data often struggles in real-world deployment
- When to stop tuning and walk away from a dataset

This is not a beginner introduction to deep learning: it assumes familiarity with concepts like UNet architectures and training loops. But the data-quality lessons apply broadly to many ML projects.

The article is organized as follows:

- The Dataset
- Step 1: Rule Out the Pipeline Before Blaming the Data
  - Subject-grouped splits
  - Decoding masks correctly
  - Loss design and class weighting
- Step 2: The Model Still Struggled
- Step 3: Interrogating the Dataset
  - Diagnostic 1: What Does the Dataset Actually Contain?
  - Diagnostic 2: Do Synthetic and Real Images Look Similar?
  - Diagnostic 3: Can the gap be fixed by adding real data?
- Step 4: Knowing When to Stop
- A Practical Dataset Evaluation Checklist
- What I Would Try Next
- The Bigger Lesson

I used the US Simulation & Segmentation dataset, a public collection of abdominal ultrasound images with organ segmentation labels from Kaggle. It contains:

- 926 synthetic ultrasound images: generated by a ray-casting simulator from CT scans, with full organ annotations
- 617 real ultrasound images: from an actual ultrasound scanner
- Labels for 8 organs: liver, kidney, gallbladder, pancreas, spleen, bones, vessels, and adrenals

At first glance, the dataset looked ideal:

- more than 1,500 images
- multiple organ classes
- both synthetic and real ultrasound data

Whether it actually supported the task was a different question.

Ground rule: always rule out the pipeline before blaming the data. A model failing on buggy code looks exactly like a model failing on bad data, so the engineering has to be trustworthy before the data can be judged.

A common mistake in medical imaging is randomly splitting images into train and test sets. That approach is problematic because many frames come from the same patient. Those frames share anatomy, scanner settings, and noise patterns. If frames from the same patient appear in both the train and test sets, the model can partially memorize patient-specific patterns. Test scores look artificially good, even though the model may fail on truly unseen patients. This is called subject leakage.

The fix is to split by patient instead of by image, as the assign_splits function listed under "Subject-Grouped Splits" does. The assertion at the end of that function matters: if the split logic ever breaks, the pipeline fails loudly instead of silently producing misleading metrics.

The dataset stores labels as color-coded masks. Each organ corresponds to a different RGB color. Training requires converting those colors into integer class labels. A naïve implementation uses exact color matching, but resizing operations can slightly alter colors at mask boundaries.
A more robust approach maps each pixel to its nearest palette color, which is what the decode_mask function listed under "Decoding Masks Correctly" does.

Before training, it's worth visually checking a few decoded masks against the original images. This catches issues like incorrect palettes, RGB/BGR channel swaps, or resizing artifacts that silently corrupt labels. These bugs rarely throw errors. Instead, the model simply learns poorly. And "trained on wrong labels" looks exactly like "the model can't learn the data." Verifying masks early removes that uncertainty.

For training, I used standard MONAI segmentation losses. The goal wasn't to aggressively maximize performance, but to establish a stable and trustworthy baseline.

The training curves showed that the model optimized normally: the loss decreased consistently, and the validation Dice stabilized rather than diverging. This helped rule out optimization instability as the primary cause of poor final performance.

Three choices were deliberate. Two concern how the loss is composed and weighted; the third, include_background=False, is covered under "Loss Design and Class Weighting".

- Dice + Cross-Entropy combined: Cross-entropy keeps learning stable early on, while Dice directly rewards good region overlap. Together they balance each other.
- Class weighting for multi-class segmentation: With organs of very different sizes, an unweighted loss lets the model ignore the small, rare ones and still score well. Weighting rare-class mistakes more heavily pushes back against that.

The first experiment focused on liver segmentation, the simplest single-organ task in the dataset. Dice scores range from 0 (no overlap) to 1 (perfect overlap).

Qualitatively, the predictions often captured rough liver regions but failed at boundaries and were inconsistent across real scans.

Especially important:

- the model struggled even on synthetic in-domain data
- performance dropped further on real ultrasound images

At this point, two explanations were possible:

- the model or pipeline was flawed
- the dataset itself was limiting performance

Because the engineering had been carefully validated, the second possibility became worth investigating seriously.
That's where the real lesson began. Rather than endlessly tuning the model, the productive move is to turn the diagnostic lens on the dataset. Three simple checks revealed the real problem. None required retraining or expensive experiments.

The first step was simply plotting the dataset composition.

- 926 labeled synthetic images (the bulk of training data)
- Only 60 labeled real images: less than 4% of the dataset
- 557 unlabeled real images: real data exists, but without labels it can't be used for supervised training

This immediately changed the interpretation of the dataset. Although the dataset contains many real ultrasound scans, almost all labeled training data is synthetic. The model is effectively trained on synthetic ultrasound and expected to generalize to real ultrasound. That's a difficult transfer problem from the start.

The limitation is simple: the real images mostly don't have labels, so supervised training has very little real-world data to learn from.

Lesson: Before training anything, chart the dataset composition. A headline image count can be misleading. "1,500 images" sounds large until you discover that only a tiny fraction are labeled examples from the target domain.

The next question was whether the synthetic and real ultrasound images actually followed similar visual distributions. Plotting intensity histograms showed a clear mismatch:

- synthetic images clustered heavily near darker intensities
- real ultrasound images had broader mid-range intensity distributions

The synthetic simulator captured anatomical geometry reasonably well, but it didn't reproduce the texture and noise characteristics of real ultrasound:

- speckle patterns
- intensity falloff
- scanner-specific artifacts

This is the classic synthetic-to-real domain gap. The model learned features tuned to synthetic images and then encountered a substantially different distribution during evaluation. Poor transfer performance became expected rather than surprising.
Lesson: Whenever training and deployment happen on different domains (synthetic → real, scanner A → scanner B, hospital A → hospital B), measure the distribution shift directly. Simple histogram comparisons can reveal major problems in minutes.

The obvious next idea was: why not include some real labeled data during training? But before implementing that approach, it's worth checking how many distinct patients actually had labels.

Only four patients.

That result fundamentally changed the situation. Proper medical imaging evaluation requires subject-grouped train/test splits. But with only four patients, any evaluation becomes statistically unstable. Training on two or three patients and testing on one or two would produce highly unreliable metrics that depend heavily on which patient happened to be held out.

At that point, the dataset simply couldn't support trustworthy real-world evaluation.

Lesson: In medical imaging, count subjects, not images. The true size of a dataset is bounded by the number of independent patients, not the number of files.

At this point, additional tuning no longer made sense. The bottleneck was not the architecture, optimizer, or learning rate. The bottleneck was the dataset itself.

The pipeline was still valuable and reusable. But this particular dataset couldn't reliably support the intended segmentation task. That distinction matters: sometimes a problem is difficult but solvable, and sometimes the data simply can't support the conclusion you want to draw. Learning to recognize the difference is an important ML skill.
Before committing weeks to model development, these checks are worth running on any dataset:

- Chart the dataset composition: labeled vs unlabeled, class distribution, domain distribution
- Count subjects, not images: independent patients matter more than frame count
- Check class balance: rare classes are often ignored without weighting or sampling strategies
- Compare train and deployment distributions: especially for cross-domain problems
- Verify labels visually: catch preprocessing or annotation errors early
- Look for published baselines: low published performance may indicate dataset limitations

These checks take minutes and can save weeks of unnecessary tuning.

Improving results would likely require better data rather than a larger model. The next steps I'd prioritize:

- collecting more labeled real ultrasound scans, from more distinct patients
- improving annotation consistency
- semi-supervised learning to make use of the unlabeled real images
- domain adaptation between synthetic and real ultrasound

All of these target the actual bottleneck: data quality and data diversity.

In machine learning, it's easy to focus most of our attention on architectures, hyperparameters, optimization tricks, and newer models. But the dataset quietly defines the ceiling. A sophisticated model on weak data often disappoints, while a simpler model on strong data performs surprisingly well.

That was the real lesson from this project. The most valuable skill wasn't building the pipeline. It was diagnosing why the model couldn't succeed and being willing to trust what the data was saying.

The workflow (checking dataset composition, counting subjects, comparing distributions, ruling out engineering bugs, and deciding when to stop) transfers to almost any ML project. In many projects, better judgment about the data matters more than a better model.

The pipeline code and diagnostic notebooks are available at the MONAI Ultrasound Working Group repository.
Questions, corrections, and improvements are always welcome.

What We'll Cover:
The Dataset
Step 1: Rule Out the Pipeline Before Blaming the Data
Subject-Grouped Splits
    from sklearn.model_selection import GroupShuffleSplit


    def assign_splits(manifest, val_fraction=0.15, seed=42):
        train_data = manifest[manifest["orig_split"] == "train"]
        groups = train_data["subject_id"].values
        gss = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
        train_idx, val_idx = next(gss.split(X=train_data, y=None, groups=groups))
        train_subjects = set(train_data.iloc[train_idx]["subject_id"].unique())
        val_subjects = set(train_data.iloc[val_idx]["subject_id"].unique())
        # Crash loudly if leakage ever sneaks in
        assert train_subjects.isdisjoint(val_subjects), "Subject leak detected!"
        return train_subjects, val_subjects

Decoding Masks Correctly
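Before choosing a decoder, it can help to measure how many mask pixels actually fall off the palette after resizing. The sketch below is illustrative and is not the pipeline's own code: the helper name off_palette_fraction and the tiny two-color example are assumptions made here.

```python
import numpy as np


def off_palette_fraction(mask_rgb, palette):
    """Return the fraction of pixels whose RGB value is not an exact palette color."""
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    # exact[i, k] is True when pixel i equals palette color k in all three channels
    exact = (flat[:, None, :] == palette[None, :, :]).all(axis=-1)
    return 1.0 - exact.any(axis=1).mean()


# Hypothetical 1x4 mask: two clean pixels and two blended boundary pixels
demo_palette = np.array([[0, 0, 0], [255, 0, 0]], dtype=np.int32)
demo_mask = np.array(
    [[[0, 0, 0], [255, 0, 0], [128, 0, 0], [64, 0, 0]]], dtype=np.uint8
)
print(off_palette_fraction(demo_mask, demo_palette))  # 0.5
```

A nonzero fraction means exact color matching would leave those pixels without a valid class, which is the case that nearest-color decoding is meant to handle.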
    import numpy as np

    # One RGB color per class; the row index is the integer class label
    PALETTE = np.array([
        [0, 0, 0],
        [100, 0, 100],
        [255, 255, 255],
        [0, 255, 0],
        [255, 255, 0],
        [0, 0, 255],
        [255, 0, 0],
        [255, 0, 255],
        [0, 255, 255],
    ], dtype=np.int32)


    def decode_mask(mask_rgb):
        h, w = mask_rgb.shape[:2]
        flat = mask_rgb.reshape(-1, 3).astype(np.int32)
        # Squared RGB distance from every pixel to every palette color: shape (h*w, 9)
        d2 = ((flat[:, None, :] - PALETTE[None, :, :]) ** 2).sum(-1)
        # The index of the nearest palette color becomes the class label
        classes = d2.argmin(axis=1).astype(np.uint8)
        return classes.reshape(h, w)

Loss Design and Class Weighting
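The article does not say how the class weights were chosen. One common option is inverse pixel frequency, sketched below under that assumption. The function name inverse_frequency_weights and the toy label map are hypothetical and are not taken from the pipeline.

```python
import numpy as np


def inverse_frequency_weights(label_maps, num_classes):
    """Per-class weights proportional to 1 / pixel frequency, scaled to mean 1."""
    counts = np.zeros(num_classes, dtype=np.float64)
    for labels in label_maps:
        counts += np.bincount(labels.ravel(), minlength=num_classes)[:num_classes]
    freq = counts / counts.sum()
    # Classes that never appear get weight 0 instead of an infinite weight
    weights = np.where(freq > 0, 1.0 / np.maximum(freq, 1e-12), 0.0)
    return weights / weights[weights > 0].mean()


# Hypothetical 10x10 label map: 90 background pixels, 8 of class 1, 2 of class 2
demo = np.zeros((10, 10), dtype=np.uint8)
demo.flat[:8] = 1
demo.flat[8:10] = 2
w = inverse_frequency_weights([demo], num_classes=3)
print(w.round(3))  # the rarest class receives the largest weight
```

Weights like these are typically passed to the cross-entropy term of a combined loss. How they are passed depends on the loss implementation and its version, so the relevant API reference is worth checking.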

include_background=False for binary segmentation: In a single-organ task, background can be 85–90% of the pixels. Counting it in the loss drowns out the signal for the organ you actually care about, so it's better left out.

Step 2: The Model Still Struggled
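The results in this step are reported as Dice scores. For readers who want the metric spelled out, here is a minimal NumPy sketch for two binary masks. It is a plain reference implementation written for this explanation, not the evaluation code used in the pipeline, and the handling of two empty masks is a convention chosen here.

```python
import numpy as np


def dice_score(pred, target):
    """Dice overlap between two binary masks: 2 * |A and B| / (|A| + |B|)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks count as a perfect match
    return float(2.0 * intersection / denom)


# Toy example: a 2x2 target block and a prediction shifted one column to the right
target = np.zeros((4, 4), dtype=np.uint8)
target[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 2:4] = 1
print(dice_score(pred, target))  # 0.5
```

In the toy example the prediction covers half of the target and is the same size, so the score is 0.5, close to the level reported for the real ultrasound test set.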
Test set                    Liver Dice
Synthetic test set          ~0.68
Real ultrasound test set    ~0.48

Step 3: Interrogating the Dataset
Diagnostic 1: What Does the Dataset Actually Contain?
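The composition check can be as simple as a tally by domain and label status. The sketch below uses the counts reported in this article. The (domain, status) key layout is an assumption for illustration, and in a real project the tally would be built from the dataset manifest, not typed in.

```python
from collections import Counter

# Counts reported in this article; the (domain, status) keys are a hypothetical layout
composition = Counter({
    ("synthetic", "labeled"): 926,
    ("real", "labeled"): 60,
    ("real", "unlabeled"): 557,
})

total = sum(composition.values())
for (domain, status), n in composition.most_common():
    print(f"{domain:<10}{status:<10}{n:>5}  {100 * n / total:5.1f}%")

labeled_real_share = composition[("real", "labeled")] / total
print(f"labeled real share: {labeled_real_share:.1%}")  # 3.9%
```

The same tally works as input to a bar chart, but even the printed table makes the imbalance between labeled synthetic and labeled real data hard to miss.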

Diagnostic 2: Do Synthetic and Real Images Look Similar?
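A histogram comparison of this kind takes only a few lines. The sketch below pools pixel intensities per domain and reports a single distance between the two normalized histograms. The function names are assumptions, total variation distance is one of several reasonable choices, and the random arrays are placeholders standing in for image stacks, not samples from the dataset.

```python
import numpy as np


def intensity_histogram(images, bins=32):
    """Normalized histogram of pixel intensities (0-255) pooled over a list of images."""
    pixels = np.concatenate([np.asarray(img).ravel() for img in images])
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return hist / hist.sum()


def histogram_distance(p, q):
    """Total variation distance: 0 for identical histograms, 1 for disjoint ones."""
    return 0.5 * np.abs(p - q).sum()


# Placeholder data only: random arrays standing in for two image domains
rng = np.random.default_rng(0)
dark = [rng.integers(0, 80, size=(64, 64)) for _ in range(4)]
mid = [rng.integers(60, 200, size=(64, 64)) for _ in range(4)]

gap = histogram_distance(intensity_histogram(dark), intensity_histogram(mid))
print(f"distance: {gap:.2f}")
```

A distance near 0 suggests the two domains share an intensity profile. A large value is a cue to look at the images side by side before training further.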

Diagnostic 3: Can the gap be fixed by adding real data?
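The subject count, and what it implies for evaluation, can be checked with a few lines. The sketch below enters the per-subject frame counts reported in this article directly. With a manifest, they would come from a group-by on the subject ID. The leave-one-subject-out loop is an illustration added here, not an experiment from the original project.

```python
# Frames per labeled real subject, as reported in this article
frames_per_subject = {"h": 26, "a": 16, "g": 10, "b": 8}

total_frames = sum(frames_per_subject.values())
print(f"Labeled real images: {total_frames}")
print(f"Distinct subjects (labeled real): {len(frames_per_subject)}")

# Leave-one-subject-out: each subject takes a turn as the entire test set
for held_out, n_test in frames_per_subject.items():
    n_train = total_frames - n_test
    print(
        f"hold out {held_out}: train on {n_train} frames, test on {n_test} "
        f"({n_test / total_frames:.0%} of labeled real data)"
    )
```

Depending on which subject is held out, the test set ranges from 8 to 26 frames, all from a single patient, which is why any single split gives an unstable estimate.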
Labeled real images: 60
Distinct subjects (labeled real): 4
Frames per subject:
  subject h: 26
  subject a: 16
  subject g: 10
  subject b: 8

Step 4: Knowing When to Stop
A Practical Dataset Evaluation Checklist
What I Would Try Next
The Bigger Lesson