Tags: mlops, gcp, vertex-ai, infrastructure, gan, deployment

The 90% Nobody Talks About

I built a multimodal GAN and deployed it on GCP Vertex AI. The model took 2 weeks. Everything else took 5 months.

April 4, 2026 · Aditya Patel

Every ML tutorial ends the same way: you train a model, the loss curve goes down, and the author writes "and that's it!" as if you've just finished building something real.

You haven't.

Last Tuesday I wrote about why I'm betting on diffusion models for financial forecasting. But before I get into model architectures and denoising networks, let me tell you about what actually matters: the 90% around the model that nobody writes Medium posts about.

I spent months building a multimodal GAN called SynMedix that generates synthetic patient data: realistic vital signs, lab results, medication histories, and clinical notes, all tied to a single fictional patient. The model itself? I had a working prototype in about two weeks. The other five months were spent on everything else. The infrastructure. The deployment. The parts that don't fit into a conference paper abstract.

This is the story of that 90%.


The Paper Made It Look Easy

The architecture was inspired by a handful of papers on conditional GANs for tabular data. The core idea isn't complicated: train a generator to produce rows of structured patient data that a discriminator can't distinguish from real records. Add a text module for clinical notes using PaLM-E embeddings, and you've got a multimodal system.

On paper, this fits neatly in a diagram. One box for the generator, one for the discriminator, a couple of arrows, a loss function. Clean. Elegant. Publishable.

In practice, every arrow in that diagram is a month of debugging.


Problem 1: The Data Pipeline Was the Real Project

Before I could train anything, I needed clean, structured training data. If you've never worked with healthcare data, let me paint a picture of the chaos.

A single patient's lab results might report hemoglobin in g/dL at one hospital and g/L at another. Blood pressure can show up as "120/80", "120 / 80", "120-80", or in two separate columns. Date formats vary between institutions, and sometimes within the same institution. ICD codes for the same condition differ between ICD-9 and ICD-10. Missing values aren't labeled "NaN"; they're blank cells, dashes, "N/A", "not performed", or just absent entirely.

I spent three full weeks just building a preprocessing pipeline that could:

  • Normalize lab values across different measurement standards and unit systems

  • Parse and standardize date formats into a common schema

  • Validate clinical code mappings between ICD-9 and ICD-10

  • Handle missing values with clinically appropriate imputation (you can't just mean-fill a blood pressure reading)

  • Flag and quarantine records that failed validation checks
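To make the blood pressure chaos concrete, here's a sketch of the kind of parser that layer needed. The function name, the missing-value markers, and the sanity-check thresholds are illustrative, not the production code:

```python
import re

def parse_blood_pressure(raw):
    """Parse messy blood pressure strings like '120/80', '120 / 80', '120-80'.

    Returns (systolic, diastolic) as ints, or None if the value is
    missing, unparseable, or fails a basic plausibility check.
    """
    if raw is None:
        return None
    text = str(raw).strip().lower()
    # Many "missing" markers in real exports are not NaN.
    if text in {"", "n/a", "na", "-", "not performed"}:
        return None
    match = re.match(r"^(\d{2,3})\s*[/\-]\s*(\d{2,3})$", text)
    if match is None:
        return None
    systolic, diastolic = int(match.group(1)), int(match.group(2))
    # Quarantine physiologically implausible readings instead of imputing.
    if not (50 <= systolic <= 300 and 20 <= diastolic <= 200):
        return None
    return systolic, diastolic
```

Returning None for implausible values (rather than guessing) is what fed the quarantine step in the last bullet.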

This pipeline, not the GAN architecture, ended up being the most reusable piece of the entire project. I've used versions of it in two other projects since.

The lesson was immediate: the model doesn't care about your research contribution if the data is garbage. And making the data not-garbage was harder than writing the model code.


Problem 2: GCP Vertex AI Has Opinions

I chose Google Cloud's Vertex AI for deployment because it promised managed training and serving. Spin up a training job, point it at your data, get a deployed endpoint. That was the pitch.

What I didn't anticipate was how opinionated the platform is about how you structure your code.

Vertex AI wants your training job packaged as a Python module with a very specific entrypoint structure: not just a script, but a module with __init__.py, a task.py, and a particular argument-parsing pattern. It wants your data in Cloud Storage buckets following particular naming conventions. It wants your Docker containers structured a certain way, with specific base images and environment configurations.
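The shape it wants looks roughly like this (the flags and directory layout here are illustrative; the one real detail is AIP_MODEL_DIR, the environment variable Vertex AI sets to tell your job where to write the model):

```python
# trainer/task.py -- the entrypoint module Vertex AI invokes as
# `python -m trainer.task --epochs ...`
# (package layout: trainer/__init__.py, trainer/task.py)
import argparse
import os


def get_args():
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags, not a config file.
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--batch-size", type=int, default=64)
    # AIP_MODEL_DIR is set by the platform and points at a GCS path.
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("AIP_MODEL_DIR", "/tmp/model"),
    )
    return parser.parse_args()


def main():
    args = get_args()
    # ... build data loaders, construct the GAN, run the training loop,
    # then write checkpoints to args.model_dir ...
    print(f"training for {args.epochs} epochs, saving to {args.model_dir}")
```

If your local code reads a YAML config and writes wherever it likes, all of that has to be funneled through this shape.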

None of this was in the original plan. My local training code worked perfectly. It was clean, well-tested Python. But Vertex AI didn't care about "clean"; it cared about conforming to its template.

Three weeks of refactoring. Not improving the model. Not tuning hyperparameters. Just restructuring perfectly working code to match a platform's expectations. I rewrote the training entrypoint four times before it passed Vertex AI's validation.

If you're planning to deploy on any managed ML platform (Vertex AI, SageMaker, Azure ML), budget at least two to three weeks just for the platform integration. It's never "just a deployment."


Problem 3: The 1 AM Quota Incident

My first real training run on Vertex AI required GPUs. Specifically, I needed NVIDIA T4s for the initial experiments and eventually A100s for the full training runs. GCP allocates GPU resources through a quota system that's tied to your project, your region, and your billing account.

Here's what happened: I submitted my training job at about 10 PM after a long day of fixing Docker configurations. The job went to "pending" status. I waited. And waited. No error message. No warning. Just "pending."

At 1 AM, after checking every log file I could find, I finally thought to check the IAM quota dashboard manually. My project had a default GPU quota of zero in the region I'd selected. The job was never going to run. It was just going to sit in the queue forever, silently.

No error in the training logs. No email notification. No alert. Just a job that looked like it was waiting for resources but was actually impossible.

The fix took 24 hours (quota increase requests aren't instant), required me to restructure my training configuration to target a different region with available quota, and taught me to add quota checking to my pre-deployment checklist.

What I learned: silent failures are the most dangerous kind. If something is going to fail, I need it to fail loudly, immediately, and with a clear error message. I now build quota checks, resource validation, and pre-flight tests into every deployment pipeline before anything gets submitted.
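The quota check in that pre-flight list boils down to a few lines. This sketch parses the output of `gcloud compute regions describe REGION --format=json`; the NVIDIA_T4_GPUS metric name is GCE's real one, but the function name and error wording are my own:

```python
def check_gpu_quota(region_json, metric="NVIDIA_T4_GPUS", needed=1):
    """Fail loudly before submitting a job the region can't run.

    region_json: parsed JSON from
        gcloud compute regions describe REGION --format=json
    Raises RuntimeError if the quota headroom is insufficient,
    otherwise returns the available headroom.
    """
    for quota in region_json.get("quotas", []):
        if quota["metric"] == metric:
            available = quota["limit"] - quota["usage"]
            if available < needed:
                raise RuntimeError(
                    f"{metric}: need {needed}, only {available} available "
                    f"(limit={quota['limit']}, usage={quota['usage']})"
                )
            return available
    raise RuntimeError(f"{metric} not present in region quota list")
```

A default quota of zero now raises immediately, instead of producing a job that sits in "pending" forever.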


Problem 4: Containers, Dependencies, and the CUDA Mismatch

The GAN used PyTorch with CUDA for GPU-accelerated training. My local development environment ran CUDA 11.7. The Vertex AI container I selected used CUDA 11.8. This shouldn't matter, right? Minor version difference. Compatible according to the docs.

It wasn't compatible.

The model trained fine locally. In the cloud, it crashed with cryptic errors deep in PyTorch's CUDA runtime: the kind of error messages that give you a hex address and a signal number and nothing else useful. No "your CUDA versions don't match." No "please use container X instead of container Y."

Debugging this took four days. I'll spare you the full archaeology, but the short version is: the PyTorch build I had locally was compiled against a specific cuDNN version that didn't exist in the cloud container. The mismatch only showed up during certain tensor operations that my unit tests didn't cover because they ran on CPU.

The solution was building a custom Docker image with pinned versions of everything: CUDA, cuDNN, PyTorch, and every dependency down to the patch level. I now version-lock every single dependency in every ML project, and I test the Docker container locally before pushing it anywhere.

```dockerfile
# The Dockerfile that finally worked after attempt #7
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
RUN pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
# Every. Single. Version. Pinned.
```
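Pinning alone isn't enough; the container should also refuse to start if the runtime doesn't match the pins. A minimal startup assertion might look like this (the helper name and dict format are my own; in the real job, `actual` would be collected from torch.__version__, torch.version.cuda, and so on):

```python
def assert_versions(expected, actual):
    """Compare pinned versions against the runtime environment at startup.

    expected/actual: dicts like {"torch": "2.0.1+cu118", "cuda": "11.8"}.
    Raises with every mismatch listed, so the job dies loudly and early
    instead of crashing four days later in a tensor op.
    """
    mismatches = [
        f"{name}: expected {want}, got {actual.get(name)}"
        for name, want in expected.items()
        if actual.get(name) != want
    ]
    if mismatches:
        raise RuntimeError("version mismatch: " + "; ".join(mismatches))
```

Had this existed, the cuDNN mismatch would have been a one-line error at launch instead of a hex address mid-training.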


Problem 5: Monitoring Didn't Exist Until I Built It

Once the model was finally training in the cloud, I had a new problem: I couldn't see what it was doing.

Vertex AI gives you basic job status (running, succeeded, failed) and some system metrics (CPU/memory usage). It doesn't give you GAN-specific metrics. It doesn't tell you if your generator is mode-collapsing. It doesn't show you whether the discriminator has overwhelmed the generator. It definitely doesn't show you whether the synthetic patient data actually looks like real patient data.

I built a custom monitoring layer that:

  • Logged generator and discriminator losses to Cloud Logging at every epoch

  • Computed distribution statistics (mean, variance, skewness, kurtosis) for each synthetic feature and compared them against real data distributions

  • Ran a memorization check every N epochs, computing nearest-neighbor distances between synthetic records and the training set to catch the model copying real patients

  • Generated sample synthetic records and saved them to GCS for manual inspection

  • Sent Slack alerts if the discriminator accuracy exceeded 95% (a sign the generator has collapsed)

This monitoring infrastructure took two weeks to build. It's not mentioned in any paper I've read about GANs. It's not part of any tutorial. But without it, I would have been training blind, spending money on GPU hours without knowing whether the model was actually working.
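The collapse alert in the last bullet reduces to a small per-epoch check. This is a sketch with made-up thresholds and names; the real version also compared full per-feature distributions, not just standard deviations:

```python
def should_alert(disc_accuracy, synthetic_std, real_std,
                 acc_threshold=0.95, std_ratio_floor=0.1):
    """Return a list of alert messages for one training epoch.

    disc_accuracy: discriminator accuracy on the latest batch.
    synthetic_std / real_std: per-feature standard deviations; a
    collapsed generator produces near-zero variance in every feature.
    """
    alerts = []
    if disc_accuracy > acc_threshold:
        alerts.append(
            f"discriminator accuracy {disc_accuracy:.2f} > {acc_threshold}: "
            "generator is likely collapsing"
        )
    for i, (s, r) in enumerate(zip(synthetic_std, real_std)):
        if r > 0 and s / r < std_ratio_floor:
            alerts.append(f"feature {i}: synthetic std {s:.3g} vs real {r:.3g}")
    return alerts
```

An empty list means "keep spending GPU hours"; anything else went straight to Slack.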


Problem 6: Serving Is Its Own Project

Training the model is one thing. Serving it, making it available as an API endpoint that can generate synthetic patient records on demand, is an entirely separate engineering challenge.

Vertex AI's prediction endpoints want models in a specific serialized format. My GAN had two networks (generator and discriminator) that needed to be loaded together, plus preprocessing logic and postprocessing that converted raw tensor outputs back into clinically valid patient records.

I ended up writing a custom prediction container that:

  • Loaded the generator weights on startup

  • Accepted JSON requests specifying the number of records and optional conditioning parameters (age range, diagnosis codes)

  • Ran the generation through the generator network

  • Applied postprocessing to convert outputs into properly formatted, clinically coded patient records

  • Validated outputs against the same schema rules I'd built for the preprocessing pipeline

  • Returned the synthetic records as structured JSON

Getting the cold-start time under 30 seconds was its own sub-project. The model weights were large, and the container needed to load PyTorch, CUDA runtime, and the full model into GPU memory before it could serve a single request.
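The request/response shape of that container looks roughly like this, with the generator network stubbed behind a callable. The "instances"/"predictions" wrapping matches Vertex AI's prediction format; the field names inside each instance are my own:

```python
def handle_predict(request, generate_fn, max_records=1000):
    """Handle one prediction request.

    request: dict like {"instances": [{"n_records": 10,
                                       "age_range": [40, 60]}]}
    generate_fn: callable(n, conditioning) -> list of patient-record
                 dicts; wraps the generator network, postprocessing,
                 and schema validation.
    """
    predictions = []
    for instance in request.get("instances", []):
        n = int(instance.get("n_records", 1))
        if not 1 <= n <= max_records:
            raise ValueError(f"n_records must be in [1, {max_records}], got {n}")
        # Everything except n_records is treated as conditioning.
        conditioning = {k: v for k, v in instance.items() if k != "n_records"}
        predictions.append(generate_fn(n, conditioning))
    return {"predictions": predictions}
```

Keeping the handler thin like this meant the expensive parts (loading weights, warming the GPU) could happen once at startup, which is where the cold-start fight was won.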


What Google Already Told Us (and We Didn't Listen)

Google published a paper in 2015 with a diagram that's become famous in MLOps circles. It shows a large rectangle representing an entire ML system, with a tiny box in the center labeled "ML Code." Everything else, the data collection, data verification, feature extraction, configuration, analysis tools, process management, serving infrastructure, and monitoring, dwarfs the model itself.

That diagram is over a decade old now. The industry still hasn't internalized it.

We still interview candidates almost exclusively on model knowledge. We still structure ML courses around architectures and loss functions rather than systems and pipelines. We still treat deployment as an afterthought: a one-slide bullet point at the end of a research presentation.

But every production ML system I've built or worked on has confirmed the same ratio: roughly 10% model code, 90% everything else.


What I'd Tell You If You're Starting Out

If you're a student or early-career ML engineer reading this, here's my honest advice:

Get comfortable with Docker. Not "I've heard of Docker" comfortable. I mean "I can debug a multi-stage build at midnight" comfortable. Containerization is the language of deployment, and if you can't speak it fluently, your models stay in notebooks.

Learn a cloud platform deeply. Not the marketing overview. The actual IAM model, the quota system, the networking layer, the storage options, and the billing model. Pick one (GCP, AWS, Azure) and learn it well enough to debug things when they break at 1 AM.

Build monitoring from day one. Not as an afterthought. Not "I'll add monitoring later." The monitoring is part of the system. If you can't observe what your model is doing in production, you don't have a production model. You have an expensive random number generator.

Version-lock everything. Every dependency. Every base image. Every CUDA version. "It works on my machine" is not a deployment strategy.

Accept that this is the job. The model is the fun part. The infrastructure is the work. And the work is where the value is created because a model that can't be deployed, monitored, and maintained is a model that doesn't exist outside a Jupyter notebook.


The Punchline

We romanticize the model in ML. We show architectures and loss curves and cherry-picked outputs. We post about achieving state-of-the-art on a benchmark. We don't show the container debugging, the data cleaning, the deployment failures, the 1 AM quota fixes, the four days lost to a CUDA mismatch.

But that's where the actual engineering happens. That's the 90%.

The model is the easy part. Shipping it is the job.