
Building a Production-Ready ML System: From Jupyter Notebook to Docker

Muhammad Ali Syed
August 20, 2025
10 min read

How I learned that sometimes the best model isn't the most complex one, and why that's actually a valuable lesson.

Introduction

When I set out to build a customer churn prediction system, I thought I'd be showcasing cutting-edge deep learning models and complex ensemble techniques. Instead, I learned one of the most valuable lessons in machine learning engineering: sometimes a simple logistic regression beats XGBoost, and that's perfectly fine if you understand why.

This is the story of building a production-ready ML system from scratch, complete with all the messy realities, unexpected discoveries, and "aha!" moments that textbooks don't tell you about.

The Challenge

Telecommunications companies lose millions when customers churn. The challenge wasn't just to predict who would leave, but to build a system that could:

  • Make predictions in real-time
  • Optimise for business value (not just accuracy)
  • Be maintainable and deployable
  • Handle messy, real-world data

Stage 1: Setting Up Like a Professional

I started by creating a proper project structure, not just throwing code into a Jupyter notebook:

churn-prediction/
├── src/
├── notebooks/
├── tests/
├── data/
└── models/

Setting up a Conda virtual environment was crucial for reproducibility:

conda create -n ml-dev-310 python=3.10
conda activate ml-dev-310

Lesson learned: Taking 30 minutes to set up properly saves hours of debugging later.

Stage 2: The EDA That Actually Revealed Something

Instead of just running df.describe() and calling it a day, I created a professional EDA notebook with proper structure and business context. This is where I discovered the first interesting pattern:

  • Customers who churned had an average tenure of 17 months
  • Customers who stayed had an average tenure of 37 months
  • Churned customers were paying about £13 more on average

But the real gold was discovering the "premium but unprotected" segment: customers with fiber optic internet but no protection services. These customers were churning at alarming rates.
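The post doesn't show the query itself, but a check for this segment might look like the following sketch, assuming the common Telco-style column names (InternetService, OnlineSecurity, DeviceProtection, Churn) on the cleaned dataframe df:

# Churn rate inside the "premium but unprotected" segment
premium_unprotected = df[
    (df['InternetService'] == 'Fiber optic')
    & (df['OnlineSecurity'] == 'No')
    & (df['DeviceProtection'] == 'No')
]
print(premium_unprotected['Churn'].value_counts(normalize=True))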

Stage 3: When Real Data Fights Back

The TotalCharges Mystery

My first reality check came with a cryptic error:

TypeError: can only concatenate str (not "int") to str

The TotalCharges column was stored as strings! This is exactly the kind of thing that happens in real projects. Some investigation revealed:

import pandas as pd

# The culprit: empty strings in numeric columns (df is the raw dataframe)
non_numeric = df[pd.to_numeric(df['TotalCharges'], errors='coerce').isna()]
print(f"Found {len(non_numeric)} rows with non-numeric TotalCharges")

Solution: Created a robust data cleaning function that handles these edge cases gracefully.
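The function itself isn't in the post; here is a minimal sketch of the idea (the name and the fill strategy are my assumptions, motivated by the blank rows turning out to be tenure-0 customers in the next section):

import pandas as pd

def clean_total_charges(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce TotalCharges to numeric and give blanks a sensible default."""
    df = df.copy()
    # errors='coerce' turns the empty strings into NaN instead of raising
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    # Brand-new customers haven't been billed yet, so 0 is a defensible fill
    df['TotalCharges'] = df['TotalCharges'].fillna(0.0)
    return df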

The Case of the Persistent Nulls

Even after cleaning, I still had 11 null values. The issue? My tenure buckets didn't include 0. The bins are right-closed, so a tenure of exactly 0 falls outside the first (0, 6] interval and comes back as NaN:

# Before: bins=[0, 6, 12, 24, 48, 72]
# After:  bins=[-1, 6, 12, 24, 48, 72]  # Include 0 tenure
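A one-liner reproduces the failure mode (the bins syntax suggests pandas' pd.cut, whose intervals are right-closed by default):

import pandas as pd

# tenure 0 lands in no (0, 6] interval and becomes NaN with the original bins
print(pd.cut(pd.Series([0, 3, 70]), bins=[0, 6, 12, 24, 48, 72]))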

Lesson learned: Always validate your assumptions about data ranges.

Stage 4: The Model That Humbled Me

Here's where things got interesting. I built three models:

  1. Baseline Logistic Regression: 84.0% AUC
  2. XGBoost with tuning: 83.6% AUC
  3. LightGBM with regularization: 83.8% AUC

Wait, what? The simple model beat the complex ones?

This sent me down a debugging rabbit hole. Was it overfitting? Bad hyperparameters? No - the truth was simpler and more profound.

Stage 5: The Probability Distribution Detective Work

When I noticed that my business optimiser always selected a 0.1 threshold regardless of parameters, I knew something was wrong. Plotting the probability distributions revealed the issue:

[Figure: predicted churn probability distributions for churners vs. non-churners]

The model was extremely conservative:

  • No Churn: Heavily concentrated near 0 (good!)
  • Churn: Spread across 0.2-0.8 with low density (bad!)

This is classic behaviour with imbalanced datasets (74% no churn, 26% churn).

Attempts to Fix:

  1. Class weights: Helped a bit
  2. SMOTE: Marginal improvement
  3. Polynomial features: Still overlapping distributions
  4. Ensemble methods: No significant gains
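For reference, the first two fixes might look like this sketch, using scikit-learn and imbalanced-learn on a synthetic stand-in for the real training set:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic stand-in mirroring the 74/26 class imbalance
X_train, y_train = make_classification(
    n_samples=1000, weights=[0.74, 0.26], random_state=42
)

# Fix 1: class weights penalise minority-class mistakes more heavily
clf = LogisticRegression(class_weight='balanced', max_iter=1000)

# Fix 2: SMOTE synthesises minority-class examples before fitting
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf.fit(X_res, y_res)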

The hard truth? The features simply didn't contain enough signal to perfectly separate the classes.

Stage 6: Embracing Reality and Optimising for Business Value

Instead of chasing marginal ML metrics improvements, I pivoted to what actually matters: business value.

I built a custom business optimiser that considers:

  • Cost of retention campaigns: $10 per customer
  • Success rate of retention: 30%
  • Customer lifetime value: Monthly charges × 12

The results were eye-opening:

# At 0.5 threshold: Losing money!
# At optimised 0.35 threshold: 245% ROI
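The heart of the optimiser is a simple expected-value calculation over candidate thresholds. Here is a sketch using the post's cost assumptions; the function name and the synthetic placeholder arrays are mine, not the actual implementation:

import numpy as np

def expected_profit(y_true, y_prob, monthly_charges, threshold,
                    campaign_cost=10.0, success_rate=0.30):
    """Expected net value of targeting everyone above the threshold."""
    targeted = y_prob >= threshold
    cost = targeted.sum() * campaign_cost   # every targeted customer costs $10
    clv = monthly_charges * 12              # lifetime value = monthly charges x 12
    # Value is only recovered for true churners we successfully retain
    saved = success_rate * clv[targeted & (y_true == 1)].sum()
    return saved - cost

# Synthetic placeholders; in the real system these come from the held-out set
rng = np.random.default_rng(42)
y_true = rng.binomial(1, 0.26, 1000)
y_prob = np.clip(0.15 + 0.45 * y_true + rng.normal(0, 0.15, 1000), 0, 1)
monthly_charges = rng.uniform(20, 110, 1000)

thresholds = np.arange(0.05, 0.95, 0.05)
profits = [expected_profit(y_true, y_prob, monthly_charges, t) for t in thresholds]
print(f"Best threshold: {thresholds[int(np.argmax(profits))]:.2f}")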

Stage 7: From Notebook to Production

This is where the real engineering began:

Centralized Logging

Created a proper logging system that saves to both console and files:

logger.info("Model training completed", extra={
    'auc': 0.84,
    'threshold': 0.35,
    'expected_roi': 2.45
})
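The setup itself isn't shown in the post; a minimal version with the standard library, writing to both console and a file as described (the log path is illustrative), would be:

import logging
from pathlib import Path

def get_logger(name: str, logfile: str = "logs/churn.log") -> logging.Logger:
    """Logger that writes to both the console and a file."""
    Path(logfile).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(logfile)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger

logger = get_logger("churn")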

Production Pipeline

Built a modular system with:

  • Feature engineering pipeline that can be pickled
  • Model versioning and metadata tracking
  • Comprehensive error handling
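A sketch of the picklable-pipeline idea with scikit-learn and joblib; the step names, file path, and placeholder data are illustrative:

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=1000, random_state=42)  # placeholder data

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Preprocessing and model travel together as a single versioned artefact
joblib.dump(pipeline, "models/churn_pipeline_v1.joblib")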

API Development

Created a Flask API with proper endpoints:

GET  /health
POST /predict
POST /predict_batch
POST /update_threshold
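A minimal sketch of how the health and predict endpoints might be wired, loading the pickled pipeline from above; this is not the full implementation, and the payload shape is an assumption:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/churn_pipeline_v1.joblib")  # illustrative path

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy", "model_version": "production",
                    "model_loaded": model is not None})

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if payload is None or "features" not in payload:
        return jsonify({"error": "invalid or missing JSON body"}), 400
    # The real code validates fields and runs the feature-engineering pipeline first
    proba = float(model.predict_proba([payload["features"]])[0, 1])
    return jsonify({"churn_probability": proba, "churn": proba >= 0.35})

if __name__ == "__main__":
    app.run(port=5000)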

The moment of truth came when I ran:

curl http://localhost:5000/health

And got back:

{"status": "healthy", "model_version": "production", "model_loaded": true}

Success! The API was working!

Stage 8: The Testing Saga

Writing tests revealed even more edge cases:

  • What if required fields are missing?
  • What if TotalCharges is missing entirely?
  • How do we handle malformed JSON?

Each failing test taught me something about production readiness.
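A couple of those tests, sketched with pytest and Flask's test client (the api module name is hypothetical):

import pytest
from api import app  # hypothetical module exposing the Flask app

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as client:
        yield client

def test_health_reports_model_loaded(client):
    resp = client.get("/health")
    assert resp.status_code == 200
    assert resp.get_json()["model_loaded"] is True

def test_predict_rejects_malformed_json(client):
    resp = client.post("/predict", data="not json", content_type="application/json")
    assert resp.status_code == 400

def test_predict_requires_features_field(client):
    resp = client.post("/predict", json={})
    assert resp.status_code == 400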

Key Takeaways

  1. Simple models can be the right choice: My logistic regression beat XGBoost because the dataset was small and the relationships were largely linear.
  2. Data quality > Model complexity: Hours spent on feature engineering and data cleaning paid off more than hyperparameter tuning.
  3. Business metrics > ML metrics: A model with 84% AUC that optimises for ROI beats a 90% AUC model that doesn't.
  4. Production is 90% of the work: Getting from notebook to API required more code than the actual modeling.
  5. Logging is not optional: When debugging production issues, logs are your lifeline.

What I'd Do Differently

  • Start with the business optimiser earlier
  • Invest more in data quality checks upfront
  • Build the API structure parallel to model development
  • Add data drift monitoring from the beginning

Conclusion

This project taught me that ML engineering is about so much more than choosing the right algorithm. It's about:

  • Understanding the business problem deeply
  • Building robust, maintainable systems
  • Making pragmatic decisions based on constraints
  • Documenting and testing thoroughly

The final system may use "just" logistic regression, but it's production-ready, well-tested, and optimised for real business value. Sometimes that's exactly what machine learning engineering looks like in the real world.

Want to explore the code? Check out the GitHub repository for the complete implementation.

#mlops #python #data-science #machine-learning #production-ml #api-development #docker #flask #scikit-learn #ml-engineering #feature-engineering #business-optimisation #testing #real-world-ml
