Building a Production-Ready ML System: From Jupyter Notebook to Docker
How I learned that sometimes the best model isn't the most complex one, and why that's actually a valuable lesson.
Introduction
When I set out to build a customer churn prediction system, I thought I'd be showcasing cutting-edge deep learning models and complex ensemble techniques. Instead, I learned one of the most valuable lessons in machine learning engineering: sometimes a simple logistic regression beats XGBoost, and that's perfectly fine if you understand why.
This is the story of building a production-ready ML system from scratch, complete with all the messy realities, unexpected discoveries, and "aha!" moments that textbooks don't tell you about.
The Challenge
Telecommunications companies lose millions when customers churn. The challenge wasn't just to predict who would leave, but to build a system that could:
- Make predictions in real-time
- Optimise for business value (not just accuracy)
- Be maintainable and deployable
- Handle messy, real-world data
Stage 1: Setting Up Like a Professional
I started by creating a proper project structure, not just throwing code into a Jupyter notebook:
churn-prediction/
├── src/
├── notebooks/
├── tests/
├── data/
└── models/
Setting up a Conda virtual environment was crucial for reproducibility:
conda create -n ml-dev-310 python=3.10
conda activate ml-dev-310
Lesson learned: Taking 30 minutes to set up properly saves hours of debugging later.
Stage 2: The EDA That Actually Revealed Something
Instead of just running df.describe() and calling it a day, I created a professional EDA notebook with proper structure and business context. This is where I discovered the first interesting pattern:
- Customers who churned had an average tenure of 17 months
- Customers who stayed had an average tenure of 37 months
- Churned customers were paying about £13 more on average
But the real gold was discovering the "premium but unprotected" segment: customers with fiber optic internet but no protection services. These customers were churning at alarming rates.
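The post doesn't show the exact query, but surfacing that segment is a short pandas filter; this sketch assumes the standard Telco churn column names (InternetService, OnlineSecurity, DeviceProtection, Churn) and a hypothetical file path:

# Sketch: churn rate for the "premium but unprotected" segment (column names assumed)
import pandas as pd

df = pd.read_csv("data/telco_churn.csv")  # hypothetical path
segment = df[
    (df["InternetService"] == "Fiber optic")
    & (df["OnlineSecurity"] == "No")
    & (df["DeviceProtection"] == "No")
]
print(f"Overall churn rate: {(df['Churn'] == 'Yes').mean():.1%}")
print(f"Segment churn rate: {(segment['Churn'] == 'Yes').mean():.1%}")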
Stage 3: When Real Data Fights Back
The TotalCharges Mystery
My first reality check came with a cryptic error:
TypeError: can only concatenate str (not "int") to str
The TotalCharges column was stored as strings! This is exactly the kind of thing that happens in real projects. Some investigation revealed:
# The culprit: empty strings in numeric columns
non_numeric = df[pd.to_numeric(df['TotalCharges'], errors='coerce').isna()]
print(f"Found {len(non_numeric)} rows with non-numeric TotalCharges")
Solution: Created a robust data cleaning function that handles these edge cases gracefully.
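The actual function isn't reproduced in the post, but a minimal version of that cleaning step might look like this (the zero-fill strategy is an assumption):

# Hypothetical sketch of the TotalCharges cleaning step
import pandas as pd

def clean_total_charges(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Empty strings become NaN instead of breaking downstream arithmetic
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Brand-new customers have no charges yet, so 0 is a reasonable fill
    df["TotalCharges"] = df["TotalCharges"].fillna(0.0)
    return df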
The Case of the Persistent Nulls
Even after cleaning, I still had 11 null values. The issue? My tenure buckets didn't include 0:
# Before: bins=[0, 6, 12, 24, 48, 72]
# After: bins=[-1, 6, 12, 24, 48, 72] # Include 0 tenure
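In code, the bucketing looks roughly like this (labels are illustrative):

# Sketch of the tenure bucketing with the corrected bins
df["tenure_bucket"] = pd.cut(
    df["tenure"],
    bins=[-1, 6, 12, 24, 48, 72],  # -1 so that tenure == 0 falls into the first bucket
    labels=["0-6", "7-12", "13-24", "25-48", "49-72"],
)
assert df["tenure_bucket"].isna().sum() == 0  # the 11 stray nulls are gone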
Lesson learned: Always validate your assumptions about data ranges.
Stage 4: The Model That Humbled Me
Here's where things got interesting. I built three models:
- Baseline Logistic Regression: 84.0% AUC
- XGBoost with tuning: 83.6% AUC
- LightGBM with regularization: 83.8% AUC
Wait, what? The simple model beat the complex ones?
This sent me down a debugging rabbit hole. Was it overfitting? Bad hyperparameters? No - the truth was simpler and more profound.
Stage 5: The Probability Distribution Detective Work
When I noticed that my business optimiser always selected a 0.1 threshold regardless of parameters, I knew something was wrong. Plotting the probability distributions revealed the issue:
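The figure itself isn't embedded here, but the plot is essentially a per-class histogram of predicted probabilities, roughly like this (variable names are illustrative):

# Sketch of the diagnostic plot: predicted probabilities split by true class
import matplotlib.pyplot as plt

probs = model.predict_proba(X_test)[:, 1]  # assumes a fitted sklearn-style classifier
plt.hist(probs[y_test == 0], bins=40, alpha=0.5, density=True, label="No churn")
plt.hist(probs[y_test == 1], bins=40, alpha=0.5, density=True, label="Churn")
plt.xlabel("Predicted churn probability")
plt.ylabel("Density")
plt.legend()
plt.show()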

The model was extremely conservative:
- No Churn: Heavily concentrated near 0 (good!)
- Churn: Spread across 0.2-0.8 with low density (bad!)
This is classic behavior with imbalanced datasets (74% no churn, 26% churn).
Attempts to Fix:
- Class weights: Helped a bit
- SMOTE: Marginal improvement
- Polynomial features: Still overlapping distributions
- Ensemble methods: No significant gains
The hard truth? The features simply didn't contain enough signal to perfectly separate the classes.
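For context, the first two attempts looked roughly like this (a sketch using the usual scikit-learn and imbalanced-learn APIs, not the project's exact code):

# Attempt 1: re-weight the minority class
from sklearn.linear_model import LogisticRegression
weighted_lr = LogisticRegression(class_weight="balanced", max_iter=1000)

# Attempt 2: SMOTE oversampling, kept inside a pipeline so it only touches training folds
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
smote_lr = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])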
Stage 6: Embracing Reality and Optimising for Business Value
Instead of chasing marginal ML metrics improvements, I pivoted to what actually matters: business value.
I built a custom business optimiser that considers:
- Cost of retention campaigns: $10 per customer
- Success rate of retention: 30%
- Customer lifetime value: Monthly charges × 12
The results were eye-opening:
# At 0.5 threshold: Losing money!
# At optimised 0.35 threshold: 245% ROI
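The optimiser itself isn't shown above, but its core is a simple expected-value sweep over candidate thresholds, something like this (variable names are illustrative; probs, y_test and monthly_charges come from the held-out set):

# Sketch of the business-value threshold sweep using the assumptions above
import numpy as np

CAMPAIGN_COST = 10.0        # retention offer cost per targeted customer
SUCCESS_RATE = 0.30         # share of true churners the campaign actually retains
clv = monthly_charges * 12  # customer lifetime value proxy

def expected_profit(threshold):
    targeted = probs >= threshold
    cost = CAMPAIGN_COST * targeted.sum()
    saved_value = SUCCESS_RATE * clv[targeted & (y_test == 1)].sum()
    return saved_value - cost

thresholds = np.arange(0.05, 0.95, 0.05)
best = max(thresholds, key=expected_profit)
print(f"Best threshold: {best:.2f}")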
Stage 7: From Notebook to Production
This is where the real engineering began:
Centralized Logging
Created a proper logging system that saves to both console and files:
logger.info("Model training completed", extra={
    'auc': 0.84,
    'threshold': 0.35,
    'expected_roi': 2.45
})
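The helper that builds the logger isn't shown in the post; a minimal console-plus-file setup might look like this (get_logger is a hypothetical name):

# Sketch of a logger that writes to both console and a file
import logging

def get_logger(name: str, log_file: str = "churn.log") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on re-import
        fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger

logger = get_logger("churn.training")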
Production Pipeline
Built a modular system with:
- Feature engineering pipeline that can be pickled (sketched below)
- Model versioning and metadata tracking
- Comprehensive error handling
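Here's a minimal sketch of what the picklable pipeline and metadata tracking might look like (column names assume the Telco schema; paths and fields are illustrative):

# Sketch: end-to-end picklable pipeline plus simple version metadata
import json
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical = ["Contract", "InternetService", "PaymentMethod"]

pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

joblib.dump(pipeline, "models/churn_pipeline.joblib")
with open("models/metadata.json", "w") as f:
    json.dump({"model_version": "production", "auc": 0.84, "threshold": 0.35}, f)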
API Development
Created a Flask API with proper endpoints:
GET /health
POST /predict
POST /predict_batch
POST /update_threshold
The moment of truth came when I ran:
curl http://localhost:5000/health
And got back:
{"status": "healthy", "model_version": "production", "model_loaded": true}
Success! The API was working!
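For a sense of what the serving code looks like, here's a stripped-down sketch of the health and predict endpoints (module and path names are illustrative, and the validation, batching, and logging described above are omitted):

# Sketch of the Flask app; error handling trimmed for brevity
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
pipeline = joblib.load("models/churn_pipeline.joblib")
THRESHOLD = 0.35

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy", "model_version": "production", "model_loaded": True})

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = pd.DataFrame([payload])  # single customer record
    prob = float(pipeline.predict_proba(features)[0, 1])
    return jsonify({"churn_probability": prob, "churn_prediction": prob >= THRESHOLD})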
Stage 8: The Testing Saga
Writing tests revealed even more edge cases:
- What if required fields are missing?
- What if TotalCharges is missing entirely?
- How do we handle malformed JSON?
Each failing test taught me something about production readiness.
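Two of those edge cases, expressed as pytest tests against the Flask test client (assuming the app lives in a module named api.py and returns 400 on bad input):

# Sketch of the edge-case tests; module name and status codes are assumptions
import pytest
from api import app

@pytest.fixture
def client():
    return app.test_client()

def test_missing_required_fields_returns_400(client):
    response = client.post("/predict", json={"tenure": 5})  # most fields missing
    assert response.status_code == 400

def test_malformed_json_returns_400(client):
    response = client.post("/predict", data="not json", content_type="application/json")
    assert response.status_code == 400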
Key Takeaways
- Simple models can be the right choice: My logistic regression beat XGBoost because the dataset was small and the relationships were largely linear.
- Data quality > Model complexity: Hours spent on feature engineering and data cleaning paid off more than hyperparameter tuning.
- Business metrics > ML metrics: A model with 84% AUC that optimises for ROI beats a 90% AUC model that doesn't.
- Production is 90% of the work: Getting from notebook to API required more code than the actual modeling.
- Logging is not optional: When debugging production issues, logs are your lifeline.
What I'd Do Differently
- Start with the business optimiser earlier
- Invest more in data quality checks upfront
- Build the API structure parallel to model development
- Add data drift monitoring from the beginning
Conclusion
This project taught me that ML engineering is about so much more than choosing the right algorithm. It's about:
- Understanding the business problem deeply
- Building robust, maintainable systems
- Making pragmatic decisions based on constraints
- Documenting and testing thoroughly
The final system may use "just" logistic regression, but it's production-ready, well-tested, and optimised for real business value. Sometimes that's exactly what machine learning engineering looks like in the real world.
Want to explore the code? Check out the GitHub repository for the complete implementation.