Introduction to MLOps
Machine Learning Operations (MLOps) is the discipline of operationalizing machine learning models — taking them from research experiments to reliable, scalable, and maintainable production systems. It combines DevOps principles with machine learning-specific requirements to address the unique challenges of deploying and managing ML systems.
The gap between developing a model in a notebook and deploying it in production is vast. A model that achieves 99% accuracy in a controlled environment may fail catastrophically in production due to data drift, infrastructure issues, or unexpected inputs. MLOps provides the practices, tools, and culture to bridge this gap.
1. The ML Lifecycle: From Experiment to Production
The ML lifecycle spans data collection, experimentation, training, validation, deployment, and ongoing monitoring. Unlike a traditional software release, it is a loop: production feedback such as drift or degraded performance triggers retraining and redeployment.
2. Data Versioning and Management
Data is the most critical and often most problematic component of ML systems. MLOps requires rigorous data management practices.
Data Management Best Practices
- Data Lineage: Tracking where data comes from and how it's transformed
- Data Validation: Schema validation, quality checks, anomaly detection
- Feature Store: Centralized repository for reusable features
- Data Catalogs: Discoverable, documented data assets
# DVC (Data Version Control) example

# Track data files
dvc add data/training.csv
git add data/training.csv.dvc data/.gitignore
git commit -m "Add dataset v1"

# Push to remote storage
dvc push

# Reproduce pipeline with specific data version
git checkout v1.0
dvc checkout
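The data validation practice listed above can start as plain Python before reaching for a dedicated framework. This is a minimal sketch; the column names and the age-range rule are hypothetical, not from any particular dataset.

```python
# Minimal data validation sketch (hypothetical schema and rules)
import csv
import io

EXPECTED_COLUMNS = {"user_id", "age", "avg_amount"}  # assumed schema

def validate_rows(csv_text):
    """Check schema and a basic quality rule; return a list of problems."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if set(reader.fieldnames or []) != EXPECTED_COLUMNS:
        problems.append(f"schema mismatch: {reader.fieldnames}")
        return problems
    for i, row in enumerate(reader):
        age = float(row["age"])
        if not (0 <= age <= 120):  # simple range / anomaly check
            problems.append(f"row {i}: implausible age {age}")
    return problems

sample = "user_id,age,avg_amount\n1,34,52.0\n2,431,10.5\n"
print(validate_rows(sample))  # flags row 1: implausible age 431.0
```

In practice the same checks would run as a pipeline step (as in the CI workflow later in this article), failing the build before a bad dataset reaches training.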
3. Experiment Tracking
ML development involves countless experiments with different hyperparameters, architectures, and data versions. Experiment tracking provides organization and reproducibility.
# MLflow experiment tracking
import mlflow

mlflow.set_experiment("churn_prediction")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # Log metrics
    mlflow.log_metric("accuracy", 0.92)
    mlflow.log_metric("f1_score", 0.89)

    # Log artifacts
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_artifact("confusion_matrix.png")
4. Model Versioning and Registry
The model registry is the single source of truth for production models, tracking version history, metadata, and deployment status.
# Model registry with MLflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "churn_classifier"

# Register model (fill in the training run's ID; the source URI
# takes the form "runs:/<run_id>/model")
model_version = client.create_model_version(
    name=model_name,
    source="runs://model",
    run_id=""
)

# Transition stage
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging"
)

# Promote to production
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Production"
)
5. CI/CD for Machine Learning
CI/CD pipelines automate testing, validation, and deployment of ML models, ensuring reliability and reducing manual errors.
# GitHub Actions workflow for ML
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/
      - name: Data validation
        run: python scripts/validate_data.py
      - name: Model training & validation
        run: python scripts/train.py --validate-only
6. Model Deployment Strategies
Deployment Patterns
- Canary Deployment: Gradual rollout to small percentage of traffic
- Blue-Green Deployment: Two environments for zero-downtime switching
- A/B Testing: Compare multiple model versions
- Shadow Deployment: Run new model alongside production without serving
# Model serving with FastAPI
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

class PredictionRequest(BaseModel):
    features: list

@app.post("/predict")
async def predict(request: PredictionRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
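The canary pattern from the list above often reduces to deterministic traffic splitting: hash a stable request attribute into a bucket and send a fixed fraction to the new model. This is an illustrative sketch; the 5% split and the variant labels are assumptions, not from any particular serving stack.

```python
# Canary routing sketch: send a fixed fraction of traffic to the new model.
# The 5% split and the "canary"/"stable" labels are illustrative assumptions.
import hashlib

CANARY_FRACTION = 0.05

def route(request_id: str) -> str:
    """Deterministically map a request to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] / 255.0  # roughly uniform value in [0, 1]
    return "canary" if bucket < CANARY_FRACTION else "stable"

# The same request ID always maps to the same variant, which keeps
# behavior consistent for a given user during the rollout.
print(route("req-12345"))
```

Hashing rather than random sampling makes the split reproducible, so canary metrics can later be joined back to the exact requests that produced them.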
7. Model Monitoring
Monitoring is critical for maintaining model performance in production. Key monitoring dimensions:
Monitoring Types
- Data Drift: Changes in input data distribution (PSI, KL divergence)
- Concept Drift: Changes in relationship between inputs and outputs
- Performance Monitoring: Tracking accuracy, precision, recall when ground truth available
- Operational Metrics: Latency, throughput, error rates, resource usage
# Drift detection with evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")

# Population Stability Index (PSI)
import numpy as np

def calculate_psi(expected, actual, bins=10):
    # Bin both samples with edges derived from the expected distribution,
    # so the two histograms are directly comparable
    counts_expected, edges = np.histogram(expected, bins=bins)
    counts_actual, _ = np.histogram(actual, bins=edges)
    eps = 1e-6  # guard against log(0) and division by zero in empty bins
    expected_percents = counts_expected / len(expected) + eps
    actual_percents = counts_actual / len(actual) + eps
    psi = np.sum((actual_percents - expected_percents) *
                 np.log(actual_percents / expected_percents))
    return psi
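A quick way to sanity-check a PSI implementation is to compare a reference sample against both an identical and a shifted distribution: the shifted one should score higher. This self-contained sketch uses shared bin edges and synthetic normal samples; the shift size and sample counts are arbitrary choices for illustration.

```python
# PSI sanity check: a shifted distribution should score higher than a
# matching one. Thresholds like 0.1 / 0.25 are common rules of thumb
# for "some drift" / "significant drift".
import numpy as np

def psi(reference, current, bins=10):
    # Bin both samples with edges derived from the reference
    counts_ref, edges = np.histogram(reference, bins=bins)
    counts_cur, _ = np.histogram(current, bins=edges)
    eps = 1e-6  # avoid log(0) in empty bins
    p = counts_ref / len(reference) + eps
    q = counts_cur / len(current) + eps
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(0.5, 1.0, 5000)

print(psi(reference, same))     # small: distributions match
print(psi(reference, shifted))  # larger: drift detected
```

Running checks like this against known-shifted synthetic data is itself a useful unit test for a monitoring pipeline.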
8. Feature Store
A feature store centralizes feature engineering, ensuring consistency between training and serving, and enabling feature reuse across models.
# Feast feature store example
from feast import FeatureStore

# Initialize feature store
store = FeatureStore(repo_path="feature_repo")

# Get training data
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:age",
        "user_features:location",
        "transaction_features:avg_amount"
    ]
).to_df()

# Get online features for serving
features = store.get_online_features(
    features=[
        "user_features:age",
        "user_features:location"
    ],
    entity_rows=[{"user_id": 12345}]
).to_dict()
9. LLMOps: MLOps for Large Language Models
LLMs introduce unique challenges that extend traditional MLOps practices.
- Prompt Management: Version control for prompts, templates, and chains
- Cost Monitoring: Tracking token usage and API costs
- Hallucination Detection: Monitoring for factual inaccuracies
- Safety Filtering: Preventing harmful or inappropriate outputs
- RAG Evaluation: Assessing retrieval-augmented generation quality
# LLM monitoring with LangSmith
import langsmith

client = langsmith.Client()

# Log LLM interaction
client.create_run(
    name="chat_completion",
    run_type="llm",
    inputs={"prompt": "What is MLOps?"},
    outputs={"response": "MLOps is..."},
    metadata={"model": "gpt-4", "tokens": 125}
)

# Evaluate with custom criteria
from langsmith.evaluation import evaluate

evaluate(
    lambda inputs: llm.predict(inputs["prompt"]),
    data=test_dataset,
    evaluators=[accuracy_evaluator, safety_evaluator]
)
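Prompt management from the list above can start as simply as treating prompts like any other versioned artifact. The registry below is a hypothetical in-memory sketch, not a real library API; a production setup would back it with git or a database.

```python
# Hypothetical prompt registry sketch: prompts as versioned, hashed artifacts.
import hashlib

class PromptRegistry:
    def __init__(self):
        self._versions = {}  # name -> list of (content_hash, template)

    def register(self, name: str, template: str) -> str:
        """Store a new prompt version; return its content hash."""
        digest = hashlib.sha256(template.encode()).hexdigest()[:12]
        self._versions.setdefault(name, []).append((digest, template))
        return digest

    def latest(self, name: str) -> str:
        """Return the most recently registered template for a prompt."""
        return self._versions[name][-1][1]

registry = PromptRegistry()
registry.register("summarize", "Summarize the text: {text}")
registry.register("summarize", "Summarize in 3 bullets: {text}")
print(registry.latest("summarize"))  # newest template wins
```

Content-hashing each version lets a logged LLM interaction record exactly which prompt produced it, mirroring how model versions are tracked in a registry.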
10. Infrastructure and Scalability
Infrastructure Options
- On-Premises: Full control, high upfront cost, maintenance overhead
- Cloud (AWS, GCP, Azure): Managed services, scalability, pay-as-you-go
- Kubernetes: Container orchestration for scalable model serving
- Serverless: Auto-scaling, no infrastructure management
# Kubernetes deployment for model serving
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model
  template:
    metadata:
      labels:
        app: model
    spec:
      containers:
        - name: model-container
          image: model:v1.2.3
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
11. Governance and Compliance
Key Governance Requirements
- Auditability: Complete lineage from data to model to decision
- Explainability: Ability to explain model decisions (GDPR "right to explanation")
- Fairness Monitoring: Continuous bias detection
- Access Control: Who can deploy models, access data, view predictions
- Version Control: All artifacts versioned and traceable
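Auditability in practice means every prediction can be traced back to a model version and a data snapshot. This is a minimal sketch of such a record; the field names are hypothetical, and a real system would append these to an immutable store.

```python
# Minimal audit-trail sketch: one traceable record per prediction.
# Field names are hypothetical; real systems would write these to an
# append-only store for later review.
import json
from datetime import datetime, timezone

def audit_record(model_version, data_version, inputs, prediction):
    """Serialize the lineage of a single prediction as JSON."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "data_version": data_version,
        "inputs": inputs,
        "prediction": prediction,
    })

record = audit_record("churn_classifier:v3", "dataset:v1.0",
                      {"age": 34}, "will_churn")
print(record)
```

Because the record names both the model and data versions, an auditor can replay any decision against the exact artifacts that produced it.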
12. MLOps Tools and Platforms
The sections above illustrate one representative toolchain: DVC for data versioning, MLflow for experiment tracking and the model registry, Feast as a feature store, Evidently for drift detection, LangSmith for LLM observability, FastAPI for serving, and Kubernetes for scalable infrastructure. Comparable alternatives exist for each layer, and most teams mix managed platform services with open-source components.
13. Best Practices and Anti-Patterns
Best Practices
- Start Simple: Begin with manual processes, automate as pain points emerge
- Version Everything: Code, data, models, configurations
- Test Thoroughly: Unit tests, integration tests, model validation tests
- Monitor Continuously: Drift, performance, operational metrics
- Document Assumptions: Model limitations, expected input ranges, failure modes
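The "test thoroughly" practice includes model validation tests that gate deployment. This sketch checks a candidate against both an absolute metric floor and the incumbent model; the 0.85 threshold and the metric values are illustrative assumptions.

```python
# Model validation gate sketch: block promotion unless the candidate
# clears both an absolute floor and the incumbent model.
# The 0.85 floor and all metric values are illustrative assumptions.
MIN_ACCURACY = 0.85

def should_promote(candidate_accuracy, incumbent_accuracy):
    """Return True only if the candidate clears both gates."""
    return (candidate_accuracy >= MIN_ACCURACY
            and candidate_accuracy >= incumbent_accuracy)

assert should_promote(0.92, 0.89)      # better, and above the floor
assert not should_promote(0.84, 0.80)  # below the absolute floor
assert not should_promote(0.88, 0.90)  # regression vs the incumbent
print("all validation gates behave as expected")
```

A gate like this typically runs as the final step of the CI pipeline shown earlier, so a weaker model can never be promoted by accident.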
Anti-Patterns to Avoid
- Shadow ML: Unmanaged models running in production without oversight
- Notebook-as-API: Jupyter notebooks deployed directly as services
- Model Silos: No sharing of features, code, or learnings across teams
- Over-engineering: Building complex infrastructure before proving value
- Ignoring Monitoring: Deploying models without observability
Conclusion
MLOps transforms machine learning from experimental craft to reliable engineering discipline. By applying DevOps principles to ML-specific challenges, organizations can deploy models faster, maintain them more reliably, and scale ML impact across the enterprise.
The journey to mature MLOps is incremental — starting with manual processes, adding automation, and ultimately achieving fully automated ML pipelines. Regardless of where you start, the principles of versioning, testing, monitoring, and reproducibility apply at every stage.