AI Model Deployment: From Training to Production
Deploying AI models to production requires more than training a good model. This guide covers the complete MLOps pipeline from trained model to scalable service.
Deployment Options
1. Cloud APIs
Providers:
- OpenAI API
- Anthropic API
- Google Vertex AI
- AWS SageMaker
- Azure ML
Pros:
- No infrastructure
- Auto-scaling
- Easy integration
Cons:
- Vendor lock-in
- Ongoing costs
- Limited customization
2. Self-Hosted
Options:
- Docker containers
- Kubernetes
- VM instances
- Bare metal
Pros:
- Full control
- Cost optimization
- Data privacy
Cons:
- Infrastructure management
- Scaling complexity
- DevOps required
3. Edge Deployment
For:
- Mobile apps
- IoT devices
- Browser inference
- Offline use
Tools:
- ONNX Runtime
- TensorFlow Lite
- Core ML
- Transformers.js
Deployment Pipeline
Step 1: Model Packaging
# Save model
import joblib
joblib.dump(model, 'model.pkl')

# Or for neural networks
import torch
torch.save(model.state_dict(), 'model.pth')
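Alongside the serialized weights, it helps to write a small metadata file next to the model; the versioning layout later in this guide assumes a `metadata.json` per version, and the fields below are one illustrative schema, not a standard:

```python
import json
from datetime import datetime, timezone

# Hypothetical metadata schema; adapt the fields to your own registry.
metadata = {
    "version": "1.0.0",
    "framework": "scikit-learn",
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "metrics": {"accuracy": 0.94},          # placeholder evaluation result
    "features": ["age", "income", "tenure"],  # expected input order
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Recording the expected feature order in particular saves debugging time when the serving code and training code drift apart.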
Step 2: Create API
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load('model.pkl')

@app.post("/predict")
def predict(data: dict):
    prediction = model.predict([data['features']])
    return {"prediction": prediction.tolist()}
Step 3: Containerize
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]
Step 4: Deploy
# Build
docker build -t my-model-api .
# Run locally
docker run -p 8000:8000 my-model-api
# Push to registry
docker push registry/my-model-api
# Deploy to cloud
kubectl apply -f deployment.yaml
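The `deployment.yaml` referenced above is not shown; a minimal sketch is below. The image name, replica count, and resource requests are placeholders to adapt:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: registry/my-model-api:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
```

The Deployment name `model-api` matches the `scaleTargetRef` used by the autoscaler in the next section.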
Scaling Strategies
Horizontal Scaling
# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Batching
Batch requests for efficiency:
from pydantic import BaseModel

class BatchData(BaseModel):
    features: list[list[float]]  # one feature vector per request

@app.post("/predict_batch")
def predict_batch(data: BatchData):
    predictions = model.predict(data.features)
    return {"predictions": predictions.tolist()}
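The endpoint above relies on the client to group requests. The server can also accumulate single requests into micro-batches itself; a minimal asyncio sketch, where `fake_predict` stands in for the real `model.predict` and the batch size and wait time are placeholder tunables:

```python
import asyncio

MAX_BATCH = 8      # flush once this many requests are queued...
MAX_WAIT = 0.01    # ...or after 10 ms, whichever comes first

def fake_predict(batch):
    # Stand-in for model.predict: doubles every input.
    return [x * 2 for x in batch]

async def batch_worker(queue):
    while True:
        items = [await queue.get()]          # block for the first request
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT
        while len(items) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        features = [f for f, _ in items]
        for (_, fut), pred in zip(items, fake_predict(features)):
            fut.set_result(pred)             # hand each caller its result

async def predict_one(queue, x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(*(predict_one(queue, i) for i in range(5)))
    worker.cancel()
    return results

print(asyncio.run(main()))  # -> [0, 2, 4, 6, 8]
```

The five concurrent callers are served by a single model call, which is where GPU inference in particular gains its throughput.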
Caching
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_prediction(features: tuple):
    # lru_cache requires hashable arguments, so pass features as a tuple
    return model.predict([list(features)])[0]
Monitoring
Key Metrics
| Metric | Target | Alert |
|---|---|---|
| Latency (p95) | <200ms | >500ms |
| Throughput | >1000 RPS | <500 RPS |
| Error Rate | <1% | >5% |
| Model Drift | <0.1 | >0.2 |
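The "Model Drift" row above needs a concrete statistic; one common choice is the Population Stability Index (PSI), which compares the binned feature distribution seen at serving time against the training distribution. A minimal sketch, with illustrative bin counts; the 0.1/0.2 thresholds in the comment are the usual rule of thumb and match the table's targets:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected/actual are raw counts per bin. Rule of thumb:
    PSI < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift.
    """
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_pct = max(e / e_total, eps)   # clamp to avoid log(0)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

# Identical shapes -> PSI ~ 0; a reversed distribution -> clear drift
print(psi([100, 200, 300], [10, 20, 30]))   # -> 0.0
print(psi([100, 200, 300], [300, 200, 100]))
```

Computing this per feature on a sliding window of requests is enough to drive the alert threshold in the table.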
Tools
- Prometheus + Grafana: Metrics and dashboards
- Evidently: Model monitoring
- Weights & Biases: Experiment tracking
- Seldon: Model serving with monitoring
Implementation
from prometheus_client import Counter, Histogram

predictions = Counter('model_predictions_total', 'Total predictions')
latency = Histogram('model_latency_seconds', 'Request latency')

@app.post("/predict")
@latency.time()
def predict(data: dict):
    predictions.inc()
    prediction = model.predict([data['features']])
    return {"prediction": prediction.tolist()}
Best Practices
1. Version Control
models/
├── v1.0.0/
│   ├── model.pkl
│   └── metadata.json
└── v1.1.0/
    ├── model.pkl
    └── metadata.json
2. A/B Testing
# Route 10% of traffic to the new model
import random

def predict_ab(data):
    if random.random() < 0.1:
        return model_v2.predict(data)
    return model_v1.predict(data)
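Random routing gives each request an independent coin flip, so the same user can bounce between model versions mid-session. Hashing a stable identifier keeps the assignment consistent; a minimal sketch, where the identifier and the 10% split are placeholders:

```python
import hashlib

def bucket(user_id: str, new_model_pct: float = 0.10) -> str:
    """Deterministically assign a user to 'v1' or 'v2'."""
    digest = hashlib.sha256(user_id.encode()).digest()
    # Map the first 8 bytes of the hash to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "v2" if fraction < new_model_pct else "v1"

print(bucket("user-42"))  # the same user always gets the same answer
```

Because the split is a pure function of the identifier, no assignment table has to be stored or synchronized across replicas.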
3. Rollback Strategy
# Quick rollback
kubectl rollout undo deployment/model-api
4. Circuit Breaker
Fail fast when model is unhealthy:
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def predict_with_circuit(data):
    return model.predict(data)
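The `circuitbreaker` package above manages the state transitions for you. If you prefer to avoid a dependency, the pattern itself is small; a minimal sketch with placeholder thresholds:

```python
import time

class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures;
    allow one trial call after `recovery_timeout` seconds."""

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        half_open = False
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            half_open = True               # timeout elapsed: allow one trial
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit
                self.failures = 0
            raise
        self.opened_at = None
        self.failures = 0                  # success closes and resets
        return result
```

Wrapping `model.predict` in `breaker.call` means that once the model server is down, callers get an immediate error instead of piling up on timeouts.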
Cost Optimization
Strategies
- Spot Instances: up to ~70% cost savings
- Auto-scaling: Scale to zero
- Model Compression: Smaller models
- Caching: Reduce compute
- Batching: Higher throughput
Example Savings
| Setup | Monthly Cost |
|---|---|
| Always-on GPU | $2,000 |
| Auto-scaling | $600 |
| Spot + scaling | $200 |
Security
Best Practices
- API authentication (JWT, API keys)
- Input validation
- Rate limiting
- Model encryption at rest
- Audit logging
from fastapi import Security, HTTPException
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")

@app.post("/predict")
def predict(
    data: dict,
    api_key: str = Security(api_key_header),
):
    # VALID_API_KEY should come from a secret store, never source code
    if api_key != VALID_API_KEY:
        raise HTTPException(status_code=403)
    return model.predict(data)
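Rate limiting, listed in the practices above, can sit in front of the same endpoint. A minimal token-bucket sketch; the per-second rate and burst size are placeholders, and in production you would keep one bucket per API key:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens earned since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=5)   # 10 req/s, bursts of 5
print([bucket.allow() for _ in range(6)])   # sixth immediate call is rejected
```

A request that fails `allow()` would be answered with HTTP 429 rather than reaching the model.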
Learn more MLOps in our guides section.