
AI Model Deployment: From Training to Production

LearnClub AI
February 28, 2026
4 min read


Deploying AI models to production requires more than just training. This guide covers the complete MLOps pipeline, from packaging a trained model to running it as a scalable, monitored service.

Deployment Options

1. Cloud APIs

Providers:

  • OpenAI API
  • Anthropic API
  • Google Vertex AI
  • AWS SageMaker
  • Azure ML

Pros:

  • No infrastructure
  • Auto-scaling
  • Easy integration

Cons:

  • Vendor lock-in
  • Ongoing costs
  • Limited customization

2. Self-Hosted

Options:

  • Docker containers
  • Kubernetes
  • VM instances
  • Bare metal

Pros:

  • Full control
  • Cost optimization
  • Data privacy

Cons:

  • Infrastructure management
  • Scaling complexity
  • DevOps required

3. Edge Deployment

For:

  • Mobile apps
  • IoT devices
  • Browser inference
  • Offline use

Tools:

  • ONNX Runtime
  • TensorFlow Lite
  • Core ML
  • Transformers.js

Deployment Pipeline

Step 1: Model Packaging

# Save a scikit-learn model
import joblib
joblib.dump(model, 'model.pkl')

# Or, for PyTorch neural networks, save the state dict
import torch
torch.save(model.state_dict(), 'model.pth')

Step 2: Create API

from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load('model.pkl')  # load once at startup, not per request

@app.post("/predict")
def predict(data: dict):
    prediction = model.predict([data['features']])
    return {"prediction": prediction.tolist()}

Step 3: Containerize

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
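
The Dockerfile assumes a requirements.txt next to the code. A minimal one for this service might look like the following (in practice, pin exact versions for reproducible builds):

```
fastapi
uvicorn
joblib
scikit-learn
```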

Step 4: Deploy

# Build
docker build -t my-model-api .

# Run locally
docker run -p 8000:8000 my-model-api

# Tag and push to a registry
docker tag my-model-api registry/my-model-api
docker push registry/my-model-api

# Deploy to cloud
kubectl apply -f deployment.yaml
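
The kubectl command references a deployment.yaml. A minimal sketch, with names and image path matching the examples above (adjust resources and replica counts for your workload):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
      - name: model-api
        image: registry/my-model-api
        ports:
        - containerPort: 8000
```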

Scaling Strategies

Horizontal Scaling

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Batching

Batch requests for efficiency:

from pydantic import BaseModel

class BatchData(BaseModel):
    features: list[list[float]]  # one feature vector per request

@app.post("/predict_batch")
def predict_batch(data: BatchData):
    predictions = model.predict(data.features)
    return {"predictions": predictions.tolist()}

Caching

from functools import lru_cache

@lru_cache(maxsize=1000)
def get_prediction(features: tuple):
    # features must be hashable (e.g. a tuple) to serve as a cache key
    return model.predict([list(features)]).tolist()

Monitoring

Key Metrics

Metric          Target       Alert
Latency (p95)   < 200 ms     > 500 ms
Throughput      > 1000 RPS   < 500 RPS
Error Rate      < 1%         > 5%
Model Drift     < 0.1        > 0.2
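
The "Model Drift" row needs a concrete statistic behind it; one common choice (not prescribed here) is the Population Stability Index (PSI), which compares the live feature distribution against the training distribution. A minimal pure-Python sketch, with bin count illustrative and thresholds matching the table above:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate drift, > 0.2 significant."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float('inf')  # catch live values above the training max

    def frac(sample, i):
        n = sum(1 for x in sample if edges[i] <= x < edges[i + 1])
        return max(n / len(sample), 1e-6)  # avoid log(0) on empty bins

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )
```

Libraries like Evidently compute this (and richer drift tests) out of the box; the sketch only shows where the number in the table comes from.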

Tools

  • Prometheus + Grafana: Metrics and dashboards
  • Evidently: Model monitoring
  • Weights & Biases: Experiment tracking
  • Seldon: Model serving with monitoring

Implementation

from prometheus_client import Counter, Histogram

predictions = Counter('model_predictions_total', 'Total predictions')
latency = Histogram('model_latency_seconds', 'Request latency')

@app.post("/predict")
@latency.time()
def predict(data: dict):
    predictions.inc()
    prediction = model.predict([data['features']])
    return {"prediction": prediction.tolist()}

Best Practices

1. Version Control

models/
├── v1.0.0/
│   ├── model.pkl
│   └── metadata.json
└── v1.1.0/
    ├── model.pkl
    └── metadata.json
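
Each versioned model can carry a metadata.json describing how it was produced. The fields below are an illustrative convention, not a standard:

```json
{
  "version": "1.1.0",
  "trained_at": "2026-02-01",
  "framework": "scikit-learn",
  "metrics": {"accuracy": 0.94},
  "training_data": "dataset-v3"
}
```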

2. A/B Testing

import random

# Route 10% of traffic to the new model
if random.random() < 0.1:
    return model_v2.predict(data)
else:
    return model_v1.predict(data)
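
Purely random routing makes a single user's experience flicker between models across requests. A common refinement (a sketch, not part of the snippet above) is deterministic bucketing on a stable key such as the user ID:

```python
import hashlib

def use_new_model(user_id: str, rollout_pct: float = 0.1) -> bool:
    """Deterministically assign a user to the new model's bucket."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], 'big') / 2**64  # uniform in [0, 1)
    return bucket < rollout_pct
```

The same user always lands in the same bucket, and increasing rollout_pct only ever moves users from the old model to the new one.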

3. Rollback Strategy

# Quick rollback
kubectl rollout undo deployment/model-api

4. Circuit Breaker

Fail fast when model is unhealthy:

from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def predict_with_circuit(data):
    return model.predict(data)
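
If pulling in a dependency is not an option, the mechanics are simple enough to sketch by hand (class name and thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Open the circuit after N consecutive failures; retry after a timeout."""
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```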

Cost Optimization

Strategies

  1. Spot Instances: up to ~70% cost savings
  2. Auto-scaling: Scale to zero
  3. Model Compression: Smaller models
  4. Caching: Reduce compute
  5. Batching: Higher throughput
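
Of these, model compression is the least self-explanatory. Its simplest form is quantization: storing weights as 8-bit integers plus a scale factor instead of 32-bit floats, cutting memory roughly 4x. A toy sketch of the idea (real toolchains like ONNX Runtime or TensorFlow Lite do this per-layer with calibration):

```python
def quantize(weights):
    """Map float weights to int8 values plus a scale (symmetric quantization)."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero weights
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# each restored weight is within half a quantization step of the original
```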

Example Savings

Setup            Monthly Cost
Always-on GPU    $2,000
Auto-scaling     $600
Spot + scaling   $200

Security

Best Practices

  • API authentication (JWT, API keys)
  • Input validation
  • Rate limiting
  • Model encryption at rest
  • Audit logging

import os
from fastapi import Security, HTTPException
from fastapi.security import APIKeyHeader

API_KEY = os.environ["API_KEY"]  # never hard-code secrets in source
api_key_header = APIKeyHeader(name="X-API-Key")

@app.post("/predict")
def predict(
    data: dict,
    api_key: str = Security(api_key_header)
):
    if api_key != API_KEY:
        raise HTTPException(status_code=403)
    return {"prediction": model.predict([data['features']]).tolist()}

Learn more about MLOps in our guides section.
