AI Behind APIs: Microservices Patterns for ML Systems

ML models need to live somewhere
You've trained a model. It works in a notebook. Now someone needs to call it from a web app, a mobile client, and a batch pipeline. The question isn't whether to put it behind an API -- it's how to structure the service around it so it doesn't become a nightmare to operate.
ML inference has different characteristics than typical web services. It's compute-heavy, often GPU-bound, has variable latency depending on input size, and the "code" (model weights) changes on a separate cadence from the serving logic. Standard microservice patterns need adaptation for this.
Model-as-a-service
The simplest pattern: one model, one service, one API. A container runs your model with a REST or gRPC endpoint. Send input, get predictions back.
POST /v1/predict
{
  "text": "This product is terrible",
  "model_version": "sentiment-v3"
}

Response:
{
  "label": "negative",
  "confidence": 0.94,
  "model_version": "sentiment-v3",
  "latency_ms": 23
}
This works for a lot of cases. FastAPI + Uvicorn for Python models, Triton Inference Server for GPU-optimized serving, TorchServe if you're in the PyTorch ecosystem. The service owns its model lifecycle -- loading weights, managing GPU memory, handling versioning.
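As a sketch of the contract above, the serving logic can be a single handler that wraps the model call. This is a minimal stand-in, not a full FastAPI app; `run_model` is a hypothetical placeholder for whatever loaded model you actually call:

```python
import json
import time

MODEL_VERSION = "sentiment-v3"

def run_model(text: str) -> tuple[str, float]:
    # Placeholder for the real inference call (e.g. a loaded
    # transformers pipeline). Returns (label, confidence).
    return ("negative", 0.94) if "terrible" in text else ("positive", 0.90)

def handle_predict(request_body: str) -> str:
    """Handle a POST /v1/predict body and return the JSON response."""
    payload = json.loads(request_body)
    start = time.perf_counter()
    label, confidence = run_model(payload["text"])
    latency_ms = round((time.perf_counter() - start) * 1000)
    return json.dumps({
        "label": label,
        "confidence": confidence,
        "model_version": MODEL_VERSION,
        "latency_ms": latency_ms,
    })
```

In a real service this function sits behind a FastAPI or gRPC endpoint; the point is that the handler owns timing and versioning metadata, not the caller.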
Where it gets tricky: GPU utilization. A single model sitting on a GPU waiting for requests wastes expensive hardware. If your traffic is bursty -- lots of requests at 2pm, nothing at 3am -- you're either over-provisioned or under-provisioned.
Async inference with queues
Synchronous request-response doesn't suit every ML workload. Image generation takes seconds. Document processing takes minutes. Video analysis can take hours. Your API gateway will time out long before some of these finish.
The pattern: accept the request, return a job ID, process asynchronously, let the client poll or receive a webhook when it's done.
POST /v1/jobs
→ {"job_id": "abc-123", "status": "queued"}
GET /v1/jobs/abc-123
→ {"job_id": "abc-123", "status": "completed", "result": {...}}
Use a message queue (SQS, RabbitMQ, Redis Streams) between the API and the inference workers. This gives you natural backpressure -- if inference can't keep up, the queue grows instead of the service falling over. Workers pull jobs at their own pace.
The downside is complexity. You need a job store (where results live until the client retrieves them), dead letter queues for failed jobs, and monitoring for queue depth. If queue depth is growing faster than workers drain it, you need more workers or your model is too slow.
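The moving parts fit together roughly like this. The sketch below uses an in-memory queue and dict as stand-ins for SQS/RabbitMQ and a real job store, just to show the shape of the pattern:

```python
import queue
import threading
import uuid

jobs: dict[str, dict] = {}  # job store: results live here until retrieved
work_queue: queue.Queue = queue.Queue(maxsize=1000)  # bounded => backpressure

def submit_job(payload: dict) -> dict:
    """POST /v1/jobs: enqueue and return immediately with a job ID."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"job_id": job_id, "status": "queued"}
    work_queue.put((job_id, payload))  # blocks when the queue is full
    return jobs[job_id]

def get_job(job_id: str) -> dict:
    """GET /v1/jobs/{job_id}: poll for status and result."""
    return jobs[job_id]

def worker(run_inference):
    """Inference worker: pulls jobs off the queue at its own pace."""
    while True:
        job_id, payload = work_queue.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = run_inference(payload)
            jobs[job_id]["status"] = "completed"
        except Exception as exc:
            # A real deployment routes this to a dead letter queue.
            jobs[job_id].update(status="failed", error=str(exc))
        finally:
            work_queue.task_done()
```

The bounded queue is the backpressure mechanism: when it fills, `submit_job` blocks (or your API returns 503) instead of the workers being overrun.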
Batching strategies
ML models, especially on GPUs, are more efficient processing multiple inputs at once. A model that takes 10ms for one input might take 12ms for a batch of 8 -- the per-input cost drops from 10ms to 1.5ms, an 85% reduction, which is a 6-7x throughput gain.
Client-side batching is the simplest. The caller collects multiple requests and sends them together. This works for batch pipelines but not for real-time serving where each request comes from a different user.
Server-side batching is more useful. The inference service collects incoming requests for a short window (say 50ms), packs them into a batch, runs inference once, and returns individual results. Triton Inference Server and TensorFlow Serving both support this natively.
The tuning parameters are batch window (how long to wait for more requests) and max batch size (when to stop waiting and run). Too short a window and you batch nothing. Too long and you add latency. In practice, I've found 20-50ms windows with max batch sizes of 16-64 work for most real-time serving scenarios. Test with your actual traffic patterns.
Dynamic batching adjusts the window based on load. Under high traffic, batches fill quickly and the window barely matters. Under low traffic, you reduce the window to avoid adding latency for sparse requests. Most serving frameworks handle this automatically if you configure the bounds.
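The core of a server-side batcher is a small collection loop. Serving frameworks implement this for you; this sketch just makes the window/max-size trade-off concrete:

```python
import queue
import time

def collect_batch(request_queue: queue.Queue,
                  window_ms: float = 50.0,
                  max_batch_size: int = 32) -> list:
    """Collect requests until the batch is full or the window closes."""
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed: run with what we have
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch
```

Under heavy traffic the `max_batch_size` check fires first and the window barely matters; under light traffic the deadline fires and each request waits at most `window_ms` extra, which is exactly the latency trade-off described above.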
The sidecar pattern for model serving
The sidecar pattern deploys a model-serving container alongside your application container. They share a pod (in Kubernetes terms) or a task definition (in ECS). The app container handles business logic, auth, and routing. The sidecar handles model loading, inference, and GPU management.
Why bother? Separation of concerns. Your application team writes Python or Go or whatever they want. The ML team packages models into a standard sidecar image. They deploy independently. The app calls the sidecar over localhost, so network latency is negligible.
This pattern works well when you have multiple application services that all need ML capabilities. Instead of each team embedding model-serving code in their service, they all use the same sidecar image. Model updates roll out by updating the sidecar image -- the application code doesn't change.
Service meshes use the same pattern for networking -- Envoy runs as a proxy sidecar next to each service. The ML sidecar applies that idea to inference.
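From the application container's side, calling the sidecar is just an HTTP request to the loopback interface. The port and path below are assumptions -- use whatever your serving image (Triton, TorchServe, a custom app) actually exposes:

```python
import json
import urllib.request

# Assumed sidecar endpoint; both containers share the pod's network
# namespace, so 127.0.0.1 reaches the sidecar directly.
SIDECAR_URL = "http://127.0.0.1:8501/v1/predict"

def classify(text: str, url: str = SIDECAR_URL, timeout_s: float = 1.0) -> dict:
    """Call the model sidecar over localhost -- no network hop involved."""
    body = json.dumps({"text": text}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())
```

Because the hop is loopback-only, a short timeout is reasonable: if the sidecar is slow, you want to fail fast rather than queue in the app container.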
API gateway patterns for model routing
When you're running multiple model versions or A/B testing models, you need something to route traffic. An API gateway sits in front of your model services and handles:
Version routing -- send 90% of traffic to v3 and 10% to v4. This is your canary deployment mechanism for models. If v4's metrics degrade, shift traffic back to v3 without any redeployment.
Shadow mode -- send traffic to the production model AND a candidate model. Only return the production model's response, but log both. You get real-world evaluation data without any risk to users.
Feature-based routing -- send premium users to a larger, more expensive model and free users to a smaller one. Or route based on input characteristics -- short text to a lightweight model, long documents to a more capable one.
Kong, Envoy, or a custom NGINX config can handle this. For simpler setups, a Python service with routing logic works fine. Don't over-engineer the gateway until you have enough model variants to justify it.
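The version-routing piece of such a gateway is small. A sketch of weighted routing for the 90/10 canary split (model names are illustrative):

```python
import random

# Canary split: 90% of traffic to the incumbent, 10% to the candidate.
ROUTES = {"sentiment-v3": 0.90, "sentiment-v4": 0.10}

def pick_version(routes: dict[str, float], rng=random.random) -> str:
    """Weighted random choice over model versions.

    Weights are assumed to sum to 1.0; `rng` is injectable for testing.
    """
    r = rng()
    cumulative = 0.0
    for version, weight in routes.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # guard against float rounding: fall back to last entry
```

Shifting traffic back after a bad canary is then just editing the weights -- no redeployment. Shadow mode is the same dispatcher calling both versions but returning only the incumbent's response and logging the candidate's.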
The hard operational stuff
Cold starts are painful with ML models. A container that takes 30 seconds to load a 2GB model into GPU memory is a problem if your autoscaler needs to react to traffic spikes. Mitigations: keep minimum replicas warm, use model caching (load weights from shared storage instead of bundling them in the container image), or use smaller models that load faster.
GPU sharing is an ongoing headache. MIG (Multi-Instance GPU) on NVIDIA A100s and H100s lets you partition a GPU for multiple models. Without MIG, you're either running one model per GPU (wasteful for small models) or using CUDA MPS to time-share (which works but adds complexity). Kubernetes device plugins help with scheduling, but GPU scheduling is still rougher than CPU scheduling.
Cost optimization comes down to utilization. Track GPU utilization per service. If a model service averages 15% GPU utilization, it's a candidate for consolidation -- either share the GPU with another model or move to a smaller instance. Spot/preemptible instances work for async inference where you can tolerate interruptions. They don't work for real-time serving where you need consistent availability.
When not to use microservices for AI
Sometimes a monolith is the right call. I mean this genuinely.
If you have one model, one team, and moderate traffic, a single service with the model baked in is simpler, cheaper, and easier to debug. You don't need Kubernetes, a service mesh, and a message queue to serve a text classifier.
If your model is tightly coupled to business logic -- the model output feeds directly into a decision that feeds back into the next model call within the same request -- splitting it across services adds latency and complexity for no benefit.
If you're a team of two or three, the operational overhead of multiple services, a queue, a gateway, and distributed tracing will consume more engineering time than it saves. Build the monolith, put it behind a load balancer, and revisit when you actually hit scaling problems.
The microservice patterns in this post are tools. Use them when the problem demands it -- multiple models, multiple teams, independent scaling needs, different deployment cadences. Don't use them because they look good on an architecture diagram.


