Modular AI: Building Systems You Can Actually Maintain

The monolith problem in ML

Most ML systems start as a single script. Data loading, preprocessing, model inference, and post-processing all live in one file, maybe one function. This works fine for prototyping. It stops working the moment you need to swap a model, update a preprocessing step, or scale one part independently.

I've inherited enough of these monoliths to know how they end up. Someone builds a proof of concept. It works. Management wants it in production by Thursday. Six months later you have a 3,000-line inference service where changing the tokenizer means touching code in fourteen places.

Modular AI is not a new idea. It's the same principle behind Unix pipes, microservices, and component-based frontend frameworks applied to ML systems. Break your pipeline into discrete modules with clear interfaces, and you can swap, scale, and debug each piece independently.

A practical module taxonomy

Most ML systems decompose into four layers. Not every system needs all four, but thinking in these categories helps.

Preprocessing modules handle everything before the model sees data. Tokenization, feature extraction, normalization, image resizing, embedding lookup. These change more often than you'd expect -- a new tokenizer version, a change in feature engineering, adding a new data source. If preprocessing is tangled with inference, every change is a full redeployment.

Inference modules wrap the model itself. They take preprocessed input, run it through the model, and return raw output. The key decision here is granularity -- do you wrap one model per module, or group related models? I've found one-model-per-module works best unless models are tightly coupled (like an encoder-decoder pair that always deploys together).

Post-processing modules transform raw model output into something useful. Thresholding classification scores, formatting text output, applying business rules, filtering unsafe content. This is where a lot of product logic lives, and it changes on a different cadence than the model itself. Keeping it separate means product teams can adjust thresholds without touching the model serving infrastructure.

Monitoring modules observe everything else -- input distributions, output distributions, latency, error rates, drift detection. These should be side-effects, not inline. Monitoring code mixed into inference code is a maintenance nightmare and a latency risk.
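One way to keep monitoring out of the inference path is a decorator that records metrics as a side effect. Here's a minimal stdlib sketch of that idea -- the in-memory `metrics` list and the stage names are placeholders for whatever metrics backend (StatsD, Prometheus, etc.) you actually use:

```python
import time
from functools import wraps

# Placeholder sink: in production this would push to a metrics backend;
# here it just collects records in memory for illustration.
metrics: list[dict] = []

def monitored(stage: str):
    """Wrap a pipeline stage so latency and errors are recorded as a
    side effect, without mixing monitoring logic into the stage itself."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                metrics.append({"stage": stage, "ok": True,
                                "latency_ms": (time.perf_counter() - start) * 1000})
                return result
            except Exception:
                metrics.append({"stage": stage, "ok": False,
                                "latency_ms": (time.perf_counter() - start) * 1000})
                raise
        return wrapper
    return decorator

@monitored("preprocess")
def preprocess(text: str) -> list[str]:
    # Stand-in for a real preprocessing step.
    return text.lower().split()
```

The stage function stays oblivious to monitoring; swapping the sink or adding a new metric touches only the decorator.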

Interface contracts are the whole game

Modules are only as good as the interfaces between them. Without clear contracts, you end up with modules that look modular on paper but are tightly coupled in practice -- and they break together.

An interface contract specifies input schema, output schema, and behavioral expectations. For an inference module, that might look like:

Input: {"tokens": int[], "attention_mask": int[], "max_length": int}
Output: {"logits": float[][], "latency_ms": float}
Errors: InputValidationError, ModelTimeoutError, OOMError
SLA: p99 latency < 200ms, availability > 99.9%

The schema part is straightforward. The behavioral part -- latency bounds, error handling, degradation behavior -- is where teams usually cut corners and pay for it later.

Use schema validation at module boundaries. Pydantic models, JSON Schema, Protocol Buffers -- pick one and enforce it. The ten minutes you spend writing a schema saves hours of debugging when Module A starts sending a field that Module B doesn't expect.
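As a minimal sketch of boundary validation using only stdlib dataclasses (Pydantic or JSON Schema would give you most of this for free), with field names mirroring the example contract above:

```python
from dataclasses import dataclass

class InputValidationError(Exception):
    """Raised when a payload violates the module's input contract."""

@dataclass
class InferenceInput:
    # Field names mirror the example contract above; types are checked
    # by hand here -- Pydantic would do this declaratively.
    tokens: list
    attention_mask: list
    max_length: int

    def __post_init__(self):
        if len(self.tokens) != len(self.attention_mask):
            raise InputValidationError("tokens and attention_mask lengths differ")
        if not all(isinstance(t, int) for t in self.tokens):
            raise InputValidationError("tokens must be ints")
        if self.max_length <= 0:
            raise InputValidationError("max_length must be positive")

def parse_request(payload: dict) -> InferenceInput:
    """Validate at the module boundary; reject malformed input early."""
    try:
        return InferenceInput(**payload)
    except TypeError as e:  # missing or unexpected fields
        raise InputValidationError(str(e)) from e
```

A payload with a missing field or mismatched array lengths fails loudly at the boundary instead of producing garbage three modules downstream.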

Hot-swapping models without downtime

One of the biggest wins of modular design is swapping models without taking the system down. Here's how it works in practice.

The registry pattern is the foundation. You maintain a model registry (MLflow, a database, even a config file for simple cases) that maps model identifiers to artifacts and metadata. Each model version has an entry with its location, expected input/output schemas, performance benchmarks, and deployment status.
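The shape of the pattern fits in a few lines. This is an in-memory sketch -- in practice the registry would be MLflow or a database table, and the model names, artifact paths, and status values here are purely illustrative:

```python
# Illustrative in-memory registry: model name -> version -> metadata.
# A real deployment would back this with MLflow or a database.
REGISTRY = {
    "sentiment": {
        "v1": {"artifact": "s3://models/sentiment/v1", "status": "production"},
        "v2": {"artifact": "s3://models/sentiment/v2", "status": "canary"},
    }
}

def resolve(model_name: str, status: str = "production") -> str:
    """Return the artifact location for the version with the given status."""
    for version, meta in REGISTRY[model_name].items():
        if meta["status"] == status:
            return meta["artifact"]
    raise KeyError(f"no {status} version of {model_name}")
```

Promoting a model is then a metadata change -- flip the status fields -- rather than a code change in the inference module.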

When you want to swap a model:

  1. Deploy the new model version alongside the old one
  2. Route a percentage of traffic to the new version (canary deployment)
  3. Monitor the comparison metrics
  4. If the new version passes your criteria, shift all traffic
  5. Keep the old version warm for a rollback window
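Step 2 is often implemented as a deterministic hash-based split, so a given caller always lands on the same version during the canary window. A minimal sketch (the request-ID scheme is an assumption; any stable key works):

```python
import hashlib

def route_version(request_id: str, canary_pct: float) -> str:
    """Deterministically route a fixed percentage of traffic to the canary.
    Hashing the request ID pins a given caller to one version, which keeps
    comparison metrics clean during the canary window."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "stable"
```

Shifting all traffic (step 4) is just raising `canary_pct` to 100; rolling back is dropping it to 0.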

The inference module doesn't need to know which specific model version it's running. It pulls from the registry based on a routing configuration. The module's code doesn't change -- only the config does.

For this to work, the new model must satisfy the same interface contract. If your new model changes the output format, it's not a swap -- it's a new module, and downstream modules need updates too.

Monolithic vs modular -- real tradeoffs

Modular doesn't always win. Here's when each makes sense.

Monolithic works when you have a single model with simple pre/post-processing, a small team (1-3 people), low change frequency, and latency requirements so tight that inter-module communication overhead matters. A single Flask app serving a sentiment classifier with minimal preprocessing is fine as a monolith. Don't over-engineer it.

Modular pays off when you have multiple models that change independently, different teams own different parts of the pipeline, you need to scale components independently (your preprocessing is CPU-bound but inference is GPU-bound), or you deploy model updates frequently. If you're updating your model weekly but your post-processing quarterly, they shouldn't be in the same deployment unit.

The overhead is real. Modular systems need service discovery or a message bus, serialization/deserialization at boundaries, distributed tracing to debug issues across modules, and more operational tooling. For a two-person team running one model, this overhead isn't worth it.

MIT's take on modular software frameworks

MIT's Computer Science and Artificial Intelligence Laboratory published research in 2025 on modular software frameworks that's relevant here. Their work focused on "composable ML" -- defining a standard set of module interfaces that allow mix-and-match composition of ML components.

The interesting finding was about failure modes. When modules have well-defined interfaces, failures are contained -- a bad preprocessing module produces garbage output, but the inference module rejects it at the boundary instead of silently producing wrong results. In their experiments, modular systems caught 73% of data quality issues at module boundaries that monolithic systems would have propagated silently to the output.

Their framework also showed that modular systems were 40% faster to debug on average. When something goes wrong, you can test each module in isolation with known inputs. With a monolith, you're adding print statements and praying.

Making it work in practice

Start with a monolith. Seriously. Build the thing, prove it works, understand the data flow. Then identify the natural seams -- the places where data transforms from one representation to another. Those are your module boundaries.

Extract modules one at a time. Preprocessing first, usually, because it changes most often and has the least coupling to the model itself. Then separate monitoring. Then, if you're updating models frequently, extract inference into its own module with registry-based routing.

Use containers for each module. Docker makes this straightforward. Each module has its own image, its own dependencies, its own scaling configuration. Your preprocessing module might need pandas and scikit-learn. Your inference module needs PyTorch and a GPU. They shouldn't share a dependency tree.

For communication between modules, keep it simple. gRPC if latency matters, HTTP/JSON if it doesn't, and message queues (Kafka, RabbitMQ) if you need async processing. Don't reach for a service mesh until you actually have enough modules to justify it.

Write integration tests that send data through the full pipeline, not just unit tests for each module. A module that passes all its unit tests can still break the system if its output schema drifts from what the next module expects. Contract testing tools like Pact help here.
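The simplest form of a contract test just asserts that one module's output schema covers the next module's input schema. A toy sketch -- the module names and field sets are hypothetical, and tools like Pact formalize this across repos:

```python
# Hypothetical schemas: the fields the preprocessing module emits and
# the fields the inference module requires.
PREPROCESS_OUTPUT_SCHEMA = {"tokens", "attention_mask", "max_length"}
INFERENCE_INPUT_SCHEMA = {"tokens", "attention_mask", "max_length"}

def contract_holds(producer: set, consumer: set) -> bool:
    """True if the producer emits every field the consumer requires."""
    return consumer <= producer

def test_preprocess_feeds_inference():
    assert contract_holds(PREPROCESS_OUTPUT_SCHEMA, INFERENCE_INPUT_SCHEMA)
```

Run in CI for both modules, a check like this turns silent schema drift into a failing build.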

The goal isn't maximum modularity. It's appropriate modularity -- enough separation to let independent things change independently, without so much overhead that you spend more time on infrastructure than on the actual ML problem.