Hybrid AI: When LLMs Meet Expert Systems

LLMs are bad at certain things. Not "sometimes unreliable" bad -- structurally bad, in ways that more training data won't fix. Ask GPT-4 to apply a 47-step compliance checklist to a loan application and it'll miss steps. Ask it to do multi-step arithmetic and it'll return confidently wrong answers. Ask it to follow a decision tree with 200 branches and it'll hallucinate paths that don't exist.
This isn't a criticism. It's a description of what statistical text prediction is good at and what it isn't. The interesting question is what you do about it.
Where LLMs fall apart
Three categories of failure show up repeatedly:
Deterministic rule following. Tax code, regulatory compliance, clinical protocols. These are if-then-else trees, sometimes thousands of branches deep. LLMs approximate the rules from training data rather than executing them. Close enough works for a blog post. It does not work for deciding whether a drug interaction is contraindicated.
Precise computation. Anything involving exact math, date calculations, or logical deduction over structured data. LLMs can fake simple arithmetic, but compound interest calculations, statistical tests, or constraint satisfaction problems need actual computation.
Consistency over long sequences. Ask an LLM to maintain state across a 50-step process -- checking each step against previous outputs, tracking running totals, applying cascading rules. Context windows help, but the model is still predicting text, not executing a program. Errors accumulate.
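The second category is the easiest one to demonstrate. A compound interest calculation, for instance, is a few lines of exact arithmetic that an LLM should hand off rather than approximate token by token. A minimal sketch using Python's decimal module (the function and its signature are illustrative, not from any particular system):

```python
from decimal import Decimal, ROUND_HALF_UP

def compound_interest(principal: str, annual_rate: str, years: int,
                      compounds_per_year: int = 12) -> Decimal:
    """Exact compound interest -- the kind of computation an LLM
    should delegate to code rather than approximate in text."""
    p = Decimal(principal)
    r = Decimal(annual_rate)
    n = compounds_per_year
    amount = p * (1 + r / n) ** (n * years)
    # Round to cents with a deterministic rounding mode.
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Passing amounts as strings into Decimal avoids binary-float artifacts entirely; the same inputs produce the same cents every time, which is exactly the guarantee the LLM can't make.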
The hybrid architecture
The fix is straightforward: let the LLM do what it's good at (understanding natural language, reasoning about ambiguity, generating explanations) and delegate the rest to systems built for precision.
A typical hybrid setup looks like this:
User input (natural language)
|
v
LLM Orchestrator
(understands intent, extracts parameters, decides routing)
|
+---> Rule Engine (compliance checks, decision trees)
|
+---> Calculator/Solver (math, optimization, constraints)
|
+---> Knowledge Graph (structured relationships, taxonomy lookups)
|
+---> Database (exact record retrieval, aggregation)
|
v
LLM Synthesizer
(combines outputs into natural language response)
The LLM sits at both ends. It interprets the question and it explains the answer. But it doesn't compute the answer. The deterministic modules do that.
Neuro-symbolic AI
This pattern has a name in the research world: neuro-symbolic AI. The "neuro" part is the LLM (or any neural network). The "symbolic" part is classical AI -- rule engines, logic solvers, knowledge graphs, ontologies.
The idea has been around since the 1990s, but it's having a moment because LLMs finally provide a good enough natural language interface. Before LLMs, the bottleneck was always getting unstructured human input into a format the symbolic system could work with. Now you can use an LLM to parse "Is my patient on anything that interacts with warfarin?" into a structured query against a drug interaction database.
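Once the LLM has done that parsing, the symbolic side is just an exact lookup. A toy version of the warfarin example, with a hard-coded interaction table standing in for a real drug-interaction database (the entries are illustrative only):

```python
# Hypothetical interaction table; keys are sorted drug-name pairs.
# A real system would query a maintained drug-interaction database.
INTERACTIONS = {
    ("aspirin", "warfarin"): "increased bleeding risk",
    ("ibuprofen", "warfarin"): "increased bleeding risk",
}

def check_interactions(drug: str, current_meds: list[str]) -> list[tuple[str, str]]:
    """Exact lookup the orchestrator routes to after the LLM parses
    the question into {"drug": ..., "current_meds": [...]}."""
    hits = []
    for med in current_meds:
        key = tuple(sorted((drug.lower(), med.lower())))
        if key in INTERACTIONS:
            hits.append((med, INTERACTIONS[key]))
    return hits
```

The lookup either finds an interaction or it doesn't. There's no plausible-sounding middle ground, which is the whole point of routing the question here.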
The symbolic side gives you things LLMs can't provide on their own:
- Guaranteed correctness for well-defined problems. A rule engine either follows the rules or it has a bug. There's no hallucination.
- Explainability. You can trace exactly which rules fired and why. Try getting that from an LLM.
- Auditability. Regulators want to see the decision path, not a probability distribution.
- Consistency. Same input, same output, every time.
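A rule engine small enough to fit in a blog post can still show what explainability and auditability look like concretely. Here's a sketch for the loan-application case from the intro; the rule names and thresholds are invented for illustration, not real underwriting policy:

```python
from typing import Any, Callable, Dict, List, Tuple

# Each rule is (name, requirement). Thresholds are illustrative.
RULES: List[Tuple[str, Callable[[Dict[str, Any]], bool]]] = [
    ("R1-income-documented", lambda a: a["income_documented"]),
    ("R2-dti-under-43pct",   lambda a: a["debt_to_income"] <= 0.43),
    ("R3-credit-score-min",  lambda a: a["credit_score"] >= 620),
]

def evaluate(application: Dict[str, Any]) -> Tuple[str, List[str]]:
    """Check every rule; the returned list of failed rules is the
    audit trail -- exactly which rules fired and why."""
    failed = [name for name, check in RULES if not check(application)]
    return ("approve" if not failed else "refer", failed)
```

Same input, same output, and the trace of failed rules is the decision path a regulator can read. No probability distribution required.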
Real examples
Medical diagnosis systems. Several hospital systems run setups where an LLM handles the patient-facing conversation -- gathering symptoms, asking follow-up questions, understanding context. But the actual diagnostic reasoning goes through a clinical rule engine built from medical guidelines. The LLM extracts structured data from the conversation (symptoms: fever, duration: 3 days, severity: moderate). The rule engine runs differential diagnosis logic. The LLM then explains the results back to the clinician.
This matters because medical guidelines change. When a new protocol comes out, you update the rule engine. You don't retrain the LLM and hope it picks up the change.
Financial compliance. Banks process thousands of transactions against anti-money-laundering rules daily. The rules are specific: flag transactions above a threshold, flag patterns matching known laundering schemes, flag transfers to sanctioned entities. An LLM helps with the fuzzy parts -- is this customer's explanation for a large transfer plausible? Does this free-text description of a business match the declared industry code? But the actual compliance checks run on a rule engine that implements the regulations exactly.
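The three flag types above translate directly into deterministic checks. A sketch, with an invented sanctions list and placeholder thresholds (real AML screening is far more involved):

```python
from typing import Any, Dict, List

SANCTIONED = {"acme-shell-co"}   # illustrative placeholder list
REPORT_THRESHOLD = 10_000        # placeholder reporting threshold

def aml_flags(txn: Dict[str, Any]) -> List[str]:
    """Deterministic screening; the LLM never decides these --
    it only handles the fuzzy follow-up questions."""
    flags = []
    if txn["amount"] >= REPORT_THRESHOLD:
        flags.append("above-reporting-threshold")
    if txn["counterparty"].lower() in SANCTIONED:
        flags.append("sanctioned-counterparty")
    # Crude structuring heuristic: repeated just-under-threshold transfers.
    if txn.get("recent_count_under_threshold", 0) >= 3:
        flags.append("possible-structuring")
    return flags
```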
Legal contract analysis. An LLM reads a contract and understands what it says in plain language. A rule engine checks whether specific required clauses are present, whether indemnification limits fall within policy, whether termination provisions match the template. The LLM handles the language; the rules handle the policy.
How to wire it together
The practical architecture has a few components:
Intent classifier. The LLM's first job is figuring out what kind of question this is. Does it need a rule engine? A calculation? A database lookup? Or can the LLM handle it directly? This routing decision is where most of the prompt engineering effort goes.
Schema extraction. The LLM converts natural language into structured inputs that the deterministic modules expect. "What's the tax on a $50,000 salary in California for someone filing single?" becomes {income: 50000, state: "CA", filing_status: "single"}. This step needs to be reliable -- garbage in, garbage out.
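One cheap way to enforce that reliability is a validation gate between the LLM's extraction and the module: reject malformed parameters before executing rather than compute on garbage. A sketch, with an invented schema for a hypothetical tax module:

```python
from typing import Any, Dict, List, Tuple

# Hypothetical expected schema for the tax module: field -> allowed type(s).
TAX_SCHEMA = {"income": (int, float), "state": str, "filing_status": str}

def validate_extraction(extracted: Dict[str, Any],
                        schema: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Gate between LLM extraction and the deterministic module."""
    errors = []
    for field, expected_type in schema.items():
        if field not in extracted:
            errors.append(f"missing field: {field}")
        elif not isinstance(extracted[field], expected_type):
            errors.append(f"wrong type for {field}")
    return (not errors, errors)
```

When validation fails, the error list feeds straight into the fallback path: the LLM can ask the user for exactly the fields that are missing or malformed.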
Module execution. The deterministic system runs and returns structured output. No ambiguity, no creativity, just the right answer.
Response synthesis. The LLM takes the structured output and generates a human-readable response. It can add context, caveats, and follow-up suggestions that the rule engine doesn't know about.
Fallback handling. Sometimes the LLM can't extract clean parameters, or the question doesn't map to any module. You need a graceful fallback -- either ask for clarification or let the LLM answer with appropriate uncertainty disclaimers.
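The five components wire together into a short pipeline. In this sketch the classifier, extractor, and tax logic are stubs standing in for real LLM calls and a real rule engine; the flat 8% rate and the hard-coded extraction pattern are placeholders, not real tax logic:

```python
from typing import Any, Dict, Optional

def classify_intent(question: str) -> str:
    # Stub for an LLM classification call.
    return "tax" if "tax" in question.lower() else "general"

def extract_params(question: str) -> Optional[Dict[str, Any]]:
    # Stub for LLM extraction; recognizes one hard-coded pattern.
    if "50,000" in question and "California" in question:
        return {"income": 50_000, "state": "CA", "filing_status": "single"}
    return None

def tax_module(params: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder flat rate -- not real CA tax logic.
    return {"tax": round(params["income"] * 0.08, 2)}

def answer(question: str) -> str:
    intent = classify_intent(question)              # 1. route
    if intent == "general":
        return "LLM answers directly, with uncertainty disclaimers."
    params = extract_params(question)               # 2. extract
    if params is None:                              # 5. fallback
        return "Could you clarify the income, state, and filing status?"
    result = tax_module(params)                     # 3. execute
    return f"Estimated tax: ${result['tax']:,.2f}"  # 4. synthesize
```

Note that the fallback is a first-class branch, not an afterthought: failed extraction produces a clarifying question rather than a guess.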
The tricky parts
Schema drift. The LLM's extraction and the rule engine's expectations need to stay in sync. When someone adds a new field to the rule engine, the LLM needs to know about it. This is a versioning problem, and it's annoying.
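One cheap mitigation is to version the contract explicitly and fail loudly on mismatch, so a stale extractor can't silently feed a newer engine. A sketch of the handshake (the class and version strings are invented for illustration):

```python
class RuleEngine:
    """Toy engine that declares which schema version it expects."""
    EXPECTED_SCHEMA_VERSION = "2.1"

    def run(self, payload: dict) -> dict:
        sent = payload.get("schema_version")
        if sent != self.EXPECTED_SCHEMA_VERSION:
            # Fail loudly: a version mismatch is a deploy bug,
            # not something to paper over at runtime.
            raise ValueError(f"schema drift: extractor sent {sent}, "
                             f"engine expects {self.EXPECTED_SCHEMA_VERSION}")
        return {"status": "ok"}
```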
Routing errors. If the intent classifier sends a compliance question to the general LLM instead of the rule engine, you get a plausible-sounding wrong answer. Worse, nobody might notice because the LLM's response looks reasonable. Test your routing thoroughly.
Latency. Adding deterministic modules adds round trips. For real-time applications, you need to think about which modules can run in parallel and which need to be sequential.
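When modules don't depend on each other's outputs, fanning them out concurrently caps total latency at roughly the slowest module instead of the sum. A sketch with simulated round trips standing in for real module calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def rules_check(query: str) -> str:
    time.sleep(0.05)  # simulated module round trip
    return "rules: pass"

def db_lookup(query: str) -> str:
    time.sleep(0.05)  # simulated module round trip
    return "db: 3 records"

def run_parallel(query: str) -> list[str]:
    """Fan independent modules out concurrently; total latency is
    roughly the slowest module, not the sum of all of them."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, query) for f in (rules_check, db_lookup)]
        return [f.result() for f in futures]
```

Modules whose inputs depend on another module's output (say, a rule check that needs a database record first) still have to run sequentially; the win comes from identifying which calls are genuinely independent.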
Maintenance burden. You're now maintaining two systems: the LLM pipeline and the rule engine. The rule engine needs domain experts to update it. The LLM pipeline needs ML engineers. These are different teams with different skill sets, and they need to coordinate.
When to go hybrid vs. pure LLM
Use a hybrid approach when:
- Wrong answers have real consequences (medical, financial, legal)
- You need to explain your reasoning to a regulator or auditor
- The rules are well-defined and change through a formal process
- Consistency matters more than flexibility
Stick with a pure LLM when:
- Approximate answers are fine
- The problem is genuinely ambiguous and there's no "right" answer
- You need to handle inputs that no rule engine could anticipate
- Speed of development matters more than precision
Most interesting problems end up needing both. The question is where you draw the line between "let the LLM handle it" and "route this to a deterministic system." That line shifts as models get better, but I don't think it disappears. There will always be problems where you want a guarantee, not a probability.


