A bank needs to approve or decline a transaction in 200 milliseconds. A logistics platform needs to reroute 10,000 shipments when a port closes. A manufacturing system needs to adjust production parameters before a defect propagates down the line.

These aren't batch jobs. They're real-time decisions that need to be fast, correct, and auditable. And they need to work at scale — not dozens of decisions per minute, but thousands per second.

We build these systems. Here's how they work.

Why Real-Time Is Hard

Making a single fast decision is easy. Making thousands of fast decisions per second, each one correct and traceable, with graceful degradation under load — that's a different problem entirely.

The challenges compound:

Consistency under concurrency. When two decisions need the same data and that data is changing, you need a consistency model that's both correct and fast. Strong consistency kills latency. Eventual consistency introduces errors. You need to be precise about which data needs to be consistent and which can tolerate staleness.

Latency budgets. In a 200ms end-to-end budget, every component gets a slice. Network: 20ms. Data retrieval: 40ms. Model inference: 60ms. Business rules: 30ms. Response serialization: 10ms. That leaves 40ms for everything you forgot about. If any component exceeds its budget, the decision misses its deadline.
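A minimal sketch of that budget arithmetic, using the illustrative numbers above (component names and the dict layout are assumptions for the sketch, not a fixed API):

```python
# Per-request latency budget, using the illustrative numbers above.
BUDGET_MS = 200

component_budgets_ms = {
    "network": 20,
    "data_retrieval": 40,
    "model_inference": 60,
    "business_rules": 30,
    "serialization": 10,
}

allocated = sum(component_budgets_ms.values())  # 160 ms spoken for
slack = BUDGET_MS - allocated                   # 40 ms for everything you forgot

def over_budget(component: str, observed_ms: float) -> bool:
    """True if a component exceeded its slice of the end-to-end budget."""
    return observed_ms > component_budgets_ms[component]
```

Tracking observed latency against each slice, not just the end-to-end total, is what tells you which component to fix when deadlines start slipping.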

Failure isolation. When a downstream service fails, that failure must not cascade into the decision pipeline. A slow fraud detection model can't block a payment. A failing enrichment service can't prevent a routing decision. Each component needs independent failure handling.
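One common way to get that independence is a circuit breaker around each dependency. Here's a minimal sketch, assuming illustrative thresholds and a local fallback value per call; real implementations add per-dependency metrics and half-open probing:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures, skip the dependency
    entirely for cooldown_s seconds and answer with the fallback."""

    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback          # circuit open: don't touch the dependency
            self.opened_at = None        # cooldown over: try again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback
```

The point is that a failing dependency costs you one fallback answer per request, never a stalled pipeline.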

Auditability at volume. Regulators don't care that you make 50,000 decisions per second. They care about one specific decision made at 14:23:07 on March 15th. Your system needs to produce a complete decision trace for any individual decision without impacting throughput.

The Architecture

Our real-time decision systems follow a consistent pattern, refined across multiple deployments:

Decision Pipeline

Event Stream → Enrichment → Decision Engine → Action Dispatcher → Audit Log
                   ↓              ↓                                    ↓
              Feature Store    Model Cache                      Trace Store

Every decision flows through five stages:

  1. Event ingestion — Events arrive via message queue. Each event is immutable and timestamped. Order is preserved within partition keys.

  2. Enrichment — The raw event is enriched with contextual data from the feature store. Customer history, product attributes, risk scores. This data is pre-computed and cached — enrichment is a lookup, not a calculation.

  3. Decision engine — The enriched event hits the decision engine, which applies a combination of deterministic rules and model inference. Rules handle the common cases. Models handle the complex ones. The split is configurable per decision type.

  4. Action dispatch — The decision triggers downstream actions. Approve the transaction. Reroute the shipment. Adjust the parameter. Actions are dispatched asynchronously — the decision is logged before the action completes.

  5. Audit logging — Every decision produces a trace: inputs, enrichment data, rules evaluated, model scores, final decision, and confidence level. Traces are written to an append-only store for compliance and debugging.
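The trace record in step 5 might look like the following sketch. The field names are illustrative assumptions; real schemas vary per deployment. The `frozen=True` flag enforces the write-once property at the type level:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: traces are written once, never modified
class DecisionTrace:
    decision_id: str
    timestamp: str        # event time, ISO 8601
    inputs: dict          # raw event fields
    enrichment: dict      # feature store values used
    rules_evaluated: list # rule ids, in evaluation order
    model_scores: dict    # model name -> score
    decision: str         # e.g. "approve" / "decline"
    confidence: float
```

Capturing the enrichment values actually used (not just their keys) matters: the feature store keeps changing, so the trace is the only record of what the decision actually saw.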

The Feature Store

The feature store is the most underappreciated component. It's what makes enrichment fast enough to fit within the latency budget.

Features are pre-computed on two timescales:

  • Batch features — Updated hourly or daily. Customer lifetime value, historical patterns, aggregate statistics. Computed by batch pipelines and materialized into the store.
  • Streaming features — Updated in real-time. Transaction velocity, recent behavior, current session data. Computed by stream processors and written to the store with sub-second latency.

At decision time, the enrichment stage reads from the store. No computation. Just a key-value lookup. This is how we keep enrichment under 40ms even when the feature set includes hundreds of attributes.
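In sketch form, enrichment reduces to this (the in-memory dict stands in for a real feature store client; keys and feature names are illustrative):

```python
# The dict stands in for a real feature store; in production this would
# be a low-latency key-value client, but the access pattern is the same.
feature_store = {
    ("customer:42", "lifetime_value"): 1830.0,  # batch feature (daily)
    ("customer:42", "txn_velocity_5m"): 3,      # streaming feature (sub-second)
}

def enrich(entity_key: str, feature_names: list) -> dict:
    """Pure lookup, no computation; missing features come back as None."""
    return {name: feature_store.get((entity_key, name))
            for name in feature_names}
```

Missing features returning `None` rather than raising is a deliberate choice here: a partially enriched event can still be decided, a thrown exception cannot.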

Model Serving

Model inference is the most variable component in the pipeline. A model that averages 30ms might spike to 200ms under load. We handle this with three strategies:

Model distillation. Production models are distilled versions of our training models. Smaller, faster, optimized for inference. Accuracy loss is typically under 0.5% — a trade we'll make every time for 3x latency improvement.

Timeout with fallback. If model inference exceeds its latency budget, the decision falls back to rules-only mode. The decision is still made. It's slightly less optimal, but it's on time. We track fallback rates as a key metric — a rising fallback rate means the model needs optimization.

Horizontal scaling with routing. Model replicas are load-balanced with latency-aware routing. Requests go to the replica with the lowest current latency, not round-robin. This smooths out individual replica performance variance.
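A minimal sketch of latency-aware routing, tracking each replica's latency as an exponential moving average (the EMA weight and replica names are illustrative choices; production routers also account for in-flight requests):

```python
class LatencyRouter:
    """Route each request to the replica with the lowest smoothed
    latency, rather than round-robin."""

    def __init__(self, replicas, alpha=0.2):
        self.alpha = alpha                        # EMA smoothing factor
        self.ema_ms = {r: 0.0 for r in replicas}  # optimistic start

    def pick(self):
        """Choose the replica with the lowest recent latency."""
        return min(self.ema_ms, key=self.ema_ms.get)

    def record(self, replica, observed_ms):
        """Fold a new latency observation into the replica's EMA."""
        prev = self.ema_ms[replica]
        self.ema_ms[replica] = (1 - self.alpha) * prev + self.alpha * observed_ms
```

The EMA is what makes this smooth rather than twitchy: one slow response nudges a replica down the preference order instead of blacklisting it.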

Consistency Model

The hardest architectural decision is the consistency model. Here's our approach:

Decision inputs are eventually consistent. Feature store data may be milliseconds stale. For most decisions, this is acceptable. A customer's risk score computed 500ms ago is close enough.

Decision outputs are strongly consistent. Once a decision is made, it's immediately visible to all downstream systems. No lag, no conflict, no "the system approved it but the dashboard shows pending."

Decision traces are immutable. Written once, never modified. This simplifies the audit story enormously. You don't need to track changes to decision records because decision records don't change.

This mixed consistency model gives us the performance of eventually consistent systems where it matters (input data) and the correctness of strongly consistent systems where it matters (outputs and audit).
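The immutability of traces can be enforced mechanically rather than by convention. A sketch of an append-only store interface (the class and method names are illustrative; real deployments back this with an append-only log or object store):

```python
class AppendOnlyTraceStore:
    """Write-once store: a decision id can be written exactly once."""

    def __init__(self):
        self._traces = {}

    def write(self, decision_id, trace):
        if decision_id in self._traces:
            raise ValueError("trace already written; traces are immutable")
        self._traces[decision_id] = trace

    def read(self, decision_id):
        return self._traces[decision_id]
```

Rejecting overwrites at the store boundary is what makes the audit story simple: there is no update path to reason about, audit, or roll back.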

Scaling Patterns

Real-time decision systems need to handle load spikes without degrading. Our scaling approach:

Partition by decision type. Different decision types have different latency requirements and different resource profiles. Transaction approval has a 200ms budget and needs GPU for model inference. Shipment routing has a 2-second budget and is CPU-bound. They run on separate infrastructure with separate scaling policies.

Pre-warm, don't auto-scale. Auto-scaling is too slow for real-time systems. By the time a new instance is ready, the spike might be over. We pre-warm capacity based on predictive load models. Monday morning transaction volume is predictable. Black Friday is predictable. Pre-warm for the expected peak; keep auto-scaling only as a backstop for the unexpected.

Shed load gracefully. When capacity is truly exhausted, drop to rules-only mode system-wide. Model inference is the expensive component. Rules-only mode handles 10x the throughput with slightly lower decision quality. It's the right trade-off when the alternative is dropping requests.
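The mode switch itself can be a one-line policy. A sketch, assuming an illustrative utilization threshold and mode names:

```python
RULES_ONLY_THRESHOLD = 0.9  # fraction of model-serving capacity in use

def choose_mode(in_flight: int, capacity: int) -> str:
    """Return 'full' (rules + model) normally, or 'rules_only'
    system-wide when model-serving capacity is nearly exhausted."""
    utilization = in_flight / capacity
    return "rules_only" if utilization >= RULES_ONLY_THRESHOLD else "full"
```

The threshold sits below 1.0 deliberately: you want to shed the expensive model path before requests start queuing, not after.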

What We Measure

Four metrics define the health of a real-time decision system:

  • P99 latency — Not average, not median. The 99th percentile. If 1% of decisions miss their deadline, that's 500 missed decisions per second at 50K throughput.
  • Decision accuracy — Measured against offline evaluation. How do real-time decisions compare to decisions made with perfect information and unlimited time?
  • Fallback rate — What percentage of decisions bypass model inference? Rising fallback rates indicate infrastructure problems.
  • Trace completeness — What percentage of decisions have complete audit traces? This must be 100%. Anything less is a compliance risk.
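The p99 figure in the list above is a percentile, not an average, and the arithmetic behind the "500 missed decisions per second" claim is worth making explicit. A sketch using the nearest-rank method:

```python
import math

def p99(latencies_ms):
    """99th percentile latency by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 1% of decisions missing their deadline at 50K decisions/second:
missed_per_s = int(0.01 * 50_000)  # 500 missed decisions every second
```

In production you'd compute this from a streaming histogram rather than a sorted list, but the definition is the same: the value that 99% of decisions beat.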

When You Need This

Not every decision needs to be real-time. Batch processing is simpler, cheaper, and perfectly adequate for many use cases. Build a real-time decision system when:

  • Decisions have deadline pressure (transactions, routing, safety)
  • Decision quality degrades with latency (the best answer in 5 minutes is worse than a good answer in 200ms)
  • Decision volume makes human review impossible
  • Regulatory requirements demand individual decision traceability

If your use case checks these boxes, the architecture described here will work. We've proven it in production across industries.

If you're not sure, we should talk. The wrong architecture for the wrong problem is more expensive than either problem alone.