The Curse of Non-Determinism: Putting agentic AI into production
The pattern repeats almost every time we walk into a client engagement. The pilot looked great. The model was accurate, the retrieval was clean, the demo landed. Then the system went into production and something began to go wrong that nobody could quite name. The same question started returning different answers depending on who asked it. The same ticket got escalated on Tuesday and resolved silently on Wednesday. The same contract clause got summarized one way for the legal team and another for sales. Nothing was wrong, exactly. But nothing was the same twice.
The instinct in this situation is to ask whether the model is accurate enough. It's the wrong question. Accuracy is a property of predictions, and production AI systems don't make predictions; they enact policies: they retrieve, they reason, they call tools, they remember. The right question, the one we've come to ask first in every engagement, is about behaviour: not whether the system gets a single answer right, but how its behaviour is distributed across runs, prompts, contexts, and time.
We've come to call this the curse of non-determinism, and we think it's the central engineering problem of agentic AI in production.
Your AI agent doesn't have an accuracy problem. It has a variance problem, and accuracy metrics are structurally unable to see it.
We've been here before
There's a precedent worth naming. Two decades ago, machine learning ran into the curse of dimensionality: as you add features to a model, the feature space grows exponentially relative to your data, sample sparsity destroys distance metrics, and generalization quietly collapses. The lesson the field eventually internalized was structural: don't add features blindly. Constrain the space through regularization, feature selection, and inductive bias.
Agentic AI has its own version of this problem. Every degree of freedom we add to an agent, whether it's a new tool, a longer context window, a memory store, a retrieval layer, or an additional model, compounds the behavioural space the system can occupy. The math is different; this is not a statistical estimation problem. But the structural lesson rhymes. More freedom without architecture is not more capability. It is more variance.
The discipline this requires doesn't yet have a name in most organizations. It should. We call the thing being managed behavioural variance, and the rest of this piece is about how to measure it, where it comes from, and what it costs to govern.
More freedom without architecture is not more capability. It is more variance, and variance, not accuracy, is what production breaks on.
The frame shift
For a decade, enterprise machine learning was about classifiers. The output space was small, ground truth was assemblable, and the natural metric was accuracy. Generative AI broke this regime in three ways at once.
| The classifier era | The generative, agentic era |
|---|---|
| Closed output space, enumerable | Open output space, no enumerable label set |
| Held-out test set with known labels | Inputs are contextual: retrieval, memory, tools |
| Single-shot scoring against ground truth | The system acts; tool calls have side effects |
| Accuracy is the natural and sufficient metric | Behaviour is a distribution, not a prediction |
| Failures are statistical, not behavioural | Failures are policy failures, often correlated in time |
The output space became open. There is no enumerable list of acceptable answers to "summarize this contract." The input became contextual. Retrieved documents, prior conversation, and tool results mean the same nominal prompt rarely produces the same effective input. And the system started acting. A tool call is not a prediction; it's an event in the world, often with side effects, sometimes irreversible.
What we need to characterize, in this regime, is not a prediction but a policy, a mapping from situations to distributions over what the system says and does. Two policies can have identical pointwise accuracy on a held-out evaluation set and behave very differently in production. One drifts gracefully when a document is updated; the other doesn't. One refuses cleanly when it's outside its competence; the other confabulates. One produces consistent answers to paraphrased questions; the other doesn't.
Two systems with the same accuracy can have entirely different behavioural envelopes. Accuracy cannot tell them apart. Variance can.
Where variance actually comes from
The first useful move is to stop talking about non-determinism as if it were one thing. In production systems we see at least five mechanistically distinct sources of variance, and confusing them is the most common diagnostic error we encounter.
The model
Even with temperature set to zero, language models exhibit small variations from floating-point and batching effects. With any meaningful temperature, outputs diverge. This is the layer most teams notice first and try to control with decoding parameters. It is also, in our experience, rarely the dominant source of variance in a production system.
The prompt
Identical-meaning prompts can produce very different outputs. Researchers have shown LLM benchmark performance shifting by tens of points based on changes a human would consider purely cosmetic: a different separator character, different capitalization, a different ordering of multiple-choice options. This is not fixable by lowering temperature. It is a property of how the model has learned to weight surface features.
The retrieval
In RAG systems, the answer is at least as much a function of what was retrieved as of what the model said about it. Two retrievals containing the same critical document can yield different answers if the document appears at different positions in the context window, the well-documented "lost in the middle" effect. Embedding model, chunking strategy, reranker, and metadata filters all inject variance here.
The tool policy
Once an agent can call tools, the variance changes character. Two runs may select different tools, supply different parameters, accept or reject intermediate results, and terminate at different points. Both trajectories may succeed, but they will have different costs, latencies, and side effects. This is the variance that converts most directly into operational risk, because tools have consequences.
Time
A system that was within tolerance on Monday may not be on Friday. A vendor silently updated the model. The embedding index was rebuilt with a new model, invalidating prior chunks. The corpus changed. New prompt templates shipped. None of this is sampling variance; it is drift, and it is the variance most likely to surprise an organization that wasn't watching for it.
These sources interact. A small prompt change alters the rewritten retrieval query, which alters the retrieved context, which alters tool selection, which alters the memory written for the next turn.
When something goes wrong in production, the question isn't "Was the model right?" It's "Which layer's variance produced this outcome?" Most teams can't answer, because their observability collapses all five layers into a single trace of "what the model said."
Making variance measurable
Calling for a "variance budget" without saying how to measure variance is the kind of governance theater the AI field already has too much of. The measurement question has to be operational.
For tasks with structured outputs, such as classifications, extractions, or code, the natural object is the agreement rate across N runs: the fraction of runs that produce the modal output, or the entropy of the empirical output distribution. For free-form generation, where agreement is ill-defined, we fall back on pairwise semantic similarity across runs, with the caveat that these metrics are themselves noisy; track them as trends on a fixed evaluation set rather than as absolute values. For tasks with verifiable outcomes, such as a tool call that either succeeds or fails or a number with a known range, measure outcome variance directly. This is the most informative metric, because it integrates over all the intermediate variation that doesn't actually affect what happened.
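As a concrete illustration, here is a minimal sketch of the agreement-rate and entropy calculation for structured outputs, assuming a `run_agent` callable that returns a hashable output per run; the name and interface are placeholders for whatever your pipeline exposes, not a prescribed API.

```python
# Minimal sketch: agreement rate and output entropy across N runs of one task.
# Assumes outputs are structured and hashable (labels, canonicalized extractions);
# `run_agent` is a placeholder for your own pipeline call.
import math
from collections import Counter

def behavioural_stats(run_agent, task, n_runs=20):
    outputs = [run_agent(task) for _ in range(n_runs)]
    counts = Counter(outputs)
    modal_count = counts.most_common(1)[0][1]
    agreement_rate = modal_count / n_runs            # fraction of runs producing the modal output
    probs = [c / n_runs for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)  # 0 bits = perfectly repeatable
    return {"agreement_rate": agreement_rate, "entropy_bits": entropy}
```

Tracked as a trend on a fixed evaluation set, these two numbers are usually enough to show whether a prompt or retrieval change has widened the behavioural envelope.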
Some measurements that decide the rest

Calibration
The extent to which a system's stated confidence matches its empirical accuracy. A system wrong 30% of the time that knows when it is wrong is governable: you escalate the uncertain cases. A system wrong 10% of the time with uniform confidence is not. Calibration is rarely satisfied off-the-shelf and has to be engineered. (A minimal sketch of one such calculation follows these measures.)

Behavioural testing
Targeted tests probing specific behaviours under controlled perturbations. Does paraphrasing change the answer? Does adding a negation flip the conclusion? For agentic systems: does the agent call tools prematurely, agree sycophantically, or break under retrieval fragility? Probe failure modes; don't sample a "representative" distribution that doesn't exist.

Outcome variance
For verifiable tasks, measure the spread of results, not generations. Two trajectories that close the same ticket count as the same outcome regardless of how differently they got there. This integrates over the noise that doesn't matter and exposes the noise that does.

Drift, by layer
Track agreement rate, calibration, and outcome variance over time, attributed to the layer that changed: model version, prompt template, embedding index, corpus snapshot. A drift signal that cannot be attributed is a drift signal that cannot be acted on.
A note on LLM-as-judge, since it is now everywhere. Using a strong LLM to grade another LLM's outputs is useful, but it has known biases: judges prefer the first answer presented, prefer longer answers, prefer outputs from their own model family. These biases don't invalidate the technique, but they require that judges be calibrated against human review on a sample, that prompts mitigate known biases (randomized order, length-controlled comparisons), and that the judge itself be evaluated for variance. Treating an LLM judge as ground truth is a category error.
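One of those mitigations, randomized order, can be made slightly stronger by scoring every pair in both orders and discarding verdicts that flip with position. A minimal sketch, assuming a `judge` callable that returns "A" or "B" for whichever answer it prefers; the interface is hypothetical, not a specific library's API.

```python
# Sketch of one judge-bias mitigation: evaluate each pair in both orders and
# keep a verdict only when the judge agrees with itself under the swap.
def debiased_preference(judge, question, answer_1, answer_2):
    verdict_forward = judge(question, answer_1, answer_2)   # answer_1 shown first
    verdict_reversed = judge(question, answer_2, answer_1)  # answer_2 shown first
    reversed_mapped = {"A": "B", "B": "A"}[verdict_reversed]  # map back to original labels
    if verdict_forward == reversed_mapped:
        return verdict_forward        # consistent under the order swap
    return "tie"                      # position-dependent verdicts are discarded
```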
A system that knows when it doesn't know is governable. One that doesn't, isn't, and no amount of accuracy compensates.
What's actually new in the architecture
A genre of AI commentary now claims that agentic systems require an entirely new enterprise architecture. Some of this is true. Much of it is the rediscovery of separation of concerns under a new label. The intellectually honest move is to name which is which.
What is not new: keeping commitments, financial limits, identity, and audit in deterministic systems while a probabilistic component handles language and presentation. This pattern is decades old. Every chatbot worth its salt in 2005 routed authoritative actions through a rules engine. Calling this "controlled autonomy" rebrands sound design as innovation.
What is new is the surface area, in three respects.
First, the capability surface scales super-linearly with tools. A traditional system exposes a fixed operation set, each invoked through known code paths by known callers. An agent with k tools can compose them into on the order of k^n distinct trajectories of length n, and the choice among trajectories is made by a probabilistic policy whose decision boundary cannot be statically inspected. This is why "approve the application" is no longer the right unit of architectural review. The unit is the capability: what can be invoked, by whom, with what parameters, with what side effects. The old idea of capability-based security, dating to the 1960s, turns out to be the right primitive, but its application to LLM agents is genuinely new and largely unsolved.
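To make the shift in the unit of review concrete, here is an illustrative sketch of a capability record sitting in front of the agent's proposed tool calls, with a deterministic authorization check. The field names and the policy are assumptions for illustration, not a reference design.

```python
# Illustrative sketch: the capability, not the endpoint, as the unit of review.
# Field names and policy are hypothetical, not an established schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    name: str                      # e.g. "issue_refund"
    allowed_callers: frozenset     # which agent roles may invoke it
    parameter_bounds: dict         # e.g. {"amount_eur": (0, 200)}
    side_effects: str              # "none" | "reversible" | "irreversible"
    requires_human_approval: bool  # a gate that lives outside the probabilistic layer

def authorize(cap: Capability, caller: str, params: dict) -> bool:
    """Deterministic check applied to every tool call the agent proposes."""
    if caller not in cap.allowed_callers:
        return False
    for key, (low, high) in cap.parameter_bounds.items():
        value = params.get(key)
        if value is None or not (low <= value <= high):
            return False
    if cap.side_effects == "irreversible" and not cap.requires_human_approval:
        return False               # irreversible actions must carry a human gate
    return True
```

The point of the sketch is the placement: the check is deterministic, lives outside the model, and is reviewed per capability rather than per endpoint.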
Second, the input is no longer trusted. Classical systems treat user input as untrusted but treat the surrounding context (configuration, retrieved data, tool results) as trusted. LLM agents have to treat all of it as adversarial. The phenomenon now called indirect prompt injection demonstrates that an attacker who controls a document the agent retrieves, or a webpage it reads, can cause the agent to execute attacker-chosen actions. There is no classical analog for a system whose control flow can be hijacked by the data it processes. This isn't a governance problem solvable by policy; it's an open security problem.
Third, behavior depends on the corpus. In RAG systems, the effective program is the union of the prompt, the model weights, the retrieval index, and the documents in the corpus at query time. Change any of these and behavior changes. The corpus is typically managed by a different team than the application, on a different cadence. The architectural implication is that the document lifecycle becomes part of the application lifecycle. Documents need versioning, freshness SLAs, classification, and rollback, treated with the same seriousness as code. Most organizations are not ready for this, because most organizations had not solved enterprise document governance before AI made it urgent.
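As a sketch of what "corpus as part of the application lifecycle" can mean operationally, here is a minimal versioned manifest entry with a named owner and a freshness SLA, checked at index-build time; the fields are assumptions for illustration rather than an established standard.

```python
# Sketch: a versioned corpus manifest entry, treated with the seriousness of code.
# Field names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class CorpusDocument:
    doc_id: str
    version: str                 # bumps whenever content changes, like a code release
    owner: str                   # named team accountable for accuracy and rollback
    classification: str          # e.g. "public" | "internal" | "restricted"
    last_reviewed: datetime
    freshness_sla: timedelta     # maximum age before mandatory re-review

def stale_documents(manifest):
    """Documents past their freshness SLA: candidates for re-review or removal from the index."""
    now = datetime.now(timezone.utc)
    return [d.doc_id for d in manifest if now - d.last_reviewed > d.freshness_sla]
```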
The right framing is therefore neither "this is just classical architecture" nor "everything is new." Determinism around commitments is old. Capability-bounded agents, prompt-injection defense, and corpus-as-code are new. The literature is still catching up.
The honest cost
The governance prescriptions popular in enterprise AI commentary (comprehensive observability, retrieval governance, tool permissioning, human-in-the-loop review, continuous evaluation) are not free. We've watched these costs blow up budgets often enough to think the field needs to talk about them honestly.
| Case · Anonymised | Mid-volume customer-ops agent · EU enterprise |
|---|---|
| Use case | Tier-1 ticket triage and resolution agent across email, chat, and CRM tools. |
| Volume | ~100,000 tasks/day projected at full rollout. Pilot ran at 8,000/day with strong demo numbers. |
| Hidden line items | Eval compute (regression on every prompt change) · 5% sampled human review at €0.40/task floor · variance observability storage. |
| What we found | Governance cost per task exceeded value per task at the projected volume. Calibration was not good enough to safely sample fewer cases. |
| Outcome | Use case re-scoped to the top-3 ticket categories where outcome variance was verifiable. Remaining categories stayed with humans, augmented but not replaced. |
Evaluation has compute cost. Running a meaningful regression suite per prompt change, against multiple model versions, with self-consistency sampling, can dominate the model bill itself. Teams that propose continuous evaluation rarely budget for it.
Human-in-the-loop has throughput cost. A human reviewer at €50 per hour reviewing thirty-second tasks gives an effective floor of roughly €0.40 per task in labor. For a system handling 100,000 tasks a day, even five percent sampling costs €2,000 daily. For low-margin workflows this can defeat the use case. The alternative, review only low-confidence outputs, depends on the calibration property that, as we noted, is rarely satisfied off-the-shelf.
Observability has storage and engineering cost that lacks a corresponding visible feature, which is precisely why teams underestimate it. Retrieval governance has organizational cost: source classification, freshness monitoring, citation policy, and access-aware retrieval are operating procedures requiring humans to maintain them, not features to configure.
The honest implication is that some workflows shouldn't run on LLM-based systems. If the task volume is too low to amortize evaluation, the consequences are too high to tolerate residual variance, and the upside is incremental rather than transformational, the variance budget cannot be made to pencil out.
This is not a failure of AI strategy. It is the framework working as intended.
Its job is to make this conclusion legible, not to greenlight every use case.
A useful diagnostic: compute the ratio of value per task to governance cost per task. When this ratio is below one, the use case fails the budget regardless of how good the model is. We've come to insist on this calculation before any client engagement that involves moving an agent into production. It changes which projects get built and which ones get redesigned.
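A back-of-envelope version of that ratio, using the €0.40 review floor and 5% sampling from the case above; every other figure is a placeholder you would replace with your own numbers.

```python
# Back-of-envelope sketch of the value-to-governance-cost ratio.
# All inputs are per-task euros; placeholders are marked as such.
def governance_ratio(value_per_task, eval_compute_per_task,
                     review_cost_per_reviewed_task, review_sample_rate,
                     observability_per_task):
    governance_cost = (eval_compute_per_task
                       + review_cost_per_reviewed_task * review_sample_rate
                       + observability_per_task)
    return value_per_task / governance_cost

ratio = governance_ratio(value_per_task=0.03,                 # placeholder
                         eval_compute_per_task=0.01,          # placeholder
                         review_cost_per_reviewed_task=0.40,  # labor floor from the text
                         review_sample_rate=0.05,             # 5% sampling from the text
                         observability_per_task=0.005)        # placeholder
print(f"value / governance cost = {ratio:.2f}")  # below 1.0: redesign, don't ship
```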
Where to start
If we had to compress everything we've learned into the first questions we ask before scaling an agentic system, they would be these.
Where is variance valuable, tolerable, unacceptable?
Some processes need creative variance: sales conversations, marketing copy, hypothesis generation.
Others need almost none: pricing, contractual commitments, financial execution.
Decide workflow by workflow, not at the level of the whole system.
What's in the deterministic core?
Which decisions, limits, and commitments must remain outside the probabilistic layer entirely? Get this wrong and the system is unsafe at any speed.
Which layer's variance hurts first?
For a knowledge assistant, retrieval variance usually matters more than sampling variance.
For an agent that closes tickets, tool-policy variance matters most. The answer determines what to instrument first.
Are we measuring calibration, or only accuracy?
A system that knows when it doesn't know is governable. One that doesn't, isn't. Calibration is a first-class metric, not a derived one.
What does the cost equation say?
Honestly. With real numbers, including the human review labor and the evaluation compute, not just the model API bill. If the ratio is below one, redesign, don't ship.
Who owns the corpus?
In RAG systems, the corpus is part of the program. Named owner, named freshness SLA, named rollback path. If the corpus has no owner, the agent has no contract.
Specifying the envelope
Variance is not the enemy. Variance is often where AI creates value: a useful sales agent adapts, a useful analyst surfaces a non-obvious framing, a useful coding assistant explores. The goal is not deterministic AI. The goal is to make the envelope of behaviour, the region within which adaptation is allowed, explicit, observable, and accountable, while keeping commitments and irreversible actions outside that envelope.
What the accuracy era taught teams to do:

- Single point estimate on a held-out set
- Eval that ignores prompt and retrieval drift
- Confidence treated as a UI affordance
- Tool calls reviewed like API calls

What managing the envelope requires instead:

- Variance attributed by layer, tracked over time
- Calibration treated as a primary metric
- Capabilities, not endpoints, as the unit of review
- Corpus governed with the seriousness of code
The companies that build durable AI capability won't be the ones that adopt fastest or the ones that refuse longest. They'll be the ones that learn to specify the envelope precisely, measure when the system leaves it, pay the cost of governance where the value justifies it, and walk away where it doesn't.
The right boardroom question is no longer "Is the model accurate enough?" It is: "Have we specified the envelope, and can we tell when the system leaves it?"
That, in our experience, is the actual work.
Manage the cloud, not the point
Two systems with the same accuracy can have entirely different envelopes. Specify and measure the envelope.
Knowing when not to know
A calibrated system is governable. An over-confident one is not, regardless of how often it is right.
Pencil out the governance bill
Eval compute, human review, observability, corpus ops: if value per task does not exceed governance cost per task, redesign or walk away.
Specifying the envelope before your agent reaches production
We help organisations design, instrument, and govern agentic AI systems: variance budgets, calibration audits, capability-bounded tool design, and the operating model that lets AI products survive contact with real traffic without surprising the people accountable for them.