For decades, reliability had a clear and comforting definition. Given the same input, a system should produce the same output. If it did not, something was broken. This assumption shaped everything from SRE practices to SLAs, testing strategies, and incident response. It worked because the systems we were operating were fundamentally deterministic.
Inference breaks that assumption permanently.
In AI systems, variability is not a defect. It is an intrinsic behavior. Model updates, retraining cycles, context variation, sampling strategies, and even token-level decisions introduce nondeterminism by design. Two identical requests may produce different outputs, and both may be “correct.” Expecting consistency at the output level is no longer just unrealistic; it is the wrong measure entirely.
Yet many organizations are still trying to apply traditional reliability thinking to probabilistic systems. The result is confusion, misaligned metrics, and a growing gap between what systems are doing and how we judge whether they are behaving acceptably.
This is why reliability needs a new definition that includes semantics.
Semantic Reliability
Instead of asking whether outputs are identical, semantic reliability asks whether outputs remain within acceptable boundaries over time: boundaries of intent, accuracy, cost, policy, and trust. The question is no longer “did the system respond the same way,” but “did the system respond in a way that was acceptable for this context, under these constraints.”
That distinction matters more than it may appear.
Operational data already shows that enterprises understand, at least implicitly, that inference behaves differently. Organizations report running or actively evaluating an average of five to seven models simultaneously, driven by delivery concerns such as API compatibility, cost optimization, availability, and failover. These are classic reliability pressures, but they now operate at request time rather than deployment time. Reliability decisions are being made dynamically, often per inference, based on conditions that shift minute by minute.
Once reliability becomes dynamic, determinism is no longer the right goal.
Consider cost. In traditional systems, cost was largely a planning problem. Capacity was provisioned, budgets were forecast, and overruns were investigated after the fact. In inference systems, cost is a runtime signal. Token usage, model selection, and routing decisions directly influence spend on every request. Choosing a larger model when a smaller one would suffice is not just inefficient; it is a reliability failure if it violates cost constraints. Cost containment becomes part of staying “up,” even if latency and availability remain nominal.
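To make the idea of cost as a runtime signal concrete, here is a minimal sketch of per-request model selection under a budget. The model names, prices, and quality scores are hypothetical placeholders, and `choose_model` is an illustrative helper, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float  # hypothetical price in USD
    quality: float             # hypothetical capability score in [0, 1]


def choose_model(options, est_tokens, required_quality, budget_usd):
    """Return the cheapest model that meets the quality bar within budget, else None."""
    candidates = [
        m for m in options
        if m.quality >= required_quality
        and m.cost_per_1k_tokens * est_tokens / 1000 <= budget_usd
    ]
    # Among acceptable models, prefer the cheapest; None signals a cost-bound violation.
    return min(candidates, key=lambda m: m.cost_per_1k_tokens, default=None)


# Hypothetical catalog used only for illustration.
MODELS = [
    ModelOption("small", 0.5, 0.70),
    ModelOption("large", 5.0, 0.95),
]
```

A `None` result is the interesting case: no model satisfies both the quality bar and the budget, so the request is out of bounds before any inference runs, even though nothing is "down."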
The same is true for policy and security. A response that is fast, available, and syntactically valid can still be operationally unacceptable if it leaks sensitive data, violates policy, or drifts outside intended use.
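A rough sketch of what such a check might look like at the response level follows. The two patterns are simplified, hypothetical stand-ins for a real policy definition, which would typically be far broader than a pair of regular expressions.

```python
import re

# Hypothetical, deliberately simple patterns; a real policy would define its own.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def policy_violations(response_text):
    """Return the sorted names of sensitive patterns found in a response."""
    return sorted(
        name for name, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(response_text)
    )
```

A response can return quickly with a success status and still yield a non-empty list here, which is exactly the kind of failure that uptime and latency figures never surface.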
Traditional reliability metrics do not capture this. Error rates, uptime percentages, and latency percentiles tell us whether the system responded. They do not tell us whether the response was appropriate. Inference forces us to confront that gap.
Semantic reliability shifts attention to higher-order questions. Does the system stay on topic? Does it preserve user intent? Does it remain within defined policy boundaries? Does it choose models and routes that balance accuracy with cost? Does it degrade gracefully when conditions change? These are not questions that can be answered with a single golden output or a binary pass-fail test.
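One way to picture the alternative to a golden output is to score a sample of responses against several named bounds and report a compliance rate for each. The sketch below assumes an upstream evaluator has already produced the fields on `Observation`; the field names and thresholds are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    topic_score: float  # hypothetical relevance score in [0, 1] from some evaluator
    cost_usd: float     # spend attributed to this request
    policy_flags: int   # number of policy findings on the response


# Hypothetical bounds; each maps a name to a predicate over one observation.
BOUNDS = {
    "on_topic": lambda o: o.topic_score >= 0.8,
    "within_cost": lambda o: o.cost_usd <= 0.05,
    "policy_clean": lambda o: o.policy_flags == 0,
}


def compliance_rates(observations):
    """Return the fraction of observations that fall inside each bound."""
    total = len(observations)
    if total == 0:
        return {}
    return {
        name: sum(1 for o in observations if check(o)) / total
        for name, check in BOUNDS.items()
    }
```

The result is a profile rather than a verdict: a system can be fully on topic while drifting on cost, and the per-bound rates keep those dimensions separate instead of collapsing them into pass or fail.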
This is why many existing SRE practices strain under inference workloads. Regression testing assumes a stable expected result. Alerting assumes clear failure conditions. Incident response assumes a discrete fault to remediate. In probabilistic systems, drift replaces failure, degradation replaces outage, and risk accumulates quietly before anything “breaks.”
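Because drift shows up as a trend rather than an event, detecting it tends to mean watching a rate over a window instead of waiting for a single failure. The following is a minimal sketch under that assumption; `DriftMonitor`, its window size, and its threshold are hypothetical choices for illustration.

```python
from collections import deque


class DriftMonitor:
    """Track the share of in-bounds responses over a sliding window."""

    def __init__(self, window=100, min_rate=0.95):
        self.results = deque(maxlen=window)  # oldest results fall off automatically
        self.min_rate = min_rate

    def record(self, in_bounds):
        self.results.append(bool(in_bounds))

    def rate(self):
        # With no data yet, report 1.0 so an empty monitor does not look unhealthy.
        return sum(self.results) / len(self.results) if self.results else 1.0

    def drifting(self):
        # Judge only once the window is full, to avoid alerting on a handful of samples.
        return len(self.results) == self.results.maxlen and self.rate() < self.min_rate
```

No individual response "breaks" anything in this model. The signal is the rate sliding below the agreed floor, which is why it can go unnoticed by alerting built around discrete faults.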
Observability Data
Observability data reinforces this shift. Adoption of alerting, automation, insights, root cause analysis, and SLO reporting has surged toward near-universal levels over the last few years, not as an academic exercise but as a necessity. Observability is no longer just a mirror held up to the system. It has become runtime fuel, feeding control loops that continuously adjust behavior. That adjustment is itself part of maintaining semantic reliability.
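As a simplified illustration of observability feeding a control loop, the sketch below takes one observed signal, the recent in-bounds rate, and decides which route to use for the next interval. The route names and thresholds are hypothetical, and the gap between the two thresholds is an assumed design choice to keep the loop from flapping.

```python
def next_route(current, in_bounds_rate, *, min_rate=0.95, recover_rate=0.98,
               primary="primary", fallback="fallback"):
    """One control-loop step: choose the route for the next interval."""
    # Leave the primary route when quality drops below the floor.
    if current == primary and in_bounds_rate < min_rate:
        return fallback
    # Return to it only after the rate clears a stricter recovery threshold.
    if current == fallback and in_bounds_rate >= recover_rate:
        return primary
    return current
```

Called once per interval with the latest measured rate, this turns a dashboard number into an input that changes system behavior, which is the sense in which telemetry becomes runtime fuel.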
The implications for reliability engineering are significant. Reliability can no longer be owned solely by infrastructure teams or defined purely by platform metrics. It spans application delivery, security, finance, and governance. It requires explicit definition of acceptable semantic bounds and the mechanisms to enforce them at runtime. It also requires acknowledging that some variability is healthy, while some is dangerous, and that the difference is contextual.
Importantly, semantic reliability does not mean lowering standards. It means raising them. It demands clarity about intent, constraints, and tradeoffs. It forces organizations to articulate what “acceptable” actually means in environments where perfection is neither possible nor desirable.
Inference has already arrived as a production workload. The systems around it are being built, whether organizations recognize the shift or not. The risk is not that AI systems are unreliable. The risk is that we keep measuring reliability in ways that no longer reflect reality.
Semantic reliability is emerging as the new standard. The only open question is how long it will take organizations to start measuring it deliberately, rather than discovering its absence the hard way.

