Willy Braun
Libido Sciendi: the desire to understand the world deeply. Deep dives on AI, finance & deeptech.
September 15, 2025 · 10 min · #AI #systems

On Systemic Errors: Probabilistic LLMs at Scale

When your system is wrong 2% of the time but processes a billion requests.

A language model that is 98% accurate sounds impressive. It is impressive, as a technical achievement. But the remaining 2%, at scale, is not a rounding error. It is a systemic failure mode that traditional software engineering has no framework to handle.

This essay examines what happens when probabilistic systems operate at scale, why deterministic intuitions fail, and what it means for the systems we are building.

Determinism and Its Misuse

Traditional software is deterministic. Given the same input, it produces the same output. Every time. This determinism is not just a property of the software. It is the foundation of every practice built around it: testing, debugging, monitoring, auditing, compliance, and trust.

When a deterministic system fails, you can reproduce the failure, trace it to a root cause, fix the cause, and verify the fix. The entire apparatus of software quality assurance assumes determinism. Unit tests, integration tests, regression tests: all depend on the same input producing the same output.

Language models break this assumption. The same prompt can produce different outputs. Not because of a bug, but by design. The model samples from a probability distribution. Each sample is different. This means:

  • You cannot write a test that checks for a specific output. You can check that the output is in a range of acceptable outputs, but defining that range is itself a complex problem.
  • You cannot reproduce a failure deterministically. A user reports an error. You run the same prompt. You get a different (possibly correct) output. The bug is real but unreproducible.
  • You cannot guarantee that a fix works. You change the prompt. The output improves on your test cases. But the model’s behaviour on unseen inputs remains probabilistic. You have shifted the probability distribution, not eliminated the error.
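The shift this forces in testing can be made concrete: instead of asserting one exact output, assert a statistical property of many samples. A minimal sketch, using a stubbed sampling "model" (the function, its outputs, and the acceptance threshold are illustrative assumptions, not a real API):

```python
import random

# Illustrative stand-in for a sampling model: same prompt, varying outputs.
def model(prompt: str, seed: int) -> str:
    rng = random.Random(seed)
    return rng.choice(["Paris", "Paris.", "The capital is Paris", "Lyon"])

# An exact-match test is meaningless here. Instead, measure the rate at
# which outputs fall inside the acceptable range, over many samples.
def acceptance_rate(prompt, accept, n=1000):
    hits = sum(accept(model(prompt, seed=i)) for i in range(n))
    return hits / n

rate = acceptance_rate("What is the capital of France?",
                       accept=lambda out: "Paris" in out)
assert rate > 0.6  # a threshold, not an equality: you test a distribution
```

Note what the assertion buys you: a prompt change that "fixes" a failure only shifts this rate, which is exactly the point of the third bullet above.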

This is not a limitation to be overcome. It is a fundamental property of the system. Probabilistic systems do not have bugs in the traditional sense. They have error distributions. And error distributions behave very differently from bugs at scale.

The Scale Arithmetic

Consider the arithmetic. A system that is 98% accurate processes one million requests per day. That is 20,000 errors per day, 600,000 errors per month, and more than seven million errors per year.

These are not random errors. They are patterned. The model is not uniformly 98% accurate. It is 99.9% accurate on some inputs and 80% accurate on others. The errors cluster around specific input types, specific phrasings, specific domains. This clustering means that certain users, certain use cases, and certain demographics experience far more errors than the average suggests.

A 2% error rate distributed uniformly is manageable. A 2% error rate concentrated on 10% of use cases means that 10% of your users experience a 20% error rate. That is not a minor inconvenience. That is a broken product for one in ten users.
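The arithmetic above, spelled out (the 10% concentration figure is the essay's illustrative scenario, not a measurement):

```python
requests_per_day = 1_000_000
error_rate = 0.02

errors_per_day = round(requests_per_day * error_rate)   # 20,000
errors_per_month = errors_per_day * 30                  # 600,000
errors_per_year = errors_per_day * 365                  # 7,300,000

# The same 2% overall error rate, concentrated on 10% of the traffic,
# becomes a 20% error rate for the users in that slice.
concentrated_share = 0.10
local_error_rate = error_rate / concentrated_share      # 0.20
```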

Systemic Errors at Scale

The deeper problem is that probabilistic errors at scale become systemic. They are not random noise. They are structured patterns that affect real people in predictable ways.

The Correlation Problem

In deterministic systems, errors are independent. One bug does not make another bug more likely. In probabilistic systems, errors are correlated. The same underlying model weakness that causes one error causes many similar errors. Fix the model weakness, and a thousand errors disappear simultaneously. But miss the model weakness, and it manifests across every request that triggers it.

Correlated errors are dangerous because they are invisible in aggregate statistics. The overall accuracy is 98%. But for queries about tax law in Germany, the accuracy is 85%. For queries about employment law in France, it is 92%. For queries about contract law in the UK, it is 99.5%. The aggregate masks the pattern.
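The masking is easy to reproduce numerically. Using the per-domain accuracies above, with hypothetical traffic shares (the shares are my assumption, chosen so the mean comes out near 98%):

```python
# Per-segment accuracy (from the example above) and assumed traffic share.
segments = {
    "tax law (DE)":      {"accuracy": 0.85,  "share": 0.05},
    "employment (FR)":   {"accuracy": 0.92,  "share": 0.10},
    "contract law (UK)": {"accuracy": 0.995, "share": 0.85},
}

# Traffic-weighted aggregate accuracy vs. the worst segment.
aggregate = sum(s["accuracy"] * s["share"] for s in segments.values())
worst = min(s["accuracy"] for s in segments.values())
# aggregate comes out near 0.98 while the worst segment sits at 0.85:
# the headline number hides the pattern entirely.
```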

The Feedback Loop Problem

When probabilistic systems operate at scale, they create feedback loops. Users who receive incorrect outputs adjust their behaviour. Some stop using the system. Some learn to work around its weaknesses. Some, most dangerously, accept the incorrect output as correct and act on it.

Each of these responses changes the system’s effective accuracy:

  • Users who leave remove themselves from the error statistics, making the system appear more accurate than it is.
  • Users who work around errors develop compensating behaviours that mask the system’s failure, making it harder to identify and fix the underlying issues.
  • Users who accept errors propagate those errors into downstream systems and decisions, amplifying the impact of each individual error.

The feedback loop means that the system’s measured accuracy diverges from its real-world impact over time. The statistics improve while the harm compounds.
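A toy simulation makes the survivorship effect visible. The cohort sizes, error rates, and churn probability below are illustrative assumptions, not measurements:

```python
import random

rng = random.Random(0)

# Assumed cohorts: 10% of users hit a model weak spot (20% error rate),
# the rest see a 1% error rate. A user who gets an error churns with p=0.3.
users = ([{"err_rate": 0.20, "active": True} for _ in range(100)]
         + [{"err_rate": 0.01, "active": True} for _ in range(900)])

history = []  # measured daily accuracy over the *surviving* population
for day in range(30):
    errors = total = 0
    for u in users:
        if not u["active"]:
            continue
        total += 1
        if rng.random() < u["err_rate"]:
            errors += 1
            if rng.random() < 0.3:  # churn after a bad experience
                u["active"] = False
    history.append(1 - errors / total)

# Heavy-error users leave first, so measured accuracy drifts upward
# over the month even though the model never changed.
```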

The Audit Problem

Regulatory frameworks assume determinism. Financial audits, medical records, legal proceedings: all require that actions can be traced to specific decisions based on specific information. A probabilistic system that might have produced a different output on a different day creates an audit trail that is, by definition, incomplete.

This is not a hypothetical concern. It is a present reality for every regulated industry deploying LLMs. How do you audit a decision made by a system that would not make the same decision if asked again? The question has no satisfactory answer within existing regulatory frameworks.
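One partial mitigation is to log what can still be captured even when the output cannot be reproduced: the exact input, the model version, and the sampling parameters in force at the time. A minimal sketch; the record schema and field names here are a hypothetical example, not any regulatory standard:

```python
import hashlib
import json
import time

def audit_record(prompt, output, model_version, sampling):
    """Capture what *can* be captured for a probabilistic decision:
    the output itself plus everything that characterizes how it was made."""
    return {
        "timestamp": time.time(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "model_version": model_version,
        "sampling": sampling,  # temperature, top_p, seed if one was set
        "output": output,
    }

rec = audit_record(
    "Is this invoice deductible?",
    "Likely yes, subject to local rules.",
    model_version="example-model-v3",
    sampling={"temperature": 0.7, "top_p": 0.9},
)
log_line = json.dumps(rec)  # append-only log for later audit
```

Such a record does not make the decision reproducible; it makes the irreproducibility explicit and bounded, which is the most an audit trail can honestly claim here.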

Conclusion

The transition from deterministic to probabilistic systems is not an incremental change. It is a paradigm shift that invalidates many of the practices, intuitions, and frameworks that the software industry has developed over sixty years.

Three principles for operating probabilistic systems at scale:

1. Monitor distributions, not averages. The average accuracy is meaningless. What matters is the distribution: where are the errors concentrated? Which users are affected? Which use cases fail? Build monitoring systems that surface the tails, not the mean.
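A monitoring sketch along these lines, with made-up event data: compute per-segment error rates and flag any segment running several times above the overall rate (the flagging factor and segment names are assumptions for illustration):

```python
from collections import defaultdict

def tail_report(events, factor=3.0):
    """events: (segment, is_error) pairs. Returns the segments whose error
    rate exceeds `factor` times the overall rate -- the tails, not the mean."""
    totals, errors = defaultdict(int), defaultdict(int)
    for seg, is_err in events:
        totals[seg] += 1
        errors[seg] += is_err
    overall = sum(errors.values()) / sum(totals.values())
    return {seg: errors[seg] / totals[seg]
            for seg in totals
            if errors[seg] / totals[seg] > factor * overall}

# Illustrative traffic: overall error rate is 1%, but segment "B"
# runs at roughly 33% and gets flagged; segment "A" does not.
events = [("A", 0)] * 980 + [("A", 1)] * 5 + [("B", 0)] * 10 + [("B", 1)] * 5
flagged = tail_report(events)
```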

2. Design for error, not for accuracy. Assume the system will be wrong. Design the product so that errors are detected, surfaced, and correctable. The quality of an AI product is not measured by how often it is right. It is measured by how gracefully it handles being wrong.

3. Build verification into the architecture. Do not bolt verification onto a probabilistic system as an afterthought. Make it a first-class component. Every output should carry a confidence signal. Every high-stakes output should be verified before it reaches the user. The verification layer is not overhead. It is the product.
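One possible shape for such a layer, sketched with stub generate/verify callables. The names, the confidence threshold, and the escalation policy are all assumptions, not a prescribed design: every output carries its confidence; outputs below the threshold must pass an explicit verifier or be escalated.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class VerifiedOutput:
    text: str
    confidence: float
    verified: bool  # True if the output passed an explicit verification step

def generate_verified(prompt: str,
                      generate: Callable[[str], Tuple[str, float]],
                      verify: Callable[[str, str], bool],
                      threshold: float = 0.9) -> VerifiedOutput:
    """Verification as a first-class component: low-confidence outputs
    must clear a checker before they reach the user."""
    text, confidence = generate(prompt)
    if confidence >= threshold:
        return VerifiedOutput(text, confidence, verified=False)
    if not verify(prompt, text):
        raise ValueError("output failed verification; escalate to human review")
    return VerifiedOutput(text, confidence, verified=True)
```

In a real system the verifier might be a retrieval check, a rule engine, or a second model; the architectural point is only that it sits in the request path, not beside it.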

The software industry spent sixty years building tools and practices for deterministic systems. We are now deploying probabilistic systems using deterministic tools. The mismatch is producing failures that are entirely predictable and entirely preventable.

98% accuracy is not 98% reliable. At scale, it is a structured pattern of failure that demands new tools, new practices, and new ways of thinking about what it means for a system to work. The companies that understand this will build products that earn trust. The ones that do not will build products that lose it, slowly, systematically, and at scale.