Google AI: Mathematically Wrong, Empirically Right
Why Google's approach works despite breaking the rules.
Jeff Dean gave a talk at Stanford in late 2025 that deserved more attention than it received. No product launches, no benchmark wars. Instead, a quiet tour through the engineering decisions that built Google’s AI infrastructure over two decades. The thesis, once you stripped away the academic politeness, was striking: Google had repeatedly made choices that violated theoretical best practices and won anyway.
This essay unpacks why.
The Pragmatist’s Confession
The conventional narrative places Google as a theory-first organisation. PageRank was, after all, a linear algebra insight. But Dean’s talk revealed something different: a culture of pragmatic empiricism where theoretical elegance was routinely sacrificed for systems that worked at scale.
The pattern repeated across every major infrastructure decision. MapReduce was not the optimal distributed computing framework. BigTable was not the cleanest database design. TensorFlow was not the most mathematically rigorous ML framework. Each was the system that could be built, deployed, and debugged by real engineers operating under real constraints. The gap between theoretical optimality and engineering reality is not a bug. It is the defining feature of systems that actually ship.
Dean’s implicit argument: when you operate at Google’s scale, the cost of being theoretically perfect but six months late exceeds the cost of being theoretically imperfect but deployed today. Time-to-deployment is itself a variable in the optimisation function, and most theorists leave it out.
Emergent Structure
The most provocative claim in Dean’s talk concerned emergent capabilities. Google’s large language models exhibited behaviours that no one designed and no one predicted. Chain-of-thought reasoning. Few-shot learning. Cross-lingual transfer. These capabilities were not engineered. They emerged from scale.
This is mathematically uncomfortable. Classical machine learning theory says you should be able to predict model behaviour from architecture and training data. Emergent capabilities violate this assumption. They suggest that sufficiently large neural networks develop internal structure that we do not yet have the mathematics to describe.
Dean’s response was characteristically pragmatic: use the capabilities, study them empirically, and let the theory catch up. This inverts the normal scientific workflow where theory precedes application. But it mirrors how most transformative technologies actually developed. Thermodynamics was formalised decades after the steam engine. Aerodynamics came after the Wright brothers flew.
The history of technology is littered with things that worked before anyone could explain why.
Hardware as Forcing Function
A thread running through Dean’s entire career at Google is hardware-software co-design. The TPU (Tensor Processing Unit) was not built to run existing algorithms faster. It was built to make certain algorithms practical that were previously impractical. The hardware did not serve the software. The hardware changed what software was worth writing.
This is a profound inversion. Most companies treat hardware as a given and optimise software within those constraints. Google treated hardware as a design variable. When matrix multiplications became the bottleneck, they built a chip optimised for matrix multiplications. When memory bandwidth became the constraint, they redesigned the memory hierarchy.
The result: Google’s AI stack is not a collection of independent components. It is a vertically integrated system where each layer was designed with knowledge of the layers above and below. This makes it nearly impossible to replicate by assembling best-in-class components from different vendors. The moat is not any single component. The moat is the co-design.
Preserve, Don’t Compress
One of Dean’s most counterintuitive positions concerned information preservation. Classical information theory, following Shannon, emphasises compression: extract the signal, discard the noise. Modern neural networks do something closer to the opposite. They preserve vast amounts of seemingly redundant information and let the model learn what matters.
This approach is mathematically wasteful. A well-designed compression scheme should outperform brute-force preservation. But brute-force preservation has an advantage that compression lacks: it preserves information you did not know you needed.
When Google trained models on trillions of tokens, they included data that any reasonable curator would have discarded. Duplicate web pages. Near-identical product descriptions. Slightly different phrasings of the same fact. This redundancy, far from being waste, turned out to be the substrate from which emergent capabilities grew. The model needed multiple perspectives on the same concept to develop robust representations.
The lesson generalises beyond AI: premature compression destroys options. When you do not yet know what matters, keeping everything is a rational strategy.
The Sparsity Inversion
Perhaps the most technically significant insight from Dean’s talk was about sparsity. Dense models activate every parameter for every input. Sparse models, like Google’s Mixture-of-Experts architecture, activate only a fraction. The mathematics of sparse models is worse. The empirical results are better.
Why? Because sparsity provides something that dense models cannot: specialisation without fragmentation. Different expert sub-networks learn to handle different types of inputs. The routing mechanism learns which expert to consult for which query. The result is a model that behaves like a committee of specialists coordinated by a generalist dispatcher.
This is mathematically wrong in a specific sense. The optimal solution to most loss functions is a dense model that uses all available parameters. But the optimal solution is also untrainable, undeployable, and unusable at scale. The sparse solution sacrifices theoretical optimality for practical capability. It trades a few points on benchmarks for the ability to actually serve billions of queries per day.
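The committee-of-specialists mechanism is easy to see in miniature. Below is a toy sketch of top-k expert routing, not Google’s implementation; the array shapes and the single-layer “experts” are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, expert_weights, gate_weights, k=2):
    """Route one input through the top-k of n experts (toy sketch).

    x: (d,) input vector
    expert_weights: (n_experts, d, d) one linear layer per expert (illustrative)
    gate_weights: (n_experts, d) router that scores experts for this input
    """
    # The router scores every expert, but we only *run* the top-k.
    scores = gate_weights @ x                       # (n_experts,)
    top_k = np.argsort(scores)[-k:]                 # indices of chosen experts
    gate = np.exp(scores[top_k] - scores[top_k].max())
    gate /= gate.sum()                              # softmax over chosen experts only

    # Only k of n experts do any work: compute scales with k,
    # while total capacity (parameter count) scales with n.
    out = np.zeros_like(x)
    for g, idx in zip(gate, top_k):
        out += g * (expert_weights[idx] @ x)
    return out, top_k
```

With n = 8 experts and k = 2, only a quarter of the parameters touch any given input, which is the same asymmetry, at toy scale, as activating 37 billion of 671 billion parameters per query.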
DeepSeek’s R1 model validated this approach from outside Google: 671 billion parameters, only 37 billion active per query. The sparse paradigm is now the consensus architecture for frontier models, even though the dense paradigm remains theoretically superior.
Distillation Economics
Dean discussed knowledge distillation, the process of training smaller models to mimic larger ones, in terms that revealed Google’s strategic calculus. A frontier model costs hundreds of millions to train. A distilled model costs thousands. But the distilled model captures 80-90% of the frontier model’s capability.
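The mechanics of distillation are simple enough to sketch. The standard soft-target formulation (Hinton et al.) trains the student against the teacher’s temperature-softened output distribution; the temperature value here is an illustrative choice, not a quoted Google setting.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions: the soft-target term of a distillation objective."""
    p = softmax(teacher_logits / T)   # teacher's softened beliefs
    q = softmax(student_logits / T)   # student's softened beliefs
    # The T**2 factor keeps this term's gradient magnitude comparable
    # to a hard-label loss when the two are mixed.
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)))
```

Softening with T > 1 exposes the teacher’s relative confidence across wrong answers, the “dark knowledge” that a hard label throws away, which is much of why the student can recover so large a fraction of the teacher’s capability.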
The economics are staggering. If you can serve 90% of queries with a model that costs 1% as much to run, the cost of those routed queries drops by two orders of magnitude, and the blended cost per query falls nearly tenfold, dominated by the residual frontier traffic. This is not an incremental improvement. It is a phase transition in the economics of intelligence.
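The arithmetic is worth making explicit. The cost figures below are illustrative assumptions, not Google’s actual numbers: the routed queries get 100x cheaper, while the blended average is capped by the residual frontier traffic.

```python
# Illustrative, normalised costs (assumptions, not real pricing).
frontier_cost = 1.0     # cost per query on the frontier model
distilled_cost = 0.01   # distilled model serves a query at ~1% of that
distilled_share = 0.90  # fraction of queries the distilled model can handle

blended = distilled_share * distilled_cost + (1 - distilled_share) * frontier_cost
# blended = 0.9 * 0.01 + 0.1 * 1.0 = 0.109
reduction = frontier_cost / blended
# reduction ≈ 9.2x: the 10% of traffic still on the frontier model
# dominates the average, even though 90% of queries got 100x cheaper.
```

The design consequence is that the router matters as much as the distilled model: every percentage point of traffic kept off the frontier model moves the blended cost far more than further shrinking the student does.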
But distillation has a deeper strategic implication. The frontier model is the seed. The distilled models are the crop. You need the frontier model to exist, but you do not need most users to run it. This creates a natural market structure: a few organisations bear the cost of frontier research, and the entire ecosystem harvests the results through distillation.
Google’s strategic position becomes clear: they are not trying to win the frontier model race for its own sake. They are trying to ensure that when distillation happens, the seed model is theirs.
The Capability Trajectory
Dean presented internal data on capability improvements that told a story the public benchmarks miss. Public benchmarks measure performance at a point in time. Google’s internal metrics track the rate of improvement. And the rate of improvement is accelerating.
This is not a claim about scaling laws. Scaling laws describe how performance improves with compute. Dean was describing something different: how quickly the team learns to extract more capability from the same compute budget. Algorithmic improvements, training tricks, data curation techniques, hardware optimisations. Each contributes independently, and they compound.
The implication is that the gap between frontier and non-frontier organisations is widening, not because frontier organisations are spending more, but because they are learning faster. Each training run generates insights that make the next training run more efficient. This is a learning curve advantage, and learning curve advantages compound exponentially.
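The compounding claim can be made concrete with toy numbers. The per-run gains below are invented for illustration; nothing in the talk put figures on them.

```python
# Toy illustration, not real data: two labs with the same compute budget,
# differing only in how much efficiency each training run teaches them.
def relative_efficiency(runs, gain_per_run):
    """Relative efficiency if each training run makes the next one
    `gain_per_run` better (compounding multiplicatively)."""
    return (1 + gain_per_run) ** runs

fast_learner = relative_efficiency(10, 0.30)  # learns 30% per run (assumed)
slow_learner = relative_efficiency(10, 0.10)  # learns 10% per run (assumed)
# 1.3**10 ≈ 13.8 vs 1.1**10 ≈ 2.6: a 3x difference in learning rate
# becomes a ~5x capability gap after ten runs, and it keeps widening.
```

This is the structure of a learning-curve moat: the advantage lives in the exponent, not the base, so it cannot be closed by a one-time spending increase.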
The Open Question
Dean’s talk ended without a conclusion, which was itself revealing. The honest answer to “where is this going?” is that no one knows. Not because the trajectory is uncertain, but because the capabilities that emerge at the next scale are unpredictable by definition.
What Dean did establish is a methodology: build the infrastructure, run the experiments, observe what emerges, and iterate. This is not how science is supposed to work. Science is supposed to predict before it observes. But frontier AI research has inverted this: observation precedes theory, and engineering precedes science.
The discomfort this creates in the academic community is palpable. Reviewers want proofs. They want guarantees. They want theoretical justification for architectural choices. Google’s response, implicit but unmistakable, is that the empirical results are the justification. The models work. The theory will follow.
Whether this pragmatism is wisdom or hubris depends on a question that cannot yet be answered: are the emergent capabilities we observe robust, or are they brittle in ways we have not yet discovered? Google is betting on robustness. The bet, so far, has paid off. But “so far” is doing a lot of work in that sentence.
Mathematically wrong. Empirically right. And the empirical results, for now, are the ones that matter.