Bitter Lesson, Misunderstood: Why Scale Alone Breaks Software Engineering

Björn Stansvik
Scalability
AI & Software Engineering
TL;DR:

The Bitter Lesson is often misread as “scale beats structure.” A more accurate reading is that probabilistic-only methods outperform rules-only methods only in domains where errors can be averaged out. Software engineering is not one of those domains.

LLMs deliver speed and fluency, but scale alone does not guarantee correctness, security, or long-term coherence as system complexity grows. Prompt-to-app workflows often look right, then fail under real-world constraints.

The right answer is not rules instead of probabilities, or probabilities instead of rules, but a hybrid system: probabilistic models for interpretation and exploration, deterministic mechanisms for structure, invariants, and guarantees, with humans defining and governing the constraints.

Scale gives velocity. Structure gives durability. Reliable software needs both.

What Richard Sutton’s Bitter Lesson actually said

Richard Sutton’s “The Bitter Lesson” is often cited, but less often read carefully. The essay is not a manifesto against structure or human knowledge. It’s an observation drawn from decades of AI research: across multiple domains, approaches that relied on general learning mechanisms and that scaled with compute and data tended to outperform approaches built around handcrafted rules and heuristics.

The key detail is where this pattern holds most strongly. Sutton’s examples largely take place in environments with abundant feedback and repeated trials, where progress can be measured statistically and errors corrected through iteration. In those settings, simple methods that scale often beat systems that try to encode intelligence directly.

That observation is descriptive, not universal. It describes what has worked historically for specific classes of problems. It does not claim that all intelligent systems should discard structure, constraints, or explicit models. And it does not argue that learning alone is sufficient in domains where correctness, guarantees, and long-term coherence are core requirements.

This distinction matters in software development because software is not a perception task. It is construction under constraints: interfaces, invariants, contracts, and downstream blast radius. Feedback is often sparse and delayed. Many failures are only visible under load, at integration boundaries, or in edge-case production paths. The environment does not “average out” mistakes. It escalates them.

The Bitter Lesson is about what scales under the right conditions. It is not a commandment to remove structure everywhere. Confusing those two ideas is what set the stage for the problems we now see when scale-first Transformer-based models are pushed into domains like software engineering, where reliability is not optional.

Richard Sutton’s “The Bitter Lesson” is often reduced to “scale beats structure.” In software engineering, scale without constraints turns speed into fragility.

How the industry misapplied the lesson

As the Bitter Lesson gained prominence, its nuance was gradually compressed into a slogan: rules are brittle, learning is superior, scale solves complexity. Over time, that slogan hardened into an assumption rather than a hypothesis.

You can see the pull of this thinking in today’s AI-assisted development tooling. The surface-level success is real: many tools can quickly generate plausible-looking code, scaffold projects, and accelerate iteration. The temptation is to treat that fluency as a substitute for architecture, domain modeling, and explicit constraints. Structure starts to look like friction instead of protection.

In practice, this is most evident in prompt-to-app workflows. The output often looks correct on the surface. Files exist. Endpoints respond. UI states render. But beneath the surface, key properties that make software durable are frequently missing or under-specified: coherent domain boundaries, explicit business rules, enforced invariants, and contracts that remain stable as the system evolves. When the system breaks, the cost is not just the bug; it’s that there is no clear structure to reason about, so repair turns into archaeology.

This is an overcorrection. Learning-based systems are being asked to replace not just brittle heuristics, but the mechanisms that make software reliable in the first place. Interfaces, domain models, and explicit contracts are weakened or omitted with the expectation that scale, iteration, or regeneration will compensate later.

At a small scale, that can feel like velocity. As systems grow, the same approach tends to convert speed into fragility.

Why scale breaks in software engineering (and works elsewhere)

Scale (more data, more compute, more parameters) succeeds when errors can be averaged out and performance is measured statistically across many trials. In perception-heavy tasks such as image recognition or speech transcription, large datasets and repeated feedback enable learning-based methods to steadily reduce error rates. Those conditions are not the default in software engineering.

Software engineering is different. When you define an API contract, encode billing logic, or design an authorization boundary, you are specifying behavior that must hold at all times, across evolving system states. One wrong branch in a code path can cause outages, security vulnerabilities, or data integrity failures. The environment does not correct these mistakes. It exposes them.
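
The difference is easy to see in miniature. The sketch below is a hypothetical billing invariant; the `Payment` class and its fields are invented for illustration. The refund rule must hold on every code path, every time, which is exactly the kind of property that cannot be "mostly right":

```python
from dataclasses import dataclass

@dataclass
class Payment:
    captured_cents: int
    refunded_cents: int = 0

    def refund(self, amount_cents: int) -> None:
        # Invariant: total refunds never exceed the captured amount.
        # One wrong branch here is a data-integrity failure, not a
        # statistical blip that averages out over many trials.
        if amount_cents <= 0:
            raise ValueError("refund amount must be positive")
        if self.refunded_cents + amount_cents > self.captured_cents:
            raise ValueError("refund exceeds captured amount")
        self.refunded_cents += amount_cents

p = Payment(captured_cents=1000)
p.refund(400)
p.refund(600)   # ok: exactly exhausts the captured amount
```

A generator that emits this guard most of the time is not "mostly reliable"; the paths where the guard is missing are where the outage lives.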

Modern coding assistants make this contrast concrete. Tools like GitHub Copilot, Lovable, Replit, and similar systems can speed up code generation and scaffolding, especially for well-trodden patterns. But when tasks demand architectural judgment, nuanced domain logic, or system-wide invariants, the output becomes less reliable. Even on a common benchmark (HumanEval), an empirical study found that generated solutions were correct only a portion of the time across tools and model versions, despite appearing plausible.

This is not a knock on the tools. It’s a reminder of what scale optimizes for. Scale improves surface-level fluency and pattern completion. It does not automatically ensure system correctness, enforce invariants, or maintain long-horizon coherence.

In software engineering, reliability means invariants hold across change: reproducibility, contract stability, and security properties that do not regress.

LLMs, scale, and where compute begins to fracture

Large language models are a textbook example of what happens when massive scale meets a simple training objective. Next-token prediction on vast corpora produces dramatic improvements in fluency and recall. These models can summarize, translate, and generate source code that looks plausible at first glance. That is scale working.

But software engineering is not primarily a fluency problem. It’s a constraint and coherence problem. When an LLM generates code, it is not inherently grounding that output in an explicit system model, a business rules model, or enforceable invariants. Unless you build those mechanisms around it, the model is doing pattern completion, not contract enforcement.

Where this fractures shows up in a few repeatable ways:

  1. Variance across runs
    Probabilistic generation means you can get materially different solutions from similar prompts. Sometimes that’s helpful. In engineering workflows, it becomes a reliability issue unless outputs are contained, constrained, validated, and made reproducible.

  2. Local correctness that does not compose
    LLMs often produce snippets that are syntactically correct and sometimes functionally correct in isolation but do not compose into a coherent system. Boundary mismatches, inconsistent abstractions, and contract drift show up at integration time, not at generation time.

  3. AI-generated code security risks (what static analysis keeps finding)
    Empirical analyses of AI-generated code repeatedly find serious issues that look “fine” to a casual reviewer. For example, SonarSource reports that severe bugs such as resource leaks and API contract violations appear consistently across models and highlights high-severity vulnerabilities in generated outputs. Separate academic work also stresses that functional success is not a proxy for security or maintainability and that static analysis is essential for understanding the issue profile.

  4. Supply-chain hazards: slopsquatting and hallucinated dependencies
    When context is incomplete, models can produce plausible-looking identifiers that are wrong. In coding, that can include dependency names. The security community has discussed “slopsquatting,” in which attackers register hallucinated package names to exploit this behavior.
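
One way to act on these failure modes is a deterministic gate between generation and the codebase. The sketch below is a minimal illustration, not a production tool: it parses a generated Python module and rejects imports outside an assumed project allowlist, which is one concrete defense against both contract drift and hallucinated dependencies.

```python
import ast

# Assumed project policy: the only modules a generated file may import.
ALLOWED_IMPORTS = {"json", "math", "dataclasses"}

def validate_generated_module(source: str) -> list[str]:
    """Return policy violations; an empty list means the gate passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    errors = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        for name in names:
            if name not in ALLOWED_IMPORTS:
                # Unknown names fail closed: this is where hallucinated
                # or slopsquatted dependencies get stopped.
                errors.append(f"unapproved import: {name!r}")
    return errors

print(validate_generated_module("import json\nimport totally_made_up_pkg"))
```

The same shape generalizes: schema checks on generated configs, interface checks at module boundaries, lockfile verification for dependencies. The point is that the check is deterministic and runs before the output is trusted.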

None of this implies LLMs are useless for engineering. It implies that scaling transformer models is not a substitute for the aspects of software development that require enforceable structure. If you want reliability, you need a system that can contain, constrain, validate, and preserve correctness across change.

The sequel to the Bitter Lesson and what it means for systems

The Bitter Lesson showed that methods that scale with compute can outperform handcrafted heuristics in domains where feedback is abundant and errors can be averaged out. Applying that insight broadly to systems where correctness, reliability, and enforceable invariants are requirements leads to predictable fractures.

Large language models are a clear instantiation of this. They deliver fluency and pattern recall at scale. But those gains are rooted in objective simplicity and data volume, not in an architecture that enforces system constraints or carries obligations across a codebase.

Recent work seeks to formalize why certain failure modes persist as models scale, identifying limitations such as hallucination, context compression, reasoning degradation, retrieval fragility, and multimodal misalignment, and connecting them to deeper constraints from computation and information theory.

In software engineering workflows, the limits show up in ways engineers can measure:

  • Hallucinations in code-change-to-text tasks: One study found hallucinations in approximately 50% of generated code reviews and about 20% of generated commit messages, illustrating how structural context can be lost when outputs are produced probabilistically.

  • Security failure rates under constrained prompts: Veracode reports testing 100+ models across 80 tasks designed around common weakness classes, finding that a substantial fraction of outputs failed security tests under their setup.

  • Quality and security issue profiles visible under static analysis: Academic work examining thousands of coding assignments through static analysis argues that functional metrics alone are insufficient to assess production readiness.

These patterns are not flukes. They are symptoms of the same underlying dynamic: probabilistic models do not, by default, enforce invariants or preserve system-level contracts over time. They can be excellent at interpretation and drafting. But without a constraint layer, they cannot reliably guarantee the properties software teams care about most.

That leads to a practical conclusion for decision-makers: the winning architecture is not “scale only.” It is hybrid systems where probabilistic models handle ambiguity and interpretation, and deterministic mechanisms enforce structure, invariants, traceability, and safety. Humans remain central, not as last-minute reviewers of unpredictable output, but as designers of the workflow and custodians of the constraints.
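
The shape of that hybrid can be sketched in a few lines. Here `draft` stands in for any probabilistic generator and `check` for a human-authored deterministic constraint; both stand-ins are invented for illustration.

```python
from typing import Callable

def generate_with_gate(draft: Callable[[str], str],
                       check: Callable[[str], list[str]],
                       prompt: str, max_attempts: int = 3) -> str:
    """Accept a draft only when it passes every deterministic check."""
    for _ in range(max_attempts):
        candidate = draft(prompt)
        if not check(candidate):
            return candidate  # constraints hold: accept
    raise RuntimeError("no candidate satisfied the constraints; "
                       "escalate to a human instead of shipping")

# Toy stand-ins: a fake "model" and one human-authored constraint.
fake_model = lambda prompt: f"def handler():\n    return 'ok'  # for: {prompt}"
must_define_handler = lambda src: [] if "def handler" in src else ["missing handler()"]

print(generate_with_gate(fake_model, must_define_handler, "health endpoint"))
```

The probabilistic step proposes; the deterministic step disposes. Humans own the `check` functions, which is where the governance lives.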

In other words, scaling transformer models remains useful. But in software, scale must be governed.

Our approach at Rosetic

We took the Bitter Lesson seriously. We also took its limits seriously. That combination pushed us toward an architecture that differs from the industry’s scale-only default.

At Rosetic, we treat AI-assisted software development as a system, not a single model. We separate three roles on purpose:

LLMs handle probabilistic reasoning and interpretation. They are useful when inputs are messy and incomplete, such as product requirements, diagrams, documentation, and partial schemas. This is where flexibility matters.

Our Deterministic Language Model (DLM) enforces structure. The DLM is a deterministic engine that turns validated, structured requirements into repeatable outputs. In our workflow, we use an explicit System Model and compose code artifacts using controlled templates and rules, ensuring results are consistent, complete, and aligned with predefined architectural patterns. This is where deterministic accuracy matters.
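
As an illustration of the general pattern (the template and field names below are invented for this sketch, not Rosetic’s actual DLM), deterministic composition means the same validated spec always yields the same artifact:

```python
from string import Template

# Invented template and fields; real systems use richer models and many
# templates, but the property being shown is the same.
ENTITY_TEMPLATE = Template(
    "class $name:\n"
    '    table = "$table"\n'
)
REQUIRED_FIELDS = {"name", "table"}

def compose_entity(spec: dict) -> str:
    missing = REQUIRED_FIELDS - spec.keys()
    if missing:
        raise ValueError(f"spec missing fields: {sorted(missing)}")
    # Same validated spec in, byte-identical artifact out: no sampling,
    # no temperature, no run-to-run drift.
    return ENTITY_TEMPLATE.substitute(name=spec["name"], table=spec["table"])

spec = {"name": "Invoice", "table": "invoices"}
assert compose_entity(spec) == compose_entity(spec)  # reproducible by construction
print(compose_entity(spec))
```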

Humans orchestrate the system. Engineers decide what the system should do, validate the intent, and refine the model when real-world nuances arise. The goal is not to remove judgment. It is to stop wasting judgment on avoidable model errors.

The System Model is the bridge between intent and code. It is an explicit representation of the system that includes a data model, a UI model, a business rules model, and configuration details. Because it is explicit, it can be validated before anything is generated. Because it is structured, it can evolve over time without losing coherence.
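
Because the model is explicit data, coherence checks can run before any code exists. The following is a hypothetical miniature (entity, field, and rule names are invented): a business rule that references a field missing from the data model surfaces as a validation error, not a production bug.

```python
# Invented miniature of a system model: a data model plus business
# rules that must stay coherent with it.
system_model = {
    "data_model": {"Order": ["id", "status", "total_cents"]},
    "business_rules": [
        {"entity": "Order", "field": "status", "rule": "in {'open', 'paid'}"},
        {"entity": "Order", "field": "discount", "rule": ">= 0"},  # stale rule
    ],
}

def validate_model(model: dict) -> list[str]:
    """Flag rules that reference entities or fields the data model lacks."""
    errors = []
    for rule in model["business_rules"]:
        fields = model["data_model"].get(rule["entity"])
        if fields is None:
            errors.append(f"unknown entity: {rule['entity']}")
        elif rule["field"] not in fields:
            errors.append(f"{rule['entity']}.{rule['field']} is not in the data model")
    return errors

print(validate_model(system_model))
```

Here the stale `discount` rule is caught at validation time, before generation runs.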

This changes what reliability means in practice:

  • Deterministic accuracy where it matters: Structural layers are not guessed. They are composed from validated models and controlled patterns, reducing silent drift between runs.

  • No probabilistic generation at structural boundaries: The DLM does not “make up” interfaces or modules. It composes artifacts from known templates and validated inputs, designed to prevent missing methods, invented functions, or loosely connected components.

  • Engineers in control, end to end: Iteration is model-driven. When requirements change, the model is updated, re-validated, and re-authored. Our Smart ReAuthoring and safe-writing mechanics are designed to preserve user edits so teams do not end up in a regenerate-and-lose-work loop.

This is the practical response to the sequel of the Bitter Lesson. Scale remains powerful, but only when placed inside a system that can constrain, validate, and preserve structure. That is how teams move from messy input to software they can trust.

Closing Thoughts

If you are a technical leader evaluating AI-assisted development, the decision is no longer whether models can write code. They can.

The decision is whether your workflow can guarantee what matters after the first impressive demo: reproducibility, invariant enforcement, traceability, and a security posture that improves with time rather than decays.

Scale gives you speed and fluency. Structure gives you durability. The winning teams will combine both and treat “trust, but verify” as an engineering principle, not a slogan.

The Bitter Lesson was never “structure is obsolete.” The real lesson is subtler: scale wins in domains that allow it. Software engineering is not one of those domains unless you build the constraint layer that software demands.

References

  1. Richard S. Sutton, The Bitter Lesson (2019).
  2. Burak Yetiştiren et al., Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT (arXiv:2304.10778).
  3. SonarSource, The Coding Personalities of Leading LLMs (blog).
  4. SonarSource, The Coding Personalities of Leading LLMs (paper).
  5. Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam, Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics (arXiv:2508.08661).
  6. Veracode, Insights from 2025 GenAI Code Security Report.
  7. Veracode, AI-Generated Code: A Double-Edged Sword for Developers.
  8. Muhammad Ahmed Mohsin et al., On the Fundamental Limits of LLMs at Scale (arXiv:2511.12869).
  9. A. Sabra et al., Assessing the Quality and Security of AI-Generated Code (arXiv:2508.14727).
  10. Addy Osmani, The “Trust, But Verify” Pattern For AI-Assisted Engineering.
  11. Trend Micro, Slopsquatting: Hallucination in Coding Agents and Vibe Coding (PDF).
  12. Mend.io, The Hallucinated Package Attack: Slopsquatting Explained.
  13. Adnan Masood, Inside the AI IDE Boom: How Cursor, Copilot, and Replit Are Redefining the Craft of Code.
  14. Kuldeep Paul, LLM Monitoring: Detecting Drift, Hallucinations, and Failures.
