Attention is not enough: transformer fluency vs production reliability.
TL;DR:
The transformer productivity S-curve has reached its second inflection point and is flattening into a plateau. After an initial burst of exuberance, productivity gains are leveling off, and wherever precision matters, errors are becoming more visible. LLMs are exceptional pattern generators, yet unreliable as complete engineering systems. The industry keeps trying to push probabilistic models into deterministic workloads, and the results keep breaking. The way forward is hybrid: deterministic accuracy where structure matters, LLM reasoning where probabilistic creativity is practical, and humans in control throughout. That is how teams finally move from building broken prototypes to architecting a production-ready codebase.
The transformer productivity S-curve is flattening, and reliability is the ceiling
Software teams worldwide are running into the same barrier. LLMs deliver impressive fluency, yet the closer you get to production, the more apparent the gaps become. Outputs shift between runs. Reasoning breaks under real-world complexity. Models guess past the edge of their training instead of understanding what the system needs. The industry has quietly entered a phase where adding more parameters yields diminishing returns. You get more eloquence, not more reliability. This is a feature of LLMs, not a bug.
This pattern is visible across engineering organizations. Teams see speed in the early days, but as the codebase evolves, that initial boost gives way to fragility. The models can assist with language and pattern recall, yet they cannot hold the structural intent of an evolving system or predict the downstream consequences of their own suggestions.
This mismatch between probabilistic prediction and deterministic engineering becomes the true ceiling on productivity. It is not that the models are failing; it is that we are asking them to operate outside the domain for which they were designed. Put differently, you need a workflow that separates interpretation from deterministic structure, which we describe in Thinking outside of the LLM box.
This is why leading researchers have begun stating what many practitioners already feel. Ilya Sutskever, co-founder of OpenAI, noted that “the era of ‘Just Add GPUs’ is over,” because scaling alone is not fixing the reliability gap. Yann LeCun, Meta’s chief AI scientist, went further, describing LLMs as “a dead end” for reaching more advanced intelligence precisely because they lack causal understanding and rely on statistical correlations rather than grounded reasoning.
These statements matter not because of who said them, but because they describe what teams are experiencing in practice. The transformer productivity S-curve has flattened. We have extracted most of what scale can offer in the transformer architecture as we know it today. What comes next must be novel architecture, not only incremental compute.
As the industry continues to scale transformers, documented failures in real systems show that fluency without reliability is insufficient.
Hallucinations in code generation are an architectural constraint, not edge cases
Software teams are beginning to understand something that was not obvious in the early wave of excitement. Hallucinations are not strange anomalies or rare misfires. They are a natural byproduct of systems that generate what is statistically plausible rather than what is verified to be true. When a model is built to continue patterns rather than reason about the world, it sometimes presents confabulation—fluent, confident fabrication—as fact.
The deeper issue is not that LLMs make errors. Every tool makes errors. The issue is that LLMs produce answers that sound coherent, authoritative, and complete even when they are entirely ungrounded. In other words, they confabulate. In high-stakes environments such as software development, this gap between style and substance becomes a structural risk.
Hallucinations are not edge cases. They are the design.
Here are some real-world incidents that illustrate the point.
- A study by the University of Munich found that using LLMs to generate automotive software from official vehicle signal specifications produced code that appeared correct but contained hidden errors. These mistakes led to non‑compiling code, fake signals, and API misuse that could jeopardize safety‑critical functions.
- A study by Snyk showed that when developers relied on AI coding assistants, the models frequently hallucinated non‑existent software packages, leading teams to chase fake dependencies. This opened the door to slopsquatting attacks and caused build failures, security risks, and workflow disruptions.
- A multi‑university study on AI‑assisted software development found that while ChatGPT could generate, refactor, and optimize code across diverse tasks, it also introduced frequent hallucinations and security risks. This led to faulty or unstable code and exposed developers to threats such as prompt injection and data leaks.
These examples show different symptoms of the same underlying mechanism. A transformer does not know when it lacks knowledge, does not flag uncertainty, and does not reason about consequences. It simply continues the pattern. In domains where correctness, reliability, or safety matter, a probabilistic generator cannot behave as if it were a deterministic system.
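The package-hallucination failure above is exactly the kind of thing a deterministic guard catches cheaply: instead of trusting a suggested dependency, check it against a vetted list before anything is installed. A minimal sketch follows; the allowlist, function name, and package names (including the fake one) are hypothetical, not a real tool or API.

```python
# Deterministic guard against hallucinated dependencies: before installing
# anything an assistant suggests, check each name against an explicit
# allowlist (e.g. your organization's vetted registry mirror).
# All names below are illustrative placeholders.
APPROVED_PACKAGES = {"requests", "numpy", "pydantic"}

def vet_dependencies(suggested):
    """Split suggested package names into (approved, rejected)
    instead of installing them blindly."""
    approved = [p for p in suggested if p in APPROVED_PACKAGES]
    rejected = [p for p in suggested if p not in APPROVED_PACKAGES]
    return approved, rejected

# "fastjsonlib-pro" stands in for a plausible-sounding package that
# does not exist; the guard flags it rather than fetching it.
ok, flagged = vet_dependencies(["requests", "fastjsonlib-pro"])
print(ok, flagged)
```

The check is trivial, which is the point: a few lines of deterministic verification close a hole that no amount of model scaling does.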
This is why hallucinations must be treated as architectural constraints, not bugs. They expose the limits of an LLM-only approach and reinforce why production-grade systems will depend on hybrid workflows that combine deterministic accuracy and human judgment with the probabilistic creativity of LLMs.
Why software engineering exposes LLM limits faster than any other field
LLMs shine at language, but software is not language. It is structure, contracts, invariants, and side effects. It is a domain where correctness matters more than fluency, and where every suggestion has downstream consequences. This is why the limits of transformers become visible so quickly inside engineering teams.
The productivity gains are real. Developers complete tasks faster with AI assistance, but episodic speed is not the same as soundness, or as a net gain across the entire product development life cycle. In a controlled experiment by GitHub and MIT researchers, developers completed a coding task 55.8 percent faster with Copilot. A separate security analysis found that roughly 40 percent of Copilot's generated programs contained at least one vulnerability. Faster code is not the same as production-ready code, and professional teams cannot trade reliability for convenience.
Because syntax is easy. System design is not.
Practitioners feel this gap in their day-to-day workflows. This is the same product-shaped gap we call the Autonomous Void: tools and assistants accelerate surface-level output, but the backend and system glue still demand architecture, constraints, and integration. Dr. Robert Chatley, software engineering professor at Imperial College London, describes LLMs as energetic juniors who produce a burst of output with no understanding of architecture or future steps. You save time typing, but pay it back in code review, refactoring, and integration. Productivity becomes elastic. You gain in the moment but lose across the lifecycle.
This tension keeps appearing because the underlying architecture has not changed, and the transformer can only predict the next token. It does not know how to reason about constraints, coordinate multiple layers of a system, or safeguard invariants. The industry keeps trying to make probabilistic models behave like deterministic engines, but that is not what they were built to do. And until the architecture shifts, these limits will remain visible wherever correctness is non-negotiable.
A better path: hybrid AI software development for production-grade systems
The industry is beginning to accept that the next leap in AI will not come from larger transformers. It will come from systems designed for reliability, precision, and control. Scaling alone cannot close the gap between fluent output and trustworthy behavior. Architecture can.
Hybrid approaches are gaining momentum because they mirror how software is actually built. Deterministic components handle the parts of the work that require precision. Probabilistic reasoning helps where intent is ambiguous or information is incomplete. Engineers direct the workflow, choosing the right technique for each stage rather than forcing a single model to be everything at once.
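The hybrid pattern described above can be sketched as a loop: a probabilistic generator proposes, a deterministic validator accepts or rejects, and anything that repeatedly fails validation is escalated to a human rather than shipped. This is a generic illustration of the pattern, not Rosetic's actual implementation; `generate_candidate`, the field schema, and the retry limit are all assumptions made up for the example.

```python
# Hybrid generate-then-validate loop: the LLM may be creative, but only
# output that passes a deterministic check is accepted. Everything here
# is a simplified stand-in for illustration.

REQUIRED_FIELDS = {"endpoint": str, "method": str}

def validate(candidate):
    """Deterministic check: every required field is present with the right type."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in candidate:
            errors.append(f"missing field: {field}")
        elif not isinstance(candidate[field], ftype):
            errors.append(f"wrong type for {field}")
    return errors

def hybrid_step(generate_candidate, max_attempts=3):
    """Accept only validated output; otherwise hand off to a human reviewer."""
    for _ in range(max_attempts):
        candidate = generate_candidate()  # stands in for any LLM call
        if not validate(candidate):
            return ("accepted", candidate)
    return ("escalate_to_human", None)

status, result = hybrid_step(lambda: {"endpoint": "/users", "method": "GET"})
print(status)
```

The design choice worth noting is the final branch: when validation keeps failing, the system does not lower the bar; it routes the work to a person, keeping the engineer as the orchestrator the surrounding text describes.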
This is the philosophy we have embraced at Rosetic. Our approach combines LLMs, DLMs, and human expertise into a coherent system. The DLM (Deterministic Language Model) is our proprietary generative AI engine for structure. It is designed to produce predictable, verifiable, and repeatable outputs for the layers of software where correctness cannot drift. It enforces rules, patterns, domain models, and architectural constraints with the consistency that production systems demand. Where an LLM might guess, a DLM guarantees.
On top of that foundation, LLMs are used where they add real value. They interpret and structure messy requirements, infer intent from natural language, connect high-level ideas to concrete design, and provide the exploratory flexibility engineers need early in the lifecycle. The human remains the orchestrator and validator throughout, following a structured workflow that applies probabilistic reasoning or deterministic precision wherever each is most useful.
We see this hybrid model already proving itself in serious engineering environments. Deterministic accuracy forms the structural backbone. Zero hallucinations are enforced in code generation processes where accuracy matters. LLM reasoning is used where interpretation, synthesis, or intent translation is needed. The result is a workflow that keeps engineers in control, rather than forcing them to supervise unpredictable output from a single opaque model.
This is how teams move from messy input to bulletproof output.
Closing insight: the next S-curve is architecture, not scale
The transformer unlocked fluency, but fluency alone cannot carry software into the future. Reliability comes from architecture. Precision comes from structure. Control comes from systems that are designed, not improvised or guessed. The industry is learning this the hard way as LLM-only workflows strain under the weight of real engineering demands during implementation.
The next chapter belongs to teams that think differently. They will blend deterministic engines with probabilistic reasoning. They will treat AI as a coordinated system rather than a single model type. They will shift from hoping for reliability to designing for it.
This is the path we have chosen at Rosetic. Our hybrid approach, grounded in DLMs for structure, LLMs for interpretation and edge cases, and humans for judgment, reflects a simple belief. Software should be built, not guessed. Architecture should be intentional, not emergent. And AI should elevate engineers, not replace their understanding of how systems behave.
The future of AI-assisted software development will not be defined by scale. It will be defined by teams that build systems with the courage to rethink the foundation. Those who combine precision and flexibility will lead. Those who rely on probabilistic fluency alone will plateau.
The next S-curve is already forming. It belongs to the builders who design for trust from the start.
Coming up…
In part two, we’ll examine how this moment echoes a deeper lesson from AI history about compute, structure, and why scale alone eventually runs out.
References
• 36Kr Europe – Turing Award winner Yann LeCun: Large models are a “dead end” and limits of scaling
• VentureBeat – What Meta learned from the Galactica model failure
• ArXiv – GitHub and MIT Copilot productivity experiment (55.8 percent faster completion)
• Bender et al. – “On the Dangers of Stochastic Parrots”: warnings about LLMs mimicking understanding
• Gary Marcus (LinkedIn essay) – “Coordination without cognition” and critique of LLM-only reasoning
• Forbes – Usama Fayyad on LLMs lacking real understanding and risks in high-stakes use
• Snyk Labs – Bar Lanyado: Package Hallucination and AI-Driven Software Supply-Chain Attacks