From Pilot to Production: What We Learned Getting AI Past the Failure Rate
MIT's NANDA initiative recently published research showing that ninety-five percent of generative AI pilots fail to achieve meaningful business impact. Forrester tells a similar story: only ten to fifteen percent of enterprise AI projects make it into sustained production use, with more than sixty percent stalling due to integration issues, data quality problems, and workflow redesign delays.
We read these reports with recognition rather than surprise. The failure patterns they describe are precisely the ones we have learned to avoid. Across deployments in manufacturing, financial services, travel, eCommerce, ESG monitoring, real estate, and social infrastructure, we have developed an approach that addresses the specific failure modes identified by the research—not because we anticipated the research, but because we encountered the failures ourselves and had to find ways to overcome them.
What follows is not a commentary on what others do wrong. It is an account of what we have learned about taking AI from demonstration to production, and why that approach matters.
Starting with Operations, Not Technology
The most consequential decision in any AI deployment happens before any technology is selected: whether to start with what AI can do or with how work actually happens.
Most pilots start with the technology. They identify an impressive AI capability, build a demonstration, and then attempt to find a place for it in operations. This sequence is backwards, and it explains why so many pilots succeed in controlled environments but stall when they encounter production reality.
We start with operational mapping. Before discussing AI capabilities, we spend time understanding how work actually flows through an organisation—not the process documentation, which describes how things should happen, but the actual patterns of activity, decision-making, and exception handling. Where does time get consumed? Where do exceptions cluster? Where does institutional knowledge live, and what happens when the people who hold it are unavailable? Where are the handoffs, the bottlenecks, the points where information gets lost or degraded?
This mapping reveals something important: the opportunities for AI are rarely where organisations initially assume. The MIT research found that more than half of AI budgets go to sales and marketing tools, yet the biggest returns come from back-office automation. Our experience is consistent with this. The most tractable opportunities are often in operations—quote generation, compliance workflows, document processing, field intelligence—where processes are structured enough to be automated but complex enough that simple rule-based systems cannot handle the variability.
Operational mapping also surfaces the integration requirements, the data dependencies, and the workflow changes that will be necessary. These are the issues that cause sixty percent of pilots to stall, according to Forrester. By identifying them before building anything, we can design systems that address them from the beginning rather than discovering them when it is time to move from pilot to production.
Designing for Governance from Day One
A pattern we see repeatedly: organisations build AI systems that work impressively in demonstrations, then discover that satisfying governance, compliance, and auditability requirements requires fundamental redesign. The pilot stalls while the system is rebuilt—if it ever gets rebuilt at all.
We design for governance from the beginning because we have learned that retrofitting it does not work. Audit trails, escalation paths, human oversight mechanisms, compliance constraints, behavioural boundaries—these are not features to be bolted on when it is time to go live. They are architectural foundations that shape every other design decision.
In financial services, this means defining precisely what the system will and will not do before writing any code. It will explain product features; it will not provide advice. It will show historical data; it will not predict returns. It will facilitate transactions that the user initiates; it will not make autonomous decisions about their money. These boundaries are not limitations imposed after the fact; they are design parameters that determine the system's architecture.
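To make this concrete, the sketch below shows one way such boundaries can be expressed as an explicit, inspectable policy layer rather than as prompt wording. It is a minimal illustration in Python; the capability names and the check_request function are assumptions made for the example, not our production interfaces.

```python
# Minimal sketch (illustrative, not production code): behavioural boundaries
# encoded as an explicit policy layer that every request must pass through.
from dataclasses import dataclass
from enum import Enum, auto


class Capability(Enum):
    EXPLAIN_PRODUCT = auto()     # allowed: explain product features
    SHOW_HISTORY = auto()        # allowed: show historical data
    EXECUTE_USER_ORDER = auto()  # allowed: act on a user-initiated instruction
    GIVE_ADVICE = auto()         # out of scope by design
    PREDICT_RETURNS = auto()     # out of scope by design
    AUTONOMOUS_TRADE = auto()    # out of scope by design


ALLOWED = {Capability.EXPLAIN_PRODUCT, Capability.SHOW_HISTORY,
           Capability.EXECUTE_USER_ORDER}


@dataclass
class PolicyDecision:
    permitted: bool
    reason: str


def check_request(capability: Capability, user_initiated: bool) -> PolicyDecision:
    """Evaluate a classified request against the design-time boundaries."""
    if capability not in ALLOWED:
        return PolicyDecision(False, f"{capability.name} is outside the system's defined scope")
    if capability is Capability.EXECUTE_USER_ORDER and not user_initiated:
        return PolicyDecision(False, "transactions must be explicitly initiated by the user")
    return PolicyDecision(True, "within defined scope")
```

Because the boundaries live in one reviewable place, compliance teams can inspect them directly, and the rest of the architecture is built around what this layer permits.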
In manufacturing, governance means ensuring that every quote generated is traceable—what inputs were used, what logic was applied, what human oversight occurred. If a pricing error emerges months later, the organisation must be able to reconstruct exactly what happened and why.
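As a rough sketch of what that traceability can look like in practice, assuming hypothetical field names and a hypothetical fingerprint method, an audit record might be stored alongside every quote:

```python
# Illustrative sketch only: a self-contained audit record written alongside
# each quote so that the decision can be reconstructed months later.
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class QuoteAuditRecord:
    quote_id: str
    inputs: dict                # the raw inputs the quote was derived from
    pricing_rules_version: str  # which version of the pricing logic was applied
    model_version: str          # which model or prompt revision was in use
    reviewed_by: str | None     # the human who approved the quote, if reviewed
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def fingerprint(self) -> str:
        """Stable hash of the record, so later tampering or drift is detectable."""
        payload = json.dumps(asdict(self), sort_keys=True, default=str).encode()
        return hashlib.sha256(payload).hexdigest()
```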
In newcomer support and social infrastructure, governance means respecting agency—the system acts when asked, confirms before taking action, and knows which questions require human judgement rather than automated response.
The specifics vary by sector and context, but the principle is constant: governance requirements shape architecture. Systems designed this way can move from pilot to production without the redesign delays that stall most deployments.
Building Systems That Learn from Operation
The MIT research identifies a "learning gap" as a core reason enterprise AI fails: generic tools excel for individuals because of their flexibility, but they stall in enterprise use because they do not learn from or adapt to workflows.
This observation points to an architectural requirement that most deployments miss. A system that handles the same transaction the same way on day one and day three hundred has not improved through operation. It is a static tool, not an intelligent system. Its value does not compound.
We build systems designed to learn. Every transaction generates data about what worked, what failed, and what edge cases emerged. This data feeds back into the system's operation—not through periodic retraining, but through memory architectures that accumulate institutional knowledge over time.
When a manufacturing system processes a thousand quotes, it has encountered configurations and customer situations that no initial design anticipated. A system that learns from this exposure handles the thousand-and-first quote differently than the first—not because we updated it, but because it has accumulated understanding of how this specific organisation's products, customers, and pricing logic work.
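The sketch below illustrates the shape of that accumulation, under deliberately simple assumptions: handled cases are stored with their outcomes and retrieved by feature overlap when a similar case arrives. The class names and the retrieval heuristic are ours for the example; a real deployment would use richer representations of similarity.

```python
# Simplified sketch of operational memory: every handled case is recorded with
# its outcome, and similar past cases are retrieved before handling a new one.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HandledCase:
    features: frozenset[str]  # e.g. {"custom_voltage", "rush_delivery"}
    outcome: str              # the resolution that worked
    needed_human: bool        # whether the case escalated to a person


class OperationalMemory:
    def __init__(self) -> None:
        self._by_feature: dict[str, list[HandledCase]] = defaultdict(list)

    def record(self, case: HandledCase) -> None:
        for feature in case.features:
            self._by_feature[feature].append(case)

    def similar(self, features: set[str], limit: int = 5) -> list[HandledCase]:
        """Return past cases sharing the most features with the new one."""
        candidates = {id(c): c for f in features for c in self._by_feature.get(f, [])}
        ranked = sorted(candidates.values(),
                        key=lambda c: len(c.features & features),
                        reverse=True)
        return ranked[:limit]
```

The point is not the retrieval mechanics but the direction of flow: every transaction leaves something behind that the next transaction can use.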
When a customer support system resolves ten thousand enquiries, the patterns of what works—which approaches de-escalate frustrated customers, which explanations actually clarify confusion, which situations require human intervention—become encoded in the system's operation.
This accumulated knowledge is what creates durable value. The AI technology is commoditised; any competitor can access the same models. The institutional memory built through months or years of operation cannot be replicated. It is an asset that grows more valuable over time, and it belongs to the organisation, not to us.
Moving to Production Early, with Appropriate Oversight
There is a temptation in AI deployment to extend pilots indefinitely—to keep refining in controlled environments until the system is ready for production. This approach feels prudent but is actually counterproductive. The learning that matters most happens in production, not in pilots.
We move systems into production early, with appropriate human oversight. The goal is not a flawless demonstration but a working system handling real transactions in real conditions where genuine learning can occur.
This requires designing for supervised operation from the beginning. The system must be transparent about what it is doing and why. It must surface situations where it is uncertain or where outcomes should be reviewed. It must make human oversight efficient rather than burdensome.
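One way to picture that discipline, with thresholds and risk tiers that are purely illustrative, is a routing rule that decides for each proposed action whether the system proceeds or a human reviews:

```python
# Sketch of supervised operation: the system acts only when it is confident and
# the action is low-risk; everything else is queued for review with the
# reasoning attached, so oversight is fast rather than burdensome.
from dataclasses import dataclass


@dataclass
class ProposedAction:
    description: str
    confidence: float  # the system's own estimate, 0.0 to 1.0
    risk_tier: str     # "low", "medium", or "high", set by governance rules
    rationale: str     # why the system proposes this, shown to reviewers


def route(action: ProposedAction, review_queue: list[ProposedAction],
          confidence_floor: float = 0.85) -> str:
    """Decide whether an action executes automatically or goes to a human."""
    if action.risk_tier != "low" or action.confidence < confidence_floor:
        review_queue.append(action)
        return "queued_for_review"
    return "auto_executed"
```

Over time, the confidence floor and the set of auto-approved actions can widen as the system earns trust, which is what earning autonomy in stages means in practice.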
Early production deployment serves multiple purposes. It generates the operational data that enables learning. It surfaces edge cases and failure modes that controlled environments do not reveal. It builds organisational familiarity and trust through experience rather than demonstrations. And it compresses the timeline from investment to value, which matters for maintaining organisational commitment through the inevitable challenges of deployment.
The alternative—extended pilots followed by big-bang production deployment—is how promising initiatives lose momentum and stakeholder support.
Empowering Line Managers, Not Just Central AI Teams
The MIT research specifically identifies this as a success factor: deployments driven by line managers succeed more often than those driven solely by central AI teams.
The reason is straightforward. Central AI teams understand technology but often lack deep knowledge of how work actually happens in specific operational contexts. Line managers understand operational reality—the exceptions, the workarounds, the reasons why documented processes do not match actual practice. Solutions designed without this knowledge often solve the wrong problems or create new friction that undermines adoption.
We structure our engagements to involve operational stakeholders throughout, not just at requirements gathering and final deployment. The people who will use the system or whose work will change because of it are involved in design decisions, in testing, in early production oversight. Their feedback shapes how the system evolves.
This is not stakeholder management or change management in the conventional sense. It is a recognition that operational knowledge is a necessary input to building something that works. Line managers are not obstacles to be managed; they are sources of insight that central teams do not have.
Choosing Partners Who Have Solved These Problems Before
MIT's findings on build versus buy deserve attention: purchased solutions with vendor partnerships succeed about sixty-seven percent of the time, while internal builds succeed only one-third as often.
The gap reflects experience. Vendors who have deployed similar solutions in multiple contexts have already encountered the integration challenges, the governance requirements, and the workflow adaptation needs that internal teams face for the first time. They have built scaffolding—context management, memory architecture, compliance frameworks—that internal teams must construct from scratch.
We do not position this as a reason to choose us over internal development. We position it as a factor organisations should consider honestly when evaluating approaches. Internal builds can succeed, and there are contexts where they make sense. But organisations should enter them with realistic expectations about timelines and success rates.
What matters is not whether to build or buy, but whether the approach—whatever it is—addresses the failure modes that cause ninety-five percent of pilots to stall. Does it start with operational understanding? Is it designed for governance from day one? Does it build systems that learn? Does it move to production early with appropriate oversight? Does it involve line managers throughout?
These questions apply regardless of whether the work is done internally or with external partners. They are the questions that distinguish the five percent from the ninety-five.
The Research Confirms the Pattern
We did not design our approach based on MIT or Forrester research. We developed it through trial and error across sectors and geographies, learning from our own failures and adjustments. The research confirms what we learned through experience: the failure modes are predictable, and the approaches that address them are knowable.
The integration wall that stalls sixty percent of pilots is addressed by operational mapping that surfaces integration requirements before building anything. The governance gap is addressed by designing for compliance and auditability from day one. The learning gap is addressed by architectures that accumulate institutional knowledge through operation. The workflow redesign delays are addressed by progressive deployment that allows processes to adapt incrementally. The disconnect between central AI teams and operational reality is addressed by involving line managers throughout.
None of this is proprietary insight. It is pattern recognition from doing this work repeatedly across different contexts. What is perhaps distinctive is the discipline to apply these patterns consistently rather than take shortcuts that seem faster but lead to the stalls the research documents.
What This Means for Organisations Evaluating AI
The ninety-five percent failure rate is not a statement about AI capability. The technology works. The models are impressive. The failure rate is a statement about how deployments are structured.
Organisations that structure deployments differently can reasonably expect different outcomes. The questions to ask when evaluating AI initiatives are not primarily about the AI—its benchmarks, its capabilities, its impressive demonstrations. They are about the approach: Does it start with operational reality? Does it address governance from the beginning? Does it build systems that learn? Does it move to production quickly with appropriate oversight? Does it involve the people who understand how work actually happens?
These are the questions that determine whether a pilot becomes production or becomes another entry in the ninety-five percent.
We have answered them enough times, in enough contexts, to be confident in the approach. Not because we are smarter than others, but because we have been doing this long enough to have learned from our own mistakes and refined our methods accordingly.
The research now documents at scale what we learned through practice. For organisations serious about getting AI into production, that is useful validation. For us, it is confirmation that the patterns we have been following are the right ones.
—
We deploy agentic AI systems across manufacturing, financial services, travel, eCommerce, ESG monitoring, real estate, and social infrastructure. Our approach is built around the operational mapping, governance-first design, learning architectures, and line manager involvement that distinguish successful deployments from the ninety-five percent that stall. If you are evaluating how to move AI from pilot to production, we would welcome the conversation.
Mitochondria builds ATP — agentic AI for operations. It learns your workflows, earns autonomy in stages, and runs with governance built in. Your data stays yours. Based in Amsterdam and Pune, working with organisations across Europe and India.
—
https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
https://economictimes.indiatimes.com/tech/information-tech/forrester-picks-holes-in-its-ai-story-says-just-10-15-pilots-scale/articleshow/127032256.cms