Fulkerson Advisors

Why 95% of GenAI Pilots Die

Most pilots don't die because the model is bad. They die because nobody answered four operating-model questions before the pilot shipped. Here is the field-tested list, with case texture from inside Fortune 500 AI teams.

Christian Adib, Founder & Managing Partner
13 min read · May 30, 2026

In the autumn of 2024, a Caribbean conglomerate had eleven demand-forecasting notebooks running in parallel across four business units. Nine of them outperformed the legacy ERP forecast on every backtest the data science team ran. Six months later, exactly one of them was in production. The other ten had been quietly archived, not because they stopped working but because nobody could answer a basic question once the original analyst rotated off the project: who runs this on Monday?

That is the actual failure mode. Not the model. The Monday after the analyst leaves.

MIT's State of AI in Business 2025 report put the number at 95 percent of GenAI pilots failing to reach production. The figure has been quoted, reposted, and re-litigated to the point of cliche. What gets lost in the hand-wringing is that the failure rate is almost entirely explained by four operating-model questions, questions that almost no pilot-stage team asks before it starts and that almost every steering committee asks too late.

The Contrarian Claim

The model is not the bottleneck. It hasn't been for two years. Frontier model performance, retrieval quality, and orchestration tooling have all gotten cheap enough and good enough that any competent forward-deployed team can stand up a credible prototype in a fortnight. The bottleneck is the operating model that surrounds the model: who owns it, what it has to prove to earn more funding, whose P&L gets hit when it's wrong, and how it slots into the workflow it's supposed to displace.

Every pilot I have seen die in the last three years died for the same reason: the team building it answered the technical questions and ignored the operating ones. The technical questions are interesting and tractable. The operating questions are tedious, political, and require someone with seniority to commit to an unglamorous answer in writing. So they get deferred. Then the pilot ships, the demo goes well, the steering committee applauds, and ninety days later the project is dead because the four questions never got answered. They get asked, finally, at the post-mortem, which is the wrong meeting.

What follows is the list. I have watched each of these kill pilots that should have lived, at firms with the talent and the budget to know better. Treat it as a pre-flight checklist. If you can't answer all four in plain language before kickoff, your pilot is on a clock.

Question One: Who Owns the System on Day 91?

Every AI pilot has a champion. The champion is the person who pushed the funding through, recruited the team, and stood up at the all-hands to demo the prototype. The champion is also, statistically, the person most likely to be on vacation, on a rotation, or on a different project ninety days after launch. When that happens, the pilot inherits a successor. The successor inherits a system they did not build, a Slack channel of half-resolved bugs, a vendor contract they didn't negotiate, and a set of design decisions documented mostly in the original champion's head. Most of the time, the successor's first instinct is to put the system into maintenance mode and quietly let it rot. That is rational behavior given what they've been handed.

We saw this clearly inside a medical device manufacturer that had built an LLM-based voice agent for routine patient inquiries. The system was elegant: it integrated with their product documentation, their support knowledge base, and their CRM; it could handle roughly 60 percent of inbound calls end-to-end. The pilot champion was a director of digital innovation who had personally negotiated the model contract and tuned the system prompt. She got promoted into a different function eight months in. The new owner inherited the agent, looked at the operational runbook, and discovered there was no operational runbook. There was a README in a Git repository. The escalation path for a hallucination going out to a patient existed only as an informal Slack DM between the champion and the lead engineer.

We rebuilt the operating documentation in six weeks: a successor brief, a runbook with response-time SLAs by failure mode, a model-update protocol, a clear escalation tree from agent to human supervisor to medical director, and a quarterly review cadence with named owners on each line. None of that was novel. What was novel was that it existed at all. Most pilots ship without it because writing it is unfashionable work; it has no Jupyter notebook and no model card.

The test I now ask every pilot team to pass before they go live: name the day-91 owner, in writing, with their manager's signature, and give them a one-page brief they can hand to their own successor if they themselves rotate. If you can't do that, you don't have a pilot. You have a science fair project with a budget.

Question Two: What Is the Graduation Gate?

Almost no pilot I have audited had a written graduation gate. They had a kickoff deck with success metrics, usually phrased as 'demonstrate feasibility' or 'achieve X percent accuracy on the test set'. Neither of those is a graduation gate. A graduation gate is the specific, pre-committed, written criterion the pilot must clear to earn its next dollar. It has a number, a date, a measurement methodology, and a named decision-maker who will pull the trigger either way.

The absence of graduation gates is why so many pilots enter what I think of as the zombie zone: they don't get killed, but they don't get scaled either. They get a six-month extension, then another, then a vendor renewal, then a quiet line item in next year's budget. At one enterprise software firm we worked with on their GenAI portfolio strategy, we found seven active pilots. Two of them had been running for over a year. None of the seven had a written kill criterion. The CFO was, understandably, beginning to suspect that AI was a cost center with no exit ramp.

We rebuilt the portfolio around explicit graduation gates. Each pilot got a one-page contract: the specific business metric the pilot had to move, the threshold it had to clear, the date of the decision, the dollar cost of continuing versus killing, and the executive sponsor whose signature would commit to either path. Two of the seven pilots were killed within a month; both had been underperforming for quarters but had no formal mechanism to die. One was accelerated because its graduation gate was tied directly to a contract renewal already on the books. The remaining four entered structured iteration with quarterly review.
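The one-page contract described above can be sketched as a small record with a single decision rule. This is a minimal illustration, not the template used in the engagement: the class name, field names, and every example value are hypothetical, and it assumes a metric where higher is better.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GraduationGate:
    """One-page pilot contract. All field names are illustrative."""
    metric: str              # the business metric the pilot has to move
    threshold: float         # value the metric must reach or exceed
    decision_date: date      # date of the decision, whichever way it goes
    cost_to_continue: float  # dollars needed to fund the next stage
    cost_to_kill: float      # dollars needed to wind the pilot down
    sponsor: str             # executive who signs either outcome

    def decide(self, measured: float, today: date) -> str:
        """Return 'pending' before the decision date, then 'graduate' or 'kill'."""
        if today < self.decision_date:
            return "pending"
        return "graduate" if measured >= self.threshold else "kill"


# Placeholder values, invented purely for the example.
gate = GraduationGate(
    metric="forecast error reduction vs. legacy (percentage points)",
    threshold=5.0,
    decision_date=date(2026, 9, 30),
    cost_to_continue=400_000.0,
    cost_to_kill=50_000.0,
    sponsor="Executive sponsor",
)

print(gate.decide(measured=6.0, today=date(2026, 6, 1)))
print(gate.decide(measured=3.0, today=date(2026, 9, 30)))
```

The point of the frozen dataclass is that the gate cannot be edited after it is signed; an extension means writing a new gate, not moving the old one.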

What I have learned is that the graduation gate is the single highest-leverage artifact in an AI program. It costs nothing to write. It takes maybe four hours of a senior person's time. And it does two things that no demo or dashboard can do: it forces the team to commit, in writing, to what they actually believe the pilot can prove; and it gives the CFO a credible mechanism for capital discipline. The pilots that survive the gate get real funding. The ones that don't, die honestly. Both outcomes are better than zombie.

If you can't write the graduation gate before kickoff, you are not running a pilot. You are running a hobby with stakeholders.

Question Three: Whose P&L Absorbs the Cost of Being Wrong?

Every AI system is wrong some of the time. Even the good ones. The question that determines whether a pilot survives is not the error rate; it is who pays when the error happens. If the error is borne by a P&L that has no stake in the model's upside, the model will be killed at the first incident. If the error is borne by the same P&L that captures the value, the model will be defended through three or four embarrassing incidents while the team irons out the edges. This asymmetry is decisive.

Consider a project we ran with a US retailer operating 850-plus stores, facing a 12 percent labor-cost overrun in Q3. The data science team built a predictive store-profitability model paired with a labor-allocation optimizer. The technical work was strong; the model held up on out-of-sample stores and on multiple seasonal windows. The political work was harder. The model recommended labor cuts in stores where district managers had personal relationships with the store leaders. When the recommendations were wrong, and they were occasionally wrong, the district manager bore the cost of the angry phone call. Corporate finance captured the savings. That misalignment, not the model, was what nearly killed the project.

We restructured the rollout so that the district managers got an explicit savings credit when the model's recommendations played out, and an explicit allowance for override when their judgment disagreed. The model went from being a corporate edict imposed on the field to a tool whose accuracy directly affected the field's own scorecard. Adoption tripled in a quarter. The override rate, counterintuitively, dropped, because the district managers were now incentivized to engage with the model rather than reflexively dismiss it.

The lesson is that incentive design is not something you bolt on to a pilot after the model works. It is upstream of whether the model works at all, because adoption is part of the model's actual performance. A model with 92 percent technical accuracy and 30 percent adoption is, in production, roughly a 28 percent model (0.92 × 0.30 ≈ 0.28). The P&L question is how you get from 30 percent adoption to 80 percent: by aligning the cost of being wrong with the P&L that captures the value of being right.
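The arithmetic behind that figure is a single multiplication. The sketch below assumes, as the paragraph above implicitly does, that decisions made without the model earn no credit; the function name is hypothetical.

```python
def effective_accuracy(technical_accuracy: float, adoption_rate: float) -> float:
    """Share of all decisions that are both routed through the model and correct."""
    return technical_accuracy * adoption_rate


low_adoption = effective_accuracy(0.92, 0.30)
high_adoption = effective_accuracy(0.92, 0.80)

print(f"{low_adoption:.1%}")   # 27.6%
print(f"{high_adoption:.1%}")  # 73.6%
```

Under this simplification, moving adoption from 30 to 80 percent changes the production number far more than any plausible gain in technical accuracy would.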

I ask every pilot team to draw a one-page diagram before kickoff: arrows from value capture to error absorption. If the arrows don't meet at the same P&L, you have a political problem masquerading as a technical one. Solve the political problem first. The model is the easy part.

Question Four: What Is the Integration Path?

Integration debt always exceeds model cost. Always. I have never seen an exception. The model is 10 to 20 percent of total program effort. The other 80 to 90 percent is plumbing: identity, access management, the SSO integration, the audit log, the legal review, the data residency question, the API rate limit on the system of record, the change management for the workflow the system is replacing. Pilot teams chronically underestimate this. They build the model first, demo it standing alone, and only then discover that wiring it into the actual workflow will take three quarters and a dedicated platform team.

At a top-five US law firm, we built an LLM-agent system that conducts initial client interviews and auto-generates pre-litigation documents. The model was the easy part; we had a working prototype in three weeks. The integration was the project. The system had to read from the firm's matter-management software, write to the document management system, respect privilege and conflict checks, route through the firm's identity provider, and produce outputs that the firm's litigation support team could review under existing quality-control protocols. Each of those was a multi-week workstream with its own stakeholders and its own technical debt. The model is now in production and handling intake at scale. It got there because we mapped the integration path before we wrote the first line of inference code.

The pilot-stage failure pattern is consistent: the team builds against an idealized version of the workflow that exists only in their heads, demos it inside that idealized version, gets approval to proceed, and then collides with the actual workflow. The actual workflow has 14 systems, three vendor contracts, a compliance review, and an end-user population that has been doing it a particular way for nine years. None of that was in the demo. All of that has to be in production.

Before any pilot ships, I make the team write what we call a displacement map: the current workflow as it actually runs, every system it touches, every human handoff, every approval, and every exception path. Then we overlay the new workflow with the AI in it. Where the two maps differ, that is the integration work. If the integration work exceeds three months of platform effort, the pilot is not a pilot; it is a platform project pretending to be a pilot. Reclassify it, restaff it, and stop scoping it on a pilot budget. This single exercise has saved more programs than any model improvement I have ever shipped.
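A displacement map can be approximated as two sets of workflow steps plus an effort estimate for each step that differs. The sketch below is one possible encoding, not the format used on any engagement: the step names and month estimates are invented, and only the three-month limit comes from the rule of thumb above.

```python
from typing import Dict, Set

PILOT_LIMIT_MONTHS = 3.0  # rule-of-thumb limit on platform effort for a pilot


def integration_work(current: Set[str], proposed: Set[str]) -> Set[str]:
    """Steps that appear in only one of the two maps, i.e. where they differ."""
    return current ^ proposed


def classify(work: Set[str], effort_months: Dict[str, float]) -> str:
    """Label the effort as a pilot or a platform project by total estimated months."""
    total = sum(effort_months[step] for step in work)
    return "platform project" if total > PILOT_LIMIT_MONTHS else "pilot"


# Hypothetical workflow steps, for illustration only.
current = {"read matter system", "manual intake interview", "write to document system"}
proposed = {
    "read matter system",
    "agent intake interview",
    "write to document system",
    "route through identity provider",
}

work = integration_work(current, proposed)

# Invented estimates, in months of platform effort per differing step.
effort = {
    "manual intake interview": 0.5,          # retiring the old step
    "agent intake interview": 2.0,
    "route through identity provider": 1.5,
}

print(sorted(work))
print(classify(work, effort))
```

Steps that disappear from the current workflow are counted alongside steps that are added, since retiring a handoff usually carries its own change-management cost.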

What Graduation Actually Looks Like

A pilot worth scaling is one that has cleared all four questions on paper before the model has trained for an hour. The day-91 owner is named, with a one-page brief. The graduation gate is in writing, with a number and a date. The P&L absorbing the cost of being wrong is the same P&L capturing the value of being right. The displacement map is drawn, and the integration work is sized honestly. Only then does the model itself become the binding constraint, which is when the data science team gets to do its actual job.

I have started using a simple test with executives who ask whether their pilot will scale. I ask them to write the four answers on a single sheet of paper, in front of me, in under twenty minutes. If they can do it, the pilot has a real chance. If they can't, the work to be done is not more compute or a better model; it is twenty minutes of honest writing, followed by the harder weeks of getting the right signatures on what they wrote. That work is unfashionable. It does not produce a demo. It is, however, the work that determines whether your pilot lives past day 91.

The 95 percent failure rate is not a story about AI. It is a story about how Fortune 500 organizations buy and deploy any new operating capability when the technology is moving faster than the institution. The technology has stopped being the bottleneck. The institution has not yet caught up. The firms that will scale AI in the next two years are not the ones with the best models; they are the ones whose steering committees have learned to ask the four questions before kickoff, not at the post-mortem. The bottleneck is the operating model that surrounds the technology, and that is the work the slide deck won't show you.

Frequently asked

Why do most GenAI pilots fail to reach production?
MIT's 2025 research puts the failure rate around 95 percent. In our fieldwork, the cause is almost never the model itself. It is the absence of four operating-model decisions made before kickoff: who owns the system on day 91, what graduation gate it must clear to earn more funding, whose P&L absorbs the cost of errors, and what the integration path into the existing workflow actually looks like.
What is a graduation gate for an AI pilot?
A graduation gate is a pre-committed, written criterion that the pilot must clear to earn its next investment. It includes a specific business metric, a quantitative threshold, a decision date, the dollar cost of continuing versus killing, and a named executive sponsor who will commit to either path. Without one, pilots enter a zombie state: too inconclusive to scale, too established to kill.
How much of an AI program is the model versus the integration?
In our experience across Fortune 500 deployments, the model itself accounts for roughly 10 to 20 percent of total program effort. The remaining 80 to 90 percent is integration debt: identity and access, audit logging, legal review, data residency, system-of-record APIs, and change management for the workflow being displaced. Programs that scope on model effort alone consistently miss timelines by multiples.
Why does incentive alignment matter for AI adoption?
Any AI system is wrong some of the time. If the cost of being wrong falls on a P&L that does not capture the upside of being right, the system gets killed at the first error. When you align the two, so that the same operators who feel the errors also feel the savings, adoption rises substantially and override rates often fall, because users now have a stake in making the system better.
How do you handle the day-91 successor problem?
Every pilot needs a named day-91 owner before launch, with their manager's signature and a one-page successor brief they can hand off if they themselves rotate. The brief should include a runbook with SLAs by failure mode, an escalation tree, a model-update protocol, and a quarterly review cadence with named owners. Documentation that exists only in the original champion's head is the most common single failure mode we see.
When should you kill an AI pilot rather than extend it?
Kill it when it misses its written graduation gate and the gate was fair. Extend only when the team can articulate, in one page, what specifically they learned, what they would do differently, and what new gate they are committing to. The discipline is to make killing pilots a normal and respected outcome, not a career-limiting event. Healthy AI portfolios kill roughly a third of their pilots; portfolios that kill nothing are running on inertia, not evidence.
