The Shift from Prototype to Production Is Accountability

June 2026 · 8 min read

Most AI demos look impressive.

  • A question is asked.
  • A model responds.
  • The answer sounds intelligent.

Everyone leaves the room believing the hard part is solved.

But production systems are not judged by how they perform when everything goes right. They are judged by what happens when things go wrong.

Over the last year, I've spent time building AI systems for workflows where trust matters. Not demos. Not prototypes. Systems used by real people making real decisions.

What surprised me most was how little of the work was about the model itself. The model was often the easiest part. The difficult part was everything around it.

  • Routing.
  • State.
  • Context.
  • Caching.
  • Feedback.
  • Evals.
  • Observability.

Together, these make up the production system around the model.

From answering to accountability

In a prototype, the question is usually: "Can the model answer this?" In production, the questions become very different.

  • Did the system choose the right workflow?
  • Did it retrieve the right information?
  • Did it respect permissions?
  • What happens when the user changes their mind halfway through?
  • Can it recover from mistakes?
  • Can we explain why it answered the way it did?
  • Can we improve it after it fails?

Those are accountability questions. And accountability changes architecture.

Users interact with a system, not a model

One of the first lessons we learned was that users are not interacting with a model. They are interacting with a system.

A document question is different from onboarding. Onboarding is different from research. Research is different from a transaction workflow. Each requires different tradeoffs between speed, cost, trust, and human oversight.

The first production decision is often not what to answer. It is where to route.

This is why routing becomes one of the most important components in production AI. Model selection becomes a routing decision. Some requests need no model at all. Some require retrieval. Some require tools. Some require stronger reasoning. Some require human review.

The goal is not to use the largest model everywhere. The goal is to send each request down the right path.
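
To make the routing idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular system: the route names mirror the paths listed above, while the workflow labels ("faq", "document", "transaction") and the high_stakes flag are hypothetical. A real router might use a classifier instead of fixed rules.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    NO_MODEL = "no_model"                  # answered by a lookup or fixed rule
    RETRIEVAL = "retrieval"                # needs documents fetched first
    TOOLS = "tools"                        # needs to call an external tool
    STRONG_REASONING = "strong_reasoning"  # needs a more capable model
    HUMAN_REVIEW = "human_review"          # a person must approve the outcome


@dataclass
class Request:
    text: str
    workflow: str              # hypothetical label assigned upstream
    high_stakes: bool = False  # hypothetical flag, e.g. set by policy rules


# Hypothetical mapping from workflow label to the cheapest adequate path.
WORKFLOW_ROUTES = {
    "faq": Route.NO_MODEL,
    "document": Route.RETRIEVAL,
    "transaction": Route.TOOLS,
}


def choose_route(req: Request) -> Route:
    """Pick a path for the request, checking the most restrictive case first."""
    if req.high_stakes:
        return Route.HUMAN_REVIEW
    # Anything without a known cheaper path falls through to stronger reasoning.
    return WORKFLOW_ROUTES.get(req.workflow, Route.STRONG_REASONING)
```

The point of the sketch is the ordering: oversight is decided before cost, and the strongest model is the fallback rather than the default.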

Chat is not the product. State is.

Most AI experiences feel impressive during the first interaction. The challenge begins during the fifth interaction. Or the fiftieth.

That is when users change information, ask side questions, leave and return later, resume a workflow, or correct prior assumptions.

Without state, conversations feel like isolated requests. With state, they begin to feel like products.
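
One way to picture state is as a small record that outlives any single message. The sketch below is a hedged illustration: the class name, its fields, and the choice of JSON for persistence are all assumptions, chosen only to show correction and resumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class WorkflowState:
    workflow: str                               # hypothetical workflow name
    step: str                                   # where the user currently is
    fields: dict = field(default_factory=dict)  # values collected so far

    def correct(self, name: str, value) -> None:
        """Overwrite an earlier value when the user changes their mind."""
        self.fields[name] = value

    def to_json(self) -> str:
        """Serialize so the workflow can be stored between sessions."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "WorkflowState":
        """Restore a stored workflow so the user can resume where they left off."""
        return cls(**json.loads(raw))


state = WorkflowState("onboarding", "confirm_address", {"name": "Ada"})
state.correct("name", "Ada L.")      # the user corrects a prior assumption
saved = state.to_json()              # the user leaves
resumed = WorkflowState.from_json(saved)  # and returns later at the same step
```

A side question does not need to touch this record at all, which is what lets the workflow continue afterward.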

The right context beats more context

Context turned out to be similarly misunderstood. Many discussions about AI focus on retrieval: can we find relevant information? The harder question is whether that information should be used in this situation, for this user, in this workflow.

The challenge is not retrieval. It is context assembly: permissioning, ranking, budgeting, selecting what matters and excluding what does not.

In production, more context is not necessarily better. The right context beats more context.
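
Context assembly can be sketched as three steps applied in order: filter by permission, rank, then fit a budget. The example below is a minimal illustration. It assumes each chunk already carries a relevance score, a set of allowed roles, and a token count; those fields and the role-based permission model are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float              # relevance score, assumed to come from a retriever
    allowed_roles: frozenset  # hypothetical role-based permissions
    tokens: int               # token count, assumed precomputed


def assemble_context(chunks, user_role, token_budget):
    """Return the permitted chunks, best first, that fit within the budget."""
    # Permissioning: drop anything this user may not see, however relevant.
    permitted = [c for c in chunks if user_role in c.allowed_roles]
    # Ranking: most relevant first.
    ranked = sorted(permitted, key=lambda c: c.score, reverse=True)
    # Budgeting: greedily keep chunks while they still fit.
    selected, used = [], 0
    for chunk in ranked:
        if used + chunk.tokens <= token_budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected
```

Note that permissioning runs before ranking, so a highly relevant but forbidden chunk never competes for the budget.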

Trust breaks at recovery, not at mistakes

Perhaps the biggest surprise was how users actually break systems. Most failures are not dramatic hallucinations. They are recovery failures.

A user says "good" instead of "looks good," changes information halfway through a workflow, asks a side question, or leaves and returns later.

The model may be intelligent enough to answer. But the workflow is often too brittle to recover. Trust does not break when a system makes a mistake. It breaks when the system cannot recover from one.
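
As a small illustration of a less brittle confirmation step, the sketch below normalizes the reply before matching and falls back to asking for clarification instead of failing. The phrase lists and the four outcome labels are assumptions made for this example; a production system would likely use something richer than exact phrase matching.

```python
import re

# Illustrative phrase lists, not an exhaustive vocabulary.
AFFIRMATIVE = {"good", "looks good", "yes", "ok", "okay", "confirm"}
NEGATIVE = {"no", "wrong", "not right"}


def interpret_confirmation(utterance: str) -> str:
    """Classify a reply at a confirmation step without dead-ending the workflow."""
    # Normalize case, punctuation, and spacing so "Looks good." matches "looks good".
    text = re.sub(r"[^a-z\s]", "", utterance.lower())
    text = re.sub(r"\s+", " ", text).strip()
    if text in AFFIRMATIVE:
        return "confirm"
    if text in NEGATIVE:
        return "reject"
    if utterance.strip().endswith("?"):
        # Answer the side question, then return to the same step.
        return "side_question"
    # Unknown input: ask again rather than abort.
    return "clarify"
```

The important branch is the last one: an unrecognized reply leads to a question, not to a broken workflow.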

Feedback, evals, and observability

This is where feedback and evaluations become critical. Every correction should become product data. Not a Slack message. Not a support ticket. A permanent improvement to the system.
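
One hedged way to read "product data" is that each correction becomes a stored evaluation case. The sketch below assumes a simple record shape and a JSON Lines file as the destination; the field names and the storage format are illustrative choices, not something the workflow requires.

```python
import json
from dataclasses import dataclass


@dataclass
class Correction:
    request: str           # what the user asked
    system_answer: str     # what the system said
    corrected_answer: str  # what it should have said
    source: str = "user"   # hypothetical provenance label


def to_eval_case(c: Correction) -> dict:
    """Turn a correction into a case that future releases are tested against."""
    return {
        "input": c.request,
        "expected": c.corrected_answer,
        "rejected": c.system_answer,
        "source": c.source,
    }


def to_jsonl_line(case: dict) -> str:
    """Serialize one case as a single line for an append-only dataset file."""
    return json.dumps(case, sort_keys=True)
```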

The best production teams treat evaluations as release gates. Not quality reports. Not dashboards. Release gates.
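
A release gate can be as simple as a function that returns pass or fail for a candidate build. The sketch below is a minimal version under loud assumptions: cases are dictionaries with "input" and "expected" keys, answers are graded by exact match, and the 0.95 threshold is an arbitrary example value.

```python
def release_gate(cases, answer_fn, min_pass_rate=0.95):
    """Return (allowed, pass_rate) for a candidate system.

    answer_fn is the candidate under test: it maps an input string to an answer.
    Exact-match grading is an assumption; real graders are often more lenient.
    """
    if not cases:
        # No evidence is treated as a failed gate, not a free pass.
        return False, 0.0
    passed = sum(1 for c in cases if answer_fn(c["input"]) == c["expected"])
    rate = passed / len(cases)
    return rate >= min_pass_rate, rate
```

What makes this a gate rather than a report is the boolean: the deployment step checks it and stops when it is false.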

Finally, there is observability. You cannot improve what you cannot see. Every request should tell a story.

  • What route was chosen?
  • What context was retrieved?
  • What model was used?
  • How long did it take?
  • Did it pass evaluation?
  • Could a human explain the outcome?

Without observability, AI systems become impossible to debug. With observability, they become improvable.
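
The questions above map naturally onto one structured record per request. The sketch below is illustrative only: the field names, the handler convention, and the JSON output are assumptions, and a real system would more likely send this to a tracing or logging backend.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RequestTrace:
    request_id: str
    route: str = ""                                   # what route was chosen
    context_ids: list = field(default_factory=list)   # what context was retrieved
    model: str = ""                                   # what model was used
    latency_ms: float = 0.0                           # how long it took
    eval_passed: Optional[bool] = None                # did it pass evaluation
    explanation: str = ""                             # human-readable reason


def run_traced(request_id, handler):
    """Run handler(trace), time it, and return (result, JSON trace line).

    The handler is a hypothetical callable that fills in the trace as it works.
    """
    trace = RequestTrace(request_id)
    start = time.perf_counter()
    result = handler(trace)
    trace.latency_ms = (time.perf_counter() - start) * 1000
    return result, json.dumps(asdict(trace))
```

Each line answers the list above for one request, which is what lets someone reconstruct the story afterward.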

The shift is organizational, not technical

This is why I believe the most important shift from prototype to production is not technical. It is organizational. The question changes.

The prototype asks: Can the model answer? Production asks: Can the system be trusted? Can it recover? Can it improve? Can it be held accountable?

The model answers. The system learns. And increasingly, the teams that win will not be the ones with access to the strongest models. They will be the teams that build systems capable of learning, measuring, recovering, and earning trust over time.
