What we've learned running agentic loops in production

A year ago, "agentic AI" mostly meant demos. Somebody would show a loop where GPT-4 called a few tools, checked its own output, and then declared the task done. It looked impressive. It rarely held up under real conditions.

We've been building and running production agents for clients since early 2025: not demo loops, but systems that run nightly, handle exceptions, and get reviewed by a real person who has opinions about whether the output is correct. Here's what we've learned.

Small scope wins every time

The agents that have survived in production all have one thing in common: they do one thing. A nightly brief that pulls yesterday's Shopify numbers and writes a three-sentence summary. A weekly digest that scans a client's inbox for emails that need a follow-up and surfaces them in Slack. A content QA pass that checks new blog posts against a style guide and flags violations.

The agents that failed were the ones we designed to handle a whole workflow end-to-end. Too many decision points. Too many ways to get stuck. When something went wrong (and something always went wrong) it was hard to know where in the chain the failure had happened.

Start with one input, one transform, one output. You can chain them later once each link is trustworthy.

The model is not the hard part

Every client we work with initially wants to talk about which model to use. Should we use Claude? GPT-4o? Gemini? The honest answer is that for most production agent tasks, the model choice matters less than the scaffolding around it.

What matters more:

How the task is specified. Vague instructions produce vague outputs. The best-performing agents have prompts that were written, tested, revised, and tested again against a set of real examples.
How errors are handled. What happens when the agent can't find what it's looking for? Does it fail loudly or silently? Loud failures are almost always better.
How outputs are validated. For anything consequential, a second pass (either another model call or a simple rule-based check) is worth the latency cost.

Human review is not a failure mode

The framing that agents should eventually replace human review is wrong for most early-stage use cases. The goal is not to remove the human. It's to make the human's time go further.

An agent that does 80% of the work and surfaces a clean decision for a human to make in thirty seconds is more valuable than an agent that tries to do 100% and gets it right 85% of the time. The latter creates more work: you have to audit everything because you don't know which 15% is wrong.

We design every agent with a review step. Sometimes that review step becomes a formality after three months because the error rate has dropped to near zero. Sometimes it stays substantive. Either way, it was the right starting point.

What's actually different in 2026

The thing that has genuinely changed in the last twelve months is reliability. Claude 3.5 and GPT-4o, running in mid-2024, would occasionally just stop following instructions in a long context or confuse themselves mid-task in ways that were hard to predict. The current generation is meaningfully better at holding a plan across a long context and at recognizing when it's stuck rather than plowing forward with a wrong answer.

That doesn't mean you should trust agents with consequential tasks unsupervised. It means the baseline is higher and the scaffolding you need to build is somewhat lighter.

The part nobody talks about

The operational overhead of running agents in production is real and underestimated. You need somewhere for logs to go. You need alerts when something breaks. You need a way to replay a failed run with the same inputs. You need documentation that a person who isn't you can use to understand what the agent is supposed to do and how to tell if it's doing it.

None of this is glamorous. All of it is what separates a production agent from a demo.

If you're thinking about where to start with agents in your own stack, I'm happy to talk through it: get in touch.