Agents in Practice: When They Work and When They Fail - Xap.es

AI agents generate enthusiasm disproportionate to their current maturity. This is not because they are useless: there are cases where they produce real and significant value. It is because the gap between what they can do under ideal conditions and what they do in real environments, with all their complexity and variability, is still large.

This chapter is pragmatic: when it makes sense to use agents, when it does not, and how to design them so they fail in a manageable rather than catastrophic way.

The promise and the reality

The promise of agents: “tell the AI what result you want and go take care of something else. When you come back, it will be done.”

The reality in 2025–2026: agents work well on structured tasks with reliable tools and clear objectives. They fail more often than is acceptable on open-ended tasks, in variable environments, or with tools that produce unexpected outputs.

SWE-bench, an agent evaluation benchmark built on real code bug-fixing tasks, shows that the best agents resolve 40–50% of the tasks. On the rest, which is half or more, the agent fails, gets stuck or produces incorrect results. In a production environment, a 50% success rate may be sufficient if the failures are detectable and manageable. It can be catastrophic if they are not.

When agents work well

Repetitive tasks with predictable structure. The customer service agent that classifies tickets, extracts information and routes them according to defined criteria works well because the process is always the same and errors are detected quickly.

Data processing pipelines. Download data from an API, clean it, transform it, load it into a database, generate a report. Each step is deterministic, errors are clear and the process can be restarted from the point of failure.
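The "restart from the point of failure" property can be sketched as follows. This is a minimal illustration, not a production design: `run_pipeline`, the step names and the in-memory `checkpoint` dict are all hypothetical, and a real pipeline would persist checkpoints to disk or a database.

```python
def run_pipeline(steps, data, checkpoint):
    """Run (name, function) steps in order, skipping completed ones.

    `checkpoint` maps a step name to that step's output, so a rerun
    resumes at the first step that has not finished.
    """
    for name, func in steps:
        if name in checkpoint:
            data = checkpoint[name]  # reuse the saved output, do not rerun
            continue
        data = func(data)            # may raise; earlier checkpoints survive
        checkpoint[name] = data
    return data


# Hypothetical steps: the transform step fails once, then succeeds.
attempts = {"transform": 0}

def clean(rows):
    return [r.strip() for r in rows]

def transform(rows):
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("temporary failure")
    return [r.upper() for r in rows]

steps = [("clean", clean), ("transform", transform)]
checkpoint = {}
try:
    run_pipeline(steps, [" a ", " b "], checkpoint)
except RuntimeError:
    pass  # "clean" is already checkpointed at this point

result = run_pipeline(steps, [" a ", " b "], checkpoint)
```

On the second call only the failed step runs again, because the output of the first step is read back from the checkpoint.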

Automated web research. Searching multiple sources, comparing information, synthesising. Occasional imprecision is acceptable because the human can review the output before using it.

Software development assistance. Code agents (Cursor, Devin, SWE-agent) work relatively well for well-defined tasks: “add unit tests to this function,” “fix this bug,” “refactor this module to meet this standard.” They are less reliable for architecture design tasks.

Monitoring and alerts. The agent that periodically reviews a data source, detects specified conditions and sends an alert works well because the task is simple and repeatable.

When agents fail

Ambiguous or poorly defined objectives. “Improve our sales process” is an objective that an agent cannot address usefully. It does not have enough context about what is failing, what resources are available or what constraints exist. It will start doing something, but probably not the right thing.

Environments with many external dependencies. If the agent depends on ten different APIs and any of them can fail or change behaviour, the reliability of the entire system is roughly the product of the individual reliabilities (assuming failures are independent and every call must succeed), and it collapses quickly.
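To make the arithmetic concrete, here is a small sketch. It assumes that failures are independent and that every dependency must succeed for the task to succeed; the reliability figures are illustrative, not measurements.

```python
import math

def system_reliability(reliabilities):
    """Probability that every dependency succeeds, assuming independent failures."""
    return math.prod(reliabilities)

# Ten APIs that each succeed 99% of the time, then ten at 95%.
print(f"{system_reliability([0.99] * 10):.3f}")  # 0.904
print(f"{system_reliability([0.95] * 10):.3f}")  # 0.599
```

Ten dependencies that each look dependable on their own still leave the whole system failing roughly one run in ten at 99%, and four runs in ten at 95%.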

Tasks requiring deep contextual judgement. Negotiating with a supplier, deciding whether to hire a candidate, managing a communications crisis. These tasks require understanding of organisational context, existing relationships and human nuances that the agent does not have.

When the cost of error is high. An agent that can delete files, send emails or execute financial transactions has a very different risk profile from one that only reads and analyses, because its errors can be irreversible.

Long tasks with many steps. The more steps a task has, the greater the cumulative probability that some step will fail or produce an unexpected result that derails the rest.
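One way to turn this into a planning number is to ask how many steps a task can have before the end-to-end success rate drops below a target. The sketch below assumes every step succeeds independently with the same probability; `max_steps` is a hypothetical helper and the figures are illustrative.

```python
def max_steps(per_step_success, target):
    """Largest number of steps whose combined success rate stays >= target."""
    if not 0 < per_step_success < 1 or not 0 < target <= 1:
        raise ValueError("per_step_success must be in (0, 1) and target in (0, 1]")
    steps, probability = 0, 1.0
    while probability * per_step_success >= target:
        probability *= per_step_success
        steps += 1
    return steps

print(max_steps(0.98, 0.80))  # 11
print(max_steps(0.95, 0.50))  # 13
```

Even with a 98% success rate per step, a task longer than eleven steps falls below an 80% chance of finishing cleanly under these assumptions.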

Designing for oversight

The most reliable agent design is not the one that fails least: it is the one that fails in a manageable way. That means building oversight and checkpoints from the start.

Principle of least privilege. The agent should only have access to the tools it needs to complete the task. Do not give it write access if it only needs to read. Do not give it access to production if it can do its work in a test environment.
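A minimal sketch of least privilege at the tool level, assuming tools are plain Python callables. `ToolRegistry` and the tool names are hypothetical and do not come from any particular agent framework.

```python
class ToolRegistry:
    """Exposes only the tools explicitly granted to an agent."""

    def __init__(self, tools, allowed):
        unknown = set(allowed) - set(tools)
        if unknown:
            raise ValueError(f"unknown tools granted: {sorted(unknown)}")
        # Keep references only to the granted tools.
        self._tools = {name: tools[name] for name in allowed}

    def call(self, name, *args, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"tool {name!r} is not granted to this agent")
        return self._tools[name](*args, **kwargs)


# Hypothetical tools: this agent only needs to read.
all_tools = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: f"deleted {path}",
}
registry = ToolRegistry(all_tools, allowed=["read_file"])
```

With this setup, `registry.call("read_file", "notes.txt")` works, while `registry.call("delete_file", "notes.txt")` raises `PermissionError` whatever the agent decides to attempt.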

Human-in-the-loop at critical points. Design explicit points where the agent pauses and waits for human confirmation before continuing. “I have identified these three files for deletion. Do you confirm?” is much safer than deleting automatically.
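The confirmation point described above can be sketched as a gate in front of the destructive action. The names `delete_with_confirmation`, `confirm` and `delete` are hypothetical; `confirm` is any callable that shows a prompt to a human and returns True or False.

```python
def delete_with_confirmation(paths, confirm, delete):
    """Delete `paths` only after `confirm` approves; return what was deleted."""
    if not paths:
        return []
    names = ", ".join(paths)
    prompt = f"I have identified these {len(paths)} files for deletion: {names}. Do you confirm?"
    if not confirm(prompt):
        return []  # nothing is touched without explicit approval
    for path in paths:
        delete(path)
    return list(paths)


# Example with stand-ins for the human and the file system.
deleted = []
declined = delete_with_confirmation(["a.tmp", "b.tmp"], lambda prompt: False, deleted.append)
approved = delete_with_confirmation(["a.tmp", "b.tmp"], lambda prompt: True, deleted.append)
```

In an interactive session, `confirm` could be something like `lambda prompt: input(prompt + " [y/N] ").strip().lower() == "y"`, so that the default answer is to do nothing.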

Step limits and budget. Define a maximum number of steps or an API call budget. An agent that has taken 50 steps on a task that should take 10 is probably stuck — better to stop it than let it continue indefinitely.
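A step budget can be as simple as a bounded loop. In this sketch `run_agent`, `BudgetExceeded` and the `step` callable are hypothetical: `step` stands in for one iteration of the agent and returns None until it has a final result.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the agent uses up its step budget without finishing."""


def run_agent(step, max_steps=10):
    """Call `step(i)` until it returns a non-None result or the budget runs out."""
    for i in range(1, max_steps + 1):
        result = step(i)
        if result is not None:
            return result, i
    raise BudgetExceeded(f"no result after {max_steps} steps")


# A step function that finishes on its third iteration.
outcome = run_agent(lambda i: "done" if i == 3 else None)
```

Raising an exception, rather than returning a partial result, forces the caller to handle a stuck agent explicitly. The same pattern extends to a budget of API calls or tokens.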

Logging of all actions. Every decision and every action of the agent must be recorded so you can reconstruct what happened and why. Without logging, debugging failures is almost impossible.
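One possible shape for such a log is one JSON line per action, written with the Python standard library `logging` module. The field names and the `log_action` helper are assumptions made for illustration, not a standard schema.

```python
import json
import logging
import time

logger = logging.getLogger("agent.actions")

def log_action(step, tool, args, result, reason):
    """Record one agent action as a JSON line and return that line."""
    record = {
        "ts": time.time(),       # when the action happened
        "step": step,            # position in the run
        "tool": tool,            # what the agent called
        "args": args,            # with which arguments
        "result": repr(result),  # what came back
        "reason": reason,        # why the agent chose this action
    }
    line = json.dumps(record, default=str)
    logger.info(line)
    return line


line = log_action(1, "search", {"query": "pricing"}, ["doc-1"], "need sources")
```

Recording the reason next to the action is what makes it possible to reconstruct why something happened, not only what happened.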

Testing with edge cases. Before deploying an agent in production, test it with malformed inputs, failing APIs, unexpected results. Failures in testing are lessons; failures in production are expensive.
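As a sketch of what such tests might look like, the example below wraps a tool call so that a failing API or a malformed response becomes a reported error instead of a crash, then exercises those cases. `call_tool` and `parse_tool_output` are hypothetical helpers, and the sketch assumes tools return JSON objects as strings.

```python
import json

def parse_tool_output(raw):
    """Return (ok, value); malformed or non-object JSON is reported, not raised."""
    try:
        value = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        return False, f"malformed output: {exc}"
    if not isinstance(value, dict):
        return False, "expected a JSON object"
    return True, value

def call_tool(tool, *args):
    """Call a tool and turn any exception into a reported failure."""
    try:
        raw = tool(*args)
    except Exception as exc:
        return False, f"tool failed: {exc}"
    return parse_tool_output(raw)

def test_edge_cases():
    def failing_api():
        raise TimeoutError("upstream timed out")

    assert call_tool(failing_api) == (False, "tool failed: upstream timed out")
    assert call_tool(lambda: "{not json")[0] is False   # malformed JSON
    assert call_tool(lambda: None)[0] is False           # no output at all
    assert call_tool(lambda: "[1, 2]") == (False, "expected a JSON object")
    assert call_tool(lambda: '{"status": "ok"}') == (True, {"status": "ok"})

test_edge_cases()
```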

The near future: multi-agent systems

The fastest-growing area in the agent ecosystem is systems where multiple agents collaborate, each with a specialised role.

Instead of a single agent trying to do everything, a multi-agent system can have: a researcher agent that collects information, an analyst agent that processes it, a writer agent that produces the output, and a reviewer agent that verifies quality before delivering to the human.
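The four roles can be sketched as a simple pipeline with a review loop. This is an illustration under strong assumptions: each agent is modelled as a plain callable, and `run_team` and its parameters are hypothetical rather than the API of any framework.

```python
def run_team(topic, researcher, analyst, writer, reviewer, max_attempts=3):
    """Researcher -> analyst -> writer, with a reviewer gating the output."""
    findings = analyst(researcher(topic))
    feedback = None
    for _ in range(max_attempts):
        draft = writer(findings, feedback)
        approved, feedback = reviewer(draft)
        if approved:
            return draft
    raise RuntimeError(f"reviewer rejected all {max_attempts} drafts")


# Stand-in agents: the reviewer asks for a summary line once.
def writer(findings, feedback):
    text = f"Report: {findings}"
    return text + " Summary included." if feedback else text

def reviewer(draft):
    return ("Summary" in draft, "add a summary")

report = run_team(
    "pricing",
    researcher=lambda topic: [f"note on {topic}"],
    analyst=lambda notes: f"{len(notes)} source(s) analysed",
    writer=writer,
    reviewer=reviewer,
)
```

Capping the number of review rounds keeps the coordination loop from running indefinitely when the writer and reviewer never converge.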

The advantage is specialisation: each agent can be optimised for its specific task. The disadvantage is complexity: coordination between agents introduces new failure points and makes the system harder to debug.

Frameworks like AutoGen (Microsoft) and CrewAI, along with models such as Claude and its tool-use capabilities, are making these systems more accessible to build. But maturity is still limited: most multi-agent systems work well in demos and fail with some frequency in production.

The practical recommendation for 2025–2026: start with simple, single-role agents. Measure reliability before increasing complexity. The value of agents lies in automating tedious work with adequate human oversight, not in total autonomy.
