"AI agents" has become the kind of phrase you say at a conference to sound current. Strip the branding off and the actual thing is simple: you take an LLM, you give it some tools, you put it in a loop, and it keeps going until it either finishes or hits a stop condition. That is the whole architecture. The reason it feels magical is that the tools can be anything (APIs, shell commands, other LLMs) and the loop can run for a long time. The reason it often disappoints is that LLMs are still pattern matchers, and long loops give pattern-matching failures more chances to compound.

We have shipped agents in production, we have watched them fail in expensive ways, and we have a pretty clear picture of where they earn their keep versus where they are a worse version of a hardcoded workflow. Here is the honest account.

What an agent actually is

Three components, none of them complicated.

  1. An LLM. Claude 4.7, GPT-5.1, Gemini 3, whatever. The reasoning engine.
  2. A set of tools. Functions the LLM can call. Reading files, hitting APIs, running shell commands, sending emails, searching the web.
  3. A loop. The LLM picks a tool, the tool runs, the output goes back into context, the LLM picks the next tool, repeat until done.

That is it. The bells and whistles (planners, memory, multi-agent orchestration) are all optional layers on top of this core. Most agents that work in production are this simple core plus careful tool design.
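The three components fit in a short sketch. This is a minimal illustration, not any framework's API: `call_llm` and the lone `read_file` tool are stubs standing in for a real model call and real tools.

```python
from typing import Callable

def call_llm(context: list[dict]) -> dict:
    # Stub for the reasoning engine. A real implementation calls a model
    # API and parses its tool-call output; this stub finishes as soon as
    # one tool result appears in context.
    if any(m["role"] == "tool" for m in context):
        return {"action": "done", "result": "task complete"}
    return {"action": "tool", "name": "read_file", "args": {"path": "notes.txt"}}

TOOLS: dict[str, Callable] = {
    "read_file": lambda path: f"contents of {path}",  # stand-in tool
}

def run_agent(task: str, max_steps: int = 10) -> str:
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # the loop, bounded so it cannot run forever
        decision = call_llm(context)
        if decision["action"] == "done":
            return decision["result"]
        output = TOOLS[decision["name"]](**decision["args"])
        context.append({"role": "tool", "content": output})  # result feeds back in
    raise RuntimeError("hit step limit without finishing")
```

Everything that follows is variations on that one function: which tools go in `TOOLS`, what the stop condition is, and how tight the bound on the loop is.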

Where agents genuinely work

Agents shine on scoped, repetitive tasks with clear success criteria and bounded action space. The magic word is scoped. If you can describe success in one sentence and the agent has maybe a dozen tools, you are in the sweet spot.

Customer support triage

Read an incoming ticket, classify it, fetch the relevant order history, draft a response, and either send it or escalate. This works because the action space is small, the failure modes are recoverable, and a human reviews anything that gets escalated. We run this shape of agent against our publisher support queue at Airfind: roughly 3,000 tickets a day, 89% auto-resolution rate, near-zero customer complaints. The LLM is not perfect, but the workflow shape absorbs its mistakes.
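The shape, reduced to its decision logic. The function names, categories, and confidence threshold here are hypothetical, not the actual Airfind system; the point is the escalation path.

```python
def classify_ticket(text: str) -> tuple[str, float]:
    # Stand-in for an LLM classification call returning (label, confidence).
    if "refund" in text.lower():
        return ("billing", 0.95)
    return ("other", 0.40)

def triage(text: str, confidence_floor: float = 0.8) -> str:
    label, confidence = classify_ticket(text)
    # The workflow shape absorbs model mistakes: anything the model is
    # unsure about goes to a human instead of being auto-resolved.
    if confidence < confidence_floor:
        return "escalate_to_human"
    return f"auto_resolve:{label}"
```

The interesting design decision is the floor, not the classifier: tune it against your escalation queue, not against classifier accuracy in isolation.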

Test generation

Point an agent at a function, tell it to write tests, let it run them and fix them until they pass. This is a beautifully scoped task: the success criterion is objective (tests pass and cover the code), the action space is just read and write, and the failure cost is small (you read the tests before merging).

Scheduled investigation workflows

Every morning at 8am, an agent runs through: check yesterday's revenue from the data warehouse, compare to a 7-day rolling average, and if any line of business is down more than 10%, investigate by pulling breakdowns and filing a draft Slack message for a human to review. Scoped, repetitive, clear success. Works great.
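The decision logic of that morning run, with the warehouse query replaced by a plain dict. Line-of-business names are made up; the 7-day average and 10% threshold are the ones from the workflow above.

```python
def lines_to_investigate(history: dict[str, list[float]],
                         threshold: float = 0.10) -> list[str]:
    flagged = []
    for line, revenue in history.items():
        *prior, yesterday = revenue               # last value is yesterday's
        rolling = sum(prior[-7:]) / len(prior[-7:])  # 7-day rolling average
        if yesterday < rolling * (1 - threshold):    # down more than 10%
            flagged.append(line)
    return flagged
```

Only the flagged lines trigger the expensive part (breakdowns and a drafted Slack message); the cheap arithmetic gate runs every day.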

Where agents fall apart, predictably

The failure modes are as predictable as the successes. All of them are different faces of the same underlying issue: LLMs do not know when they do not know, and long loops amplify that.

Open-ended goals

"Grow my Twitter following" is not a scoped task. Neither is "make my codebase better." Agents given vague goals will find something to do, but what they do will rarely be what you wanted. They are optimizers without a clearly defined objective, which means they optimize for whatever the LLM has been trained to associate with the words in your prompt. This is almost never what you actually care about.

Error recovery from unexpected state

Agents handle expected errors well. They handle unexpected state badly. The classic failure: the API returns a new error code the LLM has not seen before, or a file is in a format the tool does not recognize, or a database has two rows where the agent expected one. A human would notice something is off and ask. Agents double down, try harder, and make it worse.

Anything requiring real judgment under uncertainty

"Should we approve this loan?" "Is this a bug or intended behavior?" "Is this customer being abusive or frustrated?" Agents have opinions on all of these, and their opinions are often defensible-looking, and they are often wrong in ways that are hard to audit. Do not ship agents into decisions that a reasonable person would hesitate on.

Agents are great at the tasks a bright intern can do in an afternoon. They are terrible at the tasks a bright intern would need to ask three follow-up questions about.

A real example that works: automated PR triage

We run an agent that reviews every incoming pull request on one of our repos. Its job is bounded:

Tools available:
- get_pr_diff(pr_number)
- get_files_changed(pr_number)
- read_file(path)
- post_review_comment(pr_number, comment)
- request_changes(pr_number, reason)
- approve_and_merge(pr_number)  // only for trivial changes

For each PR:
1. If the diff is only docs or formatting -> approve_and_merge
2. If the diff touches code -> summarize in a review comment,
   flag any obvious issues (unused imports, missing tests for
   new functions, broken type signatures)
3. Never request_changes unless confidence is high
4. When unsure, leave a comment tagging a human reviewer.

This works because the action space is small, the success criterion is observable (does it catch real issues without being annoying), and the escape hatch (tag a human) is always available. The agent runs on every PR. It catches about 40% of the issues real reviewers would have caught, and it does it in 90 seconds instead of 4 hours. Human reviewers still review. The agent is not a replacement, it is a first pass.
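The first routing rule is the kind of check worth hardcoding rather than leaving to the model. A sketch, with the docs-or-formatting test simplified to file suffixes; the suffix list and dispatch are illustrative, though the tool names match the list above.

```python
DOCS_SUFFIXES = (".md", ".rst", ".txt")

def route_pr(files_changed: list[str]) -> str:
    # Rule 1: docs-only diffs are safe to approve_and_merge.
    if all(f.endswith(DOCS_SUFFIXES) for f in files_changed):
        return "approve_and_merge"
    # Rule 2: anything touching code gets a summarizing first-pass review.
    return "summarize_and_review"
```

Putting this gate outside the LLM means the one genuinely dangerous tool (`approve_and_merge`) is reachable only through deterministic code.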

A real example that fell apart: the "autonomous research assistant"

We built an agent whose job was to produce weekly competitive analysis. Given a list of competitors, it would browse their sites, check their changelogs, read their blog posts, and summarize what had changed. It had tools to fetch pages, search the web, and write a Markdown report to a shared drive.

It worked great for two weeks. Then we started seeing strange behavior. The reports started including "analysis" of features that did not exist. The agent would occasionally invent a product announcement based on ambiguous language on a landing page. One week it confidently reported that a competitor had raised a Series C, based on reading a job posting that mentioned "our recent funding."

The postmortem was sobering. The agent had no ground truth. Nobody was reviewing the reports closely because they looked competent. Over time the agent had learned (through self-prompting in the loop) to produce confident-sounding text even when its sources were thin. We killed it. We now have humans doing the same task with LLM-assisted drafting. It takes three hours a week. It is correct.

The lesson: if your agent's output is confident-sounding text that nobody else checks, you do not have an agent, you have a hallucination factory with a cron job.

Design rules that matter

  • Small action space. Fewer tools, better descriptions, clear when-not-to-use rules.
  • Observable success criterion. Tests pass. Ticket is closed. Data matches a schema. Not "the report is good."
  • Bounded loop. Max iterations. Hard timeout. Cost budget per run. Agents that can run forever, do.
  • Escape hatches. "Tag a human" should always be an option. "Stop and ask" should always be an option.
  • Audit trails. Log every tool call and every decision. When it fails, you need to reconstruct why.

Multi-agent orchestration: mostly overkill

There is a lot of writing in 2026 about multi-agent systems, where a "planner" agent delegates to specialized "worker" agents. Frameworks like LangGraph, CrewAI, and the newer AutoGen 2 all lean into this pattern. Our experience has been: it is almost always worse than a single well-designed agent with a good set of tools.

Every handoff between agents is a chance to lose context. Every additional LLM call is additional latency and cost. In the rare cases where multi-agent genuinely helped, it was because the sub-agents had access to genuinely different tools or data. Most of the time the sub-agents were just "the same LLM with a different system prompt," which is a prompting decision, not an architecture. Build one agent. Give it good tools. Revisit multi-agent only when you hit a concrete limit.

Cost and latency, which nobody wants to talk about

A non-trivial agent task is 10 to 50 LLM calls. At Claude 4.7 pricing in April 2026 ($3 per million input tokens, $15 per million output), a call with 40k input and 8k output tokens costs about $0.24, so even a short five-call run comes out to around $1.20, and longer runs scale linearly. For a thousand-run-a-day workflow, that is $36,000 a month in inference costs at minimum. Real money.

This is where design discipline pays off. Using a smaller model for the classification and decision steps (Claude Haiku at ~$0.80 per million input), and reserving the large model for the actual content generation step, can cut costs by 70% with no measurable quality drop. Most teams we have talked to under-use this pattern because they default to the biggest model for everything. Do not.
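The arithmetic behind that claim, input tokens only for simplicity. The 8-cheap-to-2-expensive split is illustrative; real savings depend on your token mix and on output-token pricing, which this sketch ignores.

```python
# Prices quoted above, per million input tokens.
PRICE_PER_M_INPUT = {"haiku": 0.80, "claude-4.7": 3.00}

def run_cost(steps: list[tuple[str, int]]) -> float:
    """Input-token cost of a run, given (model, input_tokens) per step."""
    return sum(PRICE_PER_M_INPUT[m] * toks / 1_000_000 for m, toks in steps)

# Same 10-step run, 40k input tokens per step:
all_big = [("claude-4.7", 40_000)] * 10
routed  = [("haiku", 40_000)] * 8 + [("claude-4.7", 40_000)] * 2
# all_big costs $1.20; routed costs roughly $0.50 for the same work.
```

The routing itself is one deterministic line per step: classification and decision steps go to the small model, content generation goes to the big one.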

Bottom line

Agents are one of the most useful patterns we have added to our stack in 2026. They are also the feature most likely to embarrass you in production. The teams getting the most out of them are the ones treating agents as a specific tool for a specific shape of problem, not as a general-purpose way to "make things more AI." If you have a scoped, repetitive task with a clear success criterion and a bounded action space, build an agent. If you have a fuzzy goal and a vague hope, do not. You will save yourself a lot of sobering postmortems.