AI Operations

Do not trust AI after one good demo

AI agents need operating discipline after deployment. Logs, traces, evals, gold sets, human review, rollback paths, and feedback loops turn one lucky success into repeatable quality.

2026.06.03 · 11 min read · Non-technical founders, operators, and team leads running AI agents after deployment


A good demo proves that an AI agent can succeed once. It does not prove that the agent will keep succeeding after new customers, new tools, new policies, model updates, and messy edge cases arrive. Once an agent touches real work, trust has to come from logs, evals, and a feedback loop.

1. Overview: one success is not production trust

AI agent demos often look convincing because the final output is polished. The agent searches, drafts, calls a tool, and finishes with a confident answer. But production quality is not the same thing as a good first impression.

Anthropic explains that agents are harder to evaluate because they operate across many turns, call tools, modify state, and adapt as they go. A final answer can look correct while the agent used the wrong source, made unnecessary tool calls, skipped a policy check, or solved the wrong problem in a plausible way.

OpenAI's tracing docs make the same operating point from the infrastructure side: an agent run should have a comprehensive record of model generations, tool calls, handoffs, guardrails, and custom events. In plain business terms, the company needs receipts.

2. Small dictionary: logs, traces, evals, gold set

A log is the receipt. It records what happened: the input, output, tool call, approval, error, or change. Without a log, the team is arguing from memory.

A trace is the route map. If a log is one receipt, a trace shows the full path of one job from start to finish: first the agent read a ticket, then searched a policy, then drafted a reply, then asked for approval. OpenTelemetry describes traces as the path of a request through multiple services; for AI agents, the same idea helps you see the path through model calls, tools, and human checkpoints.

An eval is a test for the AI system. Instead of saying "the agent feels better," the team gives the agent a fixed task and grades whether it succeeded. Anthropic defines evals this way and emphasizes that agent evals should look at tasks, environments, outcomes, and transcripts, not only final prose.

A gold set is the reference problem set. It is a small bank of real examples the team cares about: hard refund requests, confusing customer questions, edge-case CRM updates, risky tool calls, and past failures. If the agent gets worse on the gold set, the team should not ship the change.

Observability: the ability to see what is happening inside the system through logs, traces, metrics, and dashboards.
Metric: a number you track, such as success rate, human edit rate, tool-call count, latency, cost, rollback rate, or policy-violation rate.
LLM-as-a-judge: using another AI model to grade an AI output. Useful, but it needs calibration and should not be the only source of truth.
Rollback: the ability to undo a change or return to the previous prompt, SOP, tool permission, model, or record state.

3. Logs and traces: show the work, not just the answer

The most dangerous agent errors are often invisible in the final answer. The agent may answer politely while reading an outdated SOP, calling the same tool five times, using a private note it should not have accessed, or skipping a required approval step.

That is why a useful trace should show inputs, retrieved sources, tool calls, changed records, approvals, errors, retries, cost, latency, and final output. This is not only for engineers. A non-technical operator can still read a trace as a simple timeline: what did the AI look at, what did it do, and where did a person step in?
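To make the "simple timeline" idea concrete, here is a minimal sketch in Python. The TraceStep class, its field names, and the timeline function are hypothetical names chosen for illustration; they do not come from any tracing product. The sketch only shows that a trace can be rendered as ordered, readable lines that say what the agent did, why, and whether a person approved the step.

```python
from dataclasses import dataclass


@dataclass
class TraceStep:
    # Hypothetical structure: one step of one agent run.
    order: int
    action: str
    reason: str
    human_approved: bool = False


def timeline(steps: list[TraceStep]) -> list[str]:
    """Render trace steps as readable lines, sorted by step order."""
    lines = []
    for step in sorted(steps, key=lambda s: s.order):
        actor = "human-approved" if step.human_approved else "automatic"
        lines.append(f"{step.order}. {step.action} ({actor}): {step.reason}")
    return lines


if __name__ == "__main__":
    steps = [
        TraceStep(2, "Searched refund policy", "needed the current refund rule"),
        TraceStep(1, "Read ticket", "customer asked about a refund"),
        TraceStep(3, "Drafted reply", "policy allows a refund within 30 days"),
        TraceStep(4, "Sent reply", "reviewer signed off", human_approved=True),
    ]
    for line in timeline(steps):
        print(line)
```

An operator does not need to read the code. The point is the output: one line per step, in order, with the human checkpoints marked.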

OpenAI also notes that traces can include sensitive data unless configured carefully. So logging is not "store everything forever." The company needs redaction, retention rules, and access controls for the logs themselves.

Minimum log: user request, workflow name, agent version, sources used, tool calls, changed fields, human approval, final output, error state, and timestamp.
Minimum trace: each step in order, with the reason for the step, tool input/output summary, and whether the step was automatic or human-approved.
Minimum privacy rule: do not store secrets, raw credentials, private notes, or unnecessary personal data inside the trace.
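The minimum log and the privacy rule above can be sketched as one small record type. This is an illustration under stated assumptions, not a standard format: the RunLog class, the redact helper, and the two patterns in SECRET_PATTERNS are all hypothetical, and a real deployment would need its own, much more careful redaction list.

```python
import re
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

# Hypothetical patterns for illustration only; real redaction needs a reviewed list.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|password|token)\s*[:=]\s*\S+"),
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email addresses
]


def redact(text: str) -> str:
    """Replace anything matching a secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


@dataclass
class RunLog:
    # Fields follow the "minimum log" list in the text.
    user_request: str
    workflow_name: str
    agent_version: str
    sources_used: list[str]
    tool_calls: list[str]
    changed_fields: list[str]
    human_approval: bool
    final_output: str
    error_state: Optional[str] = None
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_record(self) -> dict:
        """Return a dict for storage with free-text fields redacted."""
        record = asdict(self)
        record["user_request"] = redact(self.user_request)
        record["final_output"] = redact(self.final_output)
        record["tool_calls"] = [redact(call) for call in self.tool_calls]
        return record
```

Redacting at the moment the record is built, rather than later, means the raw secret never reaches storage. Retention rules and access controls still have to be handled by whatever system stores the records.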

4. Evals: turn complaints into repeatable tests

When a user says "the agent got worse," the worst response is to guess. A better response is to turn the failure into a test case. What input caused the problem? What should the agent have done? What source should it have used? What action should it have avoided?

Anthropic says that teams without evals get stuck in reactive loops: fixing one failure, creating another, and being unable to distinguish a real regression from noise. Their guidance is practical for small teams too: start with 20 to 50 simple tasks drawn from real failures, not a giant benchmark nobody maintains.

For Guildex-style operations, evals should measure both the final answer and the process. Did the agent resolve the task? Did it use the right policy? Did it avoid a risky tool call? Did it ask for approval? Did the human need to rewrite the output?

Outcome eval: did the workflow produce the correct business result?
Process eval: did the agent use the right sources, tools, order, and approval gate?
Cost eval: did the workflow use too many tokens, too many tool calls, or too much reviewer time?
Safety eval: did the agent avoid sensitive data, forbidden actions, and overbroad execution?
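The four eval types above can be graded separately on the same run. The sketch below assumes a deliberately simple setup: EvalCase, RunResult, and grade are hypothetical names, and each check is a plain comparison rather than anything a specific eval framework provides. It shows why process matters: a run can reach the right outcome while failing the process check.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    # What the team expects for one fixed task (hypothetical fields).
    expected_result: str
    required_sources: set[str]
    forbidden_tools: set[str]
    needs_approval: bool
    max_tool_calls: int


@dataclass
class RunResult:
    # What the agent actually did on that task.
    result: str
    sources_used: set[str]
    tools_called: list[str]
    approval_requested: bool


def grade(case: EvalCase, run: RunResult) -> dict[str, bool]:
    """Grade one run on outcome, process, cost, and safety separately."""
    used_required_sources = case.required_sources <= run.sources_used
    approval_ok = run.approval_requested or not case.needs_approval
    return {
        "outcome": run.result == case.expected_result,
        "process": used_required_sources and approval_ok,
        "cost": len(run.tools_called) <= case.max_tool_calls,
        "safety": not (set(run.tools_called) & case.forbidden_tools),
    }


if __name__ == "__main__":
    case = EvalCase(
        expected_result="refund_approved",
        required_sources={"refund_policy_v3"},
        forbidden_tools={"delete_customer"},
        needs_approval=True,
        max_tool_calls=3,
    )
    run = RunResult(
        result="refund_approved",
        sources_used={"refund_policy_v2"},  # outdated policy
        tools_called=["lookup_order", "issue_refund"],
        approval_requested=True,
    )
    print(grade(case, run))
```

In this example the outcome check passes but the process check fails, because the agent read an outdated policy. An output-only eval would have scored the run as a success.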

5. Gold set and scorecard: make quality visible

A gold set does not need to be large at the beginning. It needs to be representative. Ten easy cases and ten hard cases are more useful than a hundred vague examples. The point is to freeze the cases so the team can compare changes over time.

The scorecard should be simple enough that a business owner can read it. A good agent scorecard tracks task success, source accuracy, policy compliance, human edit rate, approval correctness, rollback rate, cost per completed task, and customer-facing risk.

This turns AI improvement into an operating conversation. Instead of "the new prompt feels better," the team can say: success rate improved, but tool calls doubled and human edits did not fall. That means the workflow still has review debt.

Gold set inputs: real past tasks, anonymized customer requests, known edge cases, failed runs, and approval-sensitive scenarios.
Scorecard outputs: pass/fail, reason for failure, human edit notes, source used, tool calls, cost, latency, and rollback decision.
Review cadence: read a sample weekly, rerun the gold set before major changes, and add new failures monthly.
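A scorecard comparison like the one described above can be computed from per-case results on the frozen gold set. This is a minimal sketch: CaseResult, scorecard, and regressions are hypothetical names, and only three of the scorecard metrics are shown to keep it short.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Result of one gold-set case for one version of the agent.
    case_id: str
    passed: bool
    tool_calls: int
    human_edited: bool


def scorecard(results: list[CaseResult]) -> dict[str, float]:
    """Summarize a gold-set run as a few readable numbers."""
    n = len(results)
    if n == 0:
        raise ValueError("gold set run has no results")
    return {
        "success_rate": sum(r.passed for r in results) / n,
        "avg_tool_calls": sum(r.tool_calls for r in results) / n,
        "human_edit_rate": sum(r.human_edited for r in results) / n,
    }


def regressions(baseline: list[CaseResult], candidate: list[CaseResult]) -> list[str]:
    """Return IDs of cases that passed before the change and fail after it."""
    passed_before = {r.case_id for r in baseline if r.passed}
    return sorted(
        r.case_id for r in candidate if r.case_id in passed_before and not r.passed
    )


if __name__ == "__main__":
    baseline = [
        CaseResult("g1", True, 2, True),
        CaseResult("g2", False, 3, True),
        CaseResult("g3", True, 2, False),
        CaseResult("g4", True, 1, False),
    ]
    candidate = [
        CaseResult("g1", True, 4, True),
        CaseResult("g2", True, 6, True),
        CaseResult("g3", True, 4, False),
        CaseResult("g4", True, 2, False),
    ]
    print("before:", scorecard(baseline))
    print("after: ", scorecard(candidate))
    print("regressions:", regressions(baseline, candidate))
```

With these illustrative numbers, success rate rises from 0.75 to 1.0 with no regressions, but average tool calls double from 2.0 to 4.0 and the human edit rate stays at 0.5. That is the kind of mixed result a single "feels better" judgment would hide.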

6. LLM-as-a-judge: useful, but not a replacement for judgment

LLM-as-a-judge is useful because it can grade open-ended output at scale. It can check whether a reply follows instructions, whether a summary missed a key point, or whether the tone violates a rule.

But it has limits. The MT-Bench and Chatbot Arena paper found that LLM judges can suffer from position bias, verbosity bias, self-enhancement bias, and limited reasoning ability. So a model judge should not become an invisible manager that everyone obeys.

A safer pattern is mixed evaluation: deterministic checks where possible, LLM judges for fuzzy criteria, and periodic human calibration. In plain language, use AI to help grade the work, but keep a human-owned rubric and a human-owned escalation path.

Good LLM judge use: compare two drafts, flag missing policy sections, check tone, classify failure type, summarize human edits.
Bad LLM judge use: approve payments, certify legal compliance, override human review, or grade high-risk actions with no calibration.
Calibration rule: regularly compare the judge against human reviewers and update the rubric when disagreement appears.
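The calibration rule above, and one simple guard against position bias, can be sketched as follows. Both functions are hypothetical illustrations: agreement compares pass/fail labels from a human reviewer and a model judge on the same cases, and consistent_preference asks a judge to compare two drafts twice with the order swapped, accepting a winner only when both orderings agree. The judge itself is passed in as an ordinary function, so no particular model API is assumed.

```python
from typing import Callable


def agreement(human: dict[str, bool], judge: dict[str, bool]) -> tuple[float, list[str]]:
    """Return the agreement rate and the case IDs where the judge and human differ."""
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no cases were graded by both the human and the judge")
    disagreements = [cid for cid in shared if human[cid] != judge[cid]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


def consistent_preference(
    judge_fn: Callable[[str, str], str], draft_a: str, draft_b: str
) -> str:
    """Ask the judge twice with the order swapped.

    judge_fn must return "first" or "second". The result is "a" or "b"
    only when both orderings pick the same draft; otherwise it is "tie".
    """
    forward = judge_fn(draft_a, draft_b)
    backward = judge_fn(draft_b, draft_a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"


if __name__ == "__main__":
    human = {"c1": True, "c2": False, "c3": True, "c4": True}
    judge = {"c1": True, "c2": True, "c3": True, "c4": True}
    print(agreement(human, judge))

    # A stand-in judge that always prefers whatever it sees first.
    always_first = lambda first, second: "first"
    print(consistent_preference(always_first, "draft A", "draft B"))
```

The disagreement list is the useful part for calibration: those are the cases a human should reread before deciding whether the rubric or the judge needs to change. What agreement rate counts as acceptable is a team decision and is not defined here.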

7. Feedback loop: corrections must become system changes

A feedback loop is how the agent improves without relying on memory or luck. The operator reviews a run, labels the failure, updates the SOP or prompt, reruns the gold set, and only then ships the change.

The x-inbox-router signals point in the same direction. SkillOpt frames agent skills as trainable external procedures. Vibe training argues that production agents need evaluators that capture domain-specific failures. OpenHarness presents permissions, hooks, memory, tasks, and observability as the infrastructure around the model. Treat these as social signals, not final proof, but the pattern is consistent: the value is in the operating loop around the model.

The key is bounded improvement. Do not let the agent rewrite its own rules freely. Human corrections should become small, named changes to SOPs, prompts, skills, or tool permissions, then those changes should be tested.

Run happens: the agent completes or fails a task.
Trace is reviewed: the team sees sources, tool calls, approvals, errors, and output.
Failure is labeled: source error, policy miss, tool misuse, tone drift, cost spike, or approval miss.
System is updated: SOP, prompt, skill, permission, rubric, or gold set changes.
Eval is rerun: the team checks that the fix helped without breaking old cases.
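The loop above can be sketched as two small steps: a labeled failure becomes a new gold-set case, and a change ships only if the new case passes and no previously passing case breaks. The names FailureLabel, GoldCase, add_failure, and safe_to_ship are hypothetical; the label values mirror the failure types listed in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureLabel(Enum):
    SOURCE_ERROR = "source error"
    POLICY_MISS = "policy miss"
    TOOL_MISUSE = "tool misuse"
    TONE_DRIFT = "tone drift"
    COST_SPIKE = "cost spike"
    APPROVAL_MISS = "approval miss"


@dataclass(frozen=True)
class GoldCase:
    case_id: str
    task: str
    expected: str
    label: Optional[FailureLabel] = None


def add_failure(
    gold_set: list[GoldCase], case_id: str, task: str, expected: str, label: FailureLabel
) -> list[GoldCase]:
    """Return a new gold set that includes the labeled failure (no duplicate IDs)."""
    if any(case.case_id == case_id for case in gold_set):
        return list(gold_set)
    return gold_set + [GoldCase(case_id, task, expected, label)]


def safe_to_ship(before: dict[str, bool], after: dict[str, bool], fixed_case_id: str) -> bool:
    """True only if the fixed case now passes and no old passing case regressed."""
    fix_works = after.get(fixed_case_id, False)
    no_regressions = all(
        after.get(case_id, False) for case_id, passed in before.items() if passed
    )
    return fix_works and no_regressions


if __name__ == "__main__":
    gold = [GoldCase("g1", "Standard refund", "refund_approved")]
    gold = add_failure(
        gold, "g2", "Refund after 45 days", "escalate_to_human", FailureLabel.POLICY_MISS
    )
    before = {"g1": True}
    after = {"g1": False, "g2": True}  # the fix works but breaks an old case
    print(len(gold), safe_to_ship(before, after, "g2"))
```

The gate is intentionally strict: a fix that repairs the new failure but breaks an old case does not ship. This keeps improvement bounded in the sense the section describes.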

8. Failure patterns to avoid

The first failure is output-only evaluation. If you only look at the final answer, you miss wasted tool calls, risky retrieval, hidden policy drift, and review debt.

The second failure is dashboard theater. A dashboard with many charts is not useful unless someone reads samples, labels failures, and changes the system.

The third failure is self-improvement without boundaries. Automatic improvement sounds attractive, but if the agent changes its own instructions without a gold set and rollback path, the company may not know when quality drifted.

Do not treat one successful demo as proof of stable quality.
Do not run evals only when something breaks.
Do not let the same model be the worker, judge, approver, and rule-writer for high-risk work.
Do not collect logs that nobody reviews.
Do not store sensitive trace data without retention and access rules.

9. The Guildex checklist: operate agents like a real workflow

Before an AI agent becomes part of company operations, Guildex would ask for seven artifacts: workflow owner, permission map, log format, trace view, gold set, scorecard, and rollback rule.

The owner says who is responsible. The permission map says what the agent may read, write, and execute. The log and trace say what happened. The gold set and scorecard say whether quality improved. The rollback rule says how the company retreats when the agent gets worse.
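The seven artifacts can also be checked mechanically before a workflow goes live. This is a minimal sketch, assuming each workflow is described by a simple dictionary; the key names and the missing_artifacts function are hypothetical and are not part of any Guildex tool.

```python
# The seven artifacts named in the checklist, as hypothetical dictionary keys.
REQUIRED_ARTIFACTS = (
    "workflow_owner",
    "permission_map",
    "log_format",
    "trace_view",
    "gold_set",
    "scorecard",
    "rollback_rule",
)


def missing_artifacts(workflow: dict) -> list[str]:
    """Return the required artifacts that are absent or empty for this workflow."""
    return [name for name in REQUIRED_ARTIFACTS if not workflow.get(name)]


if __name__ == "__main__":
    workflow = {
        "workflow_owner": "support lead",
        "permission_map": {"read": ["tickets"], "write": ["draft_replies"]},
        "log_format": "run_log_v1",
        "trace_view": "",  # not set up yet
        "gold_set": ["g1", "g2"],
        "scorecard": "weekly_scorecard",
    }
    print(missing_artifacts(workflow))
```

An empty result means every artifact is at least present. It says nothing about whether each artifact is good, which still needs a human review.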

This is the difference between "AI helped once" and "AI became a reliable operating asset." The company does not need perfect automation on day one. It needs a loop that makes mistakes visible and turns corrections into better work.

References

Turn agent runs into an improvement loop

Guildex Fit Check maps the workflow owner, permission map, trace, eval set, scorecard, approval gate, and rollback rule before expanding AI automation.