Arize AI (https://arize.com/): AI Observability & LLM Evaluation Platform

How to build a better agent harness with traces and evals
https://arize.com/blog/improve-ai-agents-traces-evals-harness/
Fri, 29 May 2026 13:31:07 +0000

Agents are easy to prototype and hard to improve. A repeatable loop of traces, evals, failed-span inspection, and targeted harness changes makes agent behavior easier to debug and improve.

The post How to build a better agent harness with traces and evals appeared first on Arize AI.

AI agents are easy to prototype and hard to improve. A better prompt may help, but most durable improvement comes from the harness around the model: the tools it can call, the context it receives, the traces it emits, the evals it runs, and the review gates that decide what changes safely ship.

An agent improvement loop makes that harness better over time. Trace each run, evaluate specific spans, inspect failures, decide whether the agent or evaluator is wrong, update the prompt, rubric, tool, context, or eval, and run the loop again.

In a live demo with Aakash Gupta, Arize AI cofounder and CPO Aparna Dhinakaran showed this workflow using a PM agent for Arize Phoenix. The agent pulled GitHub issues, discussions, and releases, scored product feedback, generated a report, and then used traces and evals to understand where the system failed.

The first run was useful, but it didn’t answer the more important question: why did the agent make those decisions?

When the report is wrong, you need to know where the failure happened. Did the agent fetch the wrong data? Miss important context? Pick the wrong tool? Stop too early? Apply the wrong scoring rubric? Or did the eval judge the output incorrectly?

Most agent projects stall here. Teams can inspect the final answer, but they can’t reliably replay the agent’s trajectory, evaluate the right behavior, and turn failures into a concrete change.

A better pattern is to treat agent improvement as an engineering loop: trace the run, evaluate behavior from those traces, inspect failed spans, decide whether the agent or the eval is wrong, improve the harness, and run it again.

TL;DR: Better agents come from better harnesses

Agents are easy to prototype and hard to improve. The fix is not a better prompt by default. Trace the run, use those traces to create one targeted eval, inspect failed spans, decide whether the agent or eval is wrong, then improve the prompt, tools, context, rubric, or evaluator. Better agents come from a repeatable loop: trace, evaluate, debug, refine, and run again.

Start with an agent that does real work

Start with a task you can inspect.

In the demo, the task was a PM agent for Phoenix. It read product feedback, scored what mattered, and turned that into a priority report. The first version didn’t need every possible source of context. GitHub issues, GitHub discussions, and recent releases were more than enough.

The important thing is that your workflow has clear units of behavior:

  • Fetch recent issues, discussions, and releases
  • Score each issue or discussion by priority
  • Write a short rationale for each score
  • Generate a markdown report with top pain points, feature requests, recurring themes, shipped items, and recommended P0 to P3 priorities

That gives you something concrete to debug. If the report is wrong, you can inspect whether the agent fetched the right issues, missed an important discussion, overweighted reactions, underweighted customer impact, or introduced a recommendation that was not supported by the source data.

A useful starter prompt might look like this:


Build a PM agent for this product.

Use recent GitHub issues, GitHub discussions, and releases as input.

For each issue or discussion, score its priority based on:
- whether it is a bug or feature request
- number of comments and reactions
- recency
- customer impact
- relationship to recent releases

Generate a markdown report with:
- top pain points
- top feature requests
- recurring themes
- recommended priorities from P0 to P3

This works because the agent isn’t being abstractly asked to “understand users.” Instead, it’s being asked to collect feedback, score each item, explain the score, and synthesize the results. Each step can be traced and evaluated.

Over time, you can expand the context to Slack, Discord, Gong transcripts, product analytics, support tickets, user interviews, and social feedback. But start narrow. Get one workflow running, trace it, evaluate it, and add more context once you understand where the agent fails.

Trace every step before you write serious evals

Don’t start designing evals from a blank page. Start from traces.

For agents, the final answer is not enough. You need the path the agent took to produce it: data fetched, tool calls made, LLM calls made, intermediate outputs created, and how each step led to the final report.

In the demo, the agent pulled 40 discussions, 60 issues, and 8 releases before scoring each item and generating the report. A simplified trace might look like this:


Trace: generate_pm_report
Span: fetch_github_discussions → 40 discussions
Span: fetch_github_issues → 60 issues
Span: fetch_recent_releases → 8 releases
Span: score_issue_priority → priority score + rationale for each item
Span: synthesize_report → markdown PM report

Each step is a span, and the full run is the trace.
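
To make the span and trace vocabulary concrete, here is a minimal in-memory sketch. It is not the Phoenix or OpenInference API: `Trace`, `Span`, and `run_pm_report` are hypothetical names, and the issues and scores are placeholder data standing in for real GitHub fetches and LLM calls.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One unit of work inside a trace (a tool call, an LLM call, and so on)."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    duration_s: float = 0.0


@dataclass
class Trace:
    """The full run: an ordered list of spans under one name."""
    name: str
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        current = Span(name=name)
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record the span even if the step raised, so failures stay visible.
            current.duration_s = time.perf_counter() - start
            self.spans.append(current)


def run_pm_report(trace: Trace) -> str:
    # Placeholder data: a real agent would call GitHub and an LLM here.
    with trace.span("fetch_github_issues") as s:
        issues = [{"id": 1, "title": "Login crash"}, {"id": 2, "title": "Dark mode"}]
        s.attributes["count"] = len(issues)
    with trace.span("score_issue_priority") as s:
        scores = {issue["id"]: "P1" for issue in issues}
        s.attributes["scores"] = scores
    with trace.span("synthesize_report") as s:
        report = "\n".join(f"- #{issue_id}: {p}" for issue_id, p in scores.items())
        s.attributes["report_chars"] = len(report)
    return report


trace = Trace(name="generate_pm_report")
report = run_pm_report(trace)
span_names = [s.name for s in trace.spans]
```

After the run, `trace.spans` holds one span per step in execution order, each carrying the attributes recorded for that step. A real setup would export the same structure to a tracing backend instead of keeping it in memory.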

If a privacy issue is ranked P3, you can inspect the relevant spans. Maybe the agent never fetched the right comments. Maybe it fetched the issue but scored it incorrectly. Or maybe it scored the issue correctly but dropped it during synthesis. Without traces, you don’t know the answer and you’ll end up rewriting prompts blindly.

Use traces to pick one eval, then criticize it

Once traces are flowing, ask a more specific question: what behavior should this agent be evaluated on first?

In Aparna’s demo, Claude initially suggested report-level evals: groundedness, priority alignment, and actionability. Those are useful, but they’re coarse. Aparna pushed the eval down to the issue level: for each issue the agent scored, did it assign a reasonable priority?

That eval maps directly to the behavior that drives the report. If the agent misscores individual issues, the final report will be wrong even if it’s well written.

A priority-accuracy evaluator might inspect:

  • the issue or discussion
  • metadata like comments, reactions, recency, and labels
  • the agent’s assigned priority
  • the agent’s rationale
  • the team’s prioritization rubric

Then it returns a judgment: accurate or inaccurate, with a reason.

Here’s an example: if a bug affects active users and has multiple confirmations, your rubric may say it should never be P3. If the agent repeatedly gives those bugs low scores, the eval should catch the pattern.
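
That rubric rule can also be expressed as a code-based check. The sketch below is one hypothetical way to do it: the `ScoredItem` fields, the `min_confirmations` default of 2 (our reading of “multiple confirmations”), and the label strings are assumptions rather than anything shown in the demo.

```python
from dataclasses import dataclass


@dataclass
class ScoredItem:
    title: str
    kind: str                   # "bug" or "feature"
    affects_active_users: bool
    confirmations: int          # independent reports of the same problem
    assigned_priority: str      # what the agent chose: "P0" through "P3"


def evaluate_priority(item: ScoredItem, min_confirmations: int = 2) -> dict[str, str]:
    """Return an accurate/inaccurate label plus a reason for one scored item."""
    confirmed_bug = (
        item.kind == "bug"
        and item.affects_active_users
        and item.confirmations >= min_confirmations
    )
    if confirmed_bug and item.assigned_priority == "P3":
        return {
            "label": "inaccurate",
            "reason": "Confirmed bug affecting active users must not be P3.",
        }
    # Only one rule is encoded here; anything else passes by default.
    return {"label": "accurate", "reason": "No rubric rule was violated."}


crash = ScoredItem("Login crash", "bug", True, 3, "P3")
verdict = evaluate_priority(crash)
```

A single hard rule like this is deliberately narrow. It catches one known failure pattern cheaply, and an LLM-as-judge evaluator can cover the parts of the rubric that need judgment.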

That’s the key shift: evals should emerge from traces, not abstract guesses about what might go wrong.

The first eval is a draft. It gets you into the loop, but still needs human criticism.

A generated eval may flag useful failures. It may also misjudge behavior. Your job is to review a small set of passed and failed spans and ask: was the agent wrong, or was the eval wrong?

This matters because “priority accuracy” is product judgment. One team may prioritize enterprise customer bugs above everything else. Another may prioritize open-source adoption blockers. Another may care most about roadmap-aligned feature requests. The eval needs that judgment encoded clearly.

A good review loop should be simple: run the eval, filter for failures, inspect the relevant spans, then ask:

  • Did the agent misunderstand the issue?
  • Did the agent miss important context?
  • Did the evaluator apply the wrong rubric?
  • Did the evaluator overreact to a weak signal?
  • Did the evaluator fail to account for product strategy?

From there, revise the eval, revise the agent, or both.

Failures are useful when they create a path to improvement. If every output passes, your eval may be too weak. If every output fails, your eval may be misaligned. The useful middle is a healthy mix of passes and failures that you can inspect, categorize, and act on.
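
A quick sanity check on the pass rate can flag an eval that is too weak or misaligned before anyone reads individual spans. This is a rough sketch: the 5% and 95% cutoffs and the function name are arbitrary assumptions to tune for your own data.

```python
def diagnose_pass_rate(labels: list[str], low: float = 0.05, high: float = 0.95) -> str:
    """Classify an eval run by how many items were labeled 'accurate'."""
    if not labels:
        return "no results"
    pass_rate = sum(label == "accurate" for label in labels) / len(labels)
    if pass_rate >= high:
        return "eval may be too weak"
    if pass_rate <= low:
        return "eval may be misaligned"
    return "usable mix"
```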

In the PM-agent example, failed evals might show that the agent consistently under-ranks bugs. That gives you a concrete fix: update the scoring rubric, add examples of historically escalated bugs, or add product analytics and support data so the agent has more context about impact.

The loop becomes:


Run the agent.
Trace the run.
Evaluate the spans.
Inspect failed evaluations.
Identify the failure pattern.
Update the eval, the prompt, the tools, or the context.
Run the agent again.

That is the practical version of self-improvement. The agent doesn’t magically get better. The system gets better because traces expose behavior, evals turn behavior into measurable signals, and humans refine the policy behind the loop.

Run two loops: the agent loop and the improvement loop

In the live session, Aparna showed two key loops:

  • The agent loop runs on a schedule: fetch new feedback, score issues, and generate the report.
  • The improvement loop reads failed evals, opens the relevant traces, groups failures, and proposes the smallest safe change to the agent, evaluator, data sources, or tools.

For the PM agent, you could ask:


Find all spans where the priority accuracy eval failed.

Group the failures by root cause.

For each group, recommend whether we should improve:
- the agent prompt
- the scoring rubric
- the evaluator
- the data sources
- the tool implementation
- the tool sequence or stopping criteria 

Draft the smallest safe change that would improve the next run.

This is where the workflow stops looking like a dashboard and starts looking like an engineering loop. The agent can consume traces, read eval results, cluster failures, and suggest changes.
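
The clustering step can be sketched in a few lines. The example assumes each eval result is a dictionary with `span_id`, `label`, and a `root_cause` tag already assigned by a reviewer or by the improvement agent. Those field names and cause labels are hypothetical.

```python
from collections import defaultdict


def group_failures(eval_results: list[dict]) -> dict[str, list[str]]:
    """Group failed span IDs by root cause, largest group first."""
    groups: dict[str, list[str]] = defaultdict(list)
    for result in eval_results:
        if result["label"] != "inaccurate":
            continue  # only failures feed the improvement loop
        groups[result.get("root_cause", "unlabeled")].append(result["span_id"])
    ordered = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return dict(ordered)


results = [
    {"span_id": "s1", "label": "inaccurate", "root_cause": "missing_context"},
    {"span_id": "s2", "label": "accurate"},
    {"span_id": "s3", "label": "inaccurate", "root_cause": "rubric_gap"},
    {"span_id": "s4", "label": "inaccurate", "root_cause": "missing_context"},
]
failure_groups = group_failures(results)
```

Sorting by group size puts the most common failure pattern first, which is usually the best candidate for the smallest safe change.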

The improvement agent isn’t just reading final outputs. Instead, it’s reading the trace: which tools were called, what each tool returned, which LLM calls happened, which spans failed, and how those steps led to the final output. That trajectory is typically where the real bug lives.

The human still decides what ships. For an internal PM agent, the loop can move quickly. For a production agent, proposed changes should go through review, especially changes to prompts, tools, routing policy, and evals. Eval changes deserve extra scrutiny because they redefine what “good” means.

That is the governance layer: automate analysis and recommendations, but keep humans responsible for policy, review, and deployment.

Make observability part of the harness

Observability is usually framed as a debugging tool for humans. For agents, it also becomes input data for the improvement loop. Traces become evidence, evals become checks, and failure clusters become improvement tasks.

That’s why the architecture matters. If trace data is trapped in a dashboard, the loop stays manual. If traces and eval results are available through APIs, CLIs, and standard formats, agents can consume them directly.

That’s what happened in the demo: Aparna’s Claude Code skills could call APIs, fetch traces, inspect spans, and use that data to suggest evals and improvements.

This doesn’t remove humans from the loop. But it does change where humans add leverage: defining rubrics, reviewing changes, inspecting ambiguous failures, and deciding what level of autonomy is safe.

Prompts matter, but the harness is where most agent improvement happens. The harness controls context, tools, state, retries, routing, memory, evals, and review gates. If an agent underperforms, the fix might be a prompt change, but it might also be a better tool, a new retrieval step, a stricter policy, a different scoring function, or an improved eval.

In other words, self-improvement is really harness improvement.

In the PM-agent example, GitHub data is enough to prototype the workflow but will eventually hit limits. Customer calls, support tickets, sales notes, product analytics, and community discussions may all change how an issue should be prioritized.

The same pattern applies to support agents, coding agents, research agents, and in-product assistants. The model is one part of the system. The harness determines whether the model has the right context, whether behavior is observable, whether outputs are evaluated, and whether the system can improve safely.

A practical starter workflow

Start with the two-hour version: pick a repetitive internal workflow where the cost of failure is low and the learning value is high.

Pick something that already takes a few hours each week: triaging GitHub issues, summarizing release feedback, drafting support insights, reviewing sales call themes, or generating a weekly product report.

From there, you can:

  1. Build the simplest agent that can perform the task
  2. Give it one or two data sources
  3. Trace tool calls, LLM calls, intermediate decisions, and final outputs
  4. Pick one eval tied to one behavior
  5. Run the eval on real traces
  6. Inspect failed spans
  7. Make one improvement
  8. Run the loop again

Here’s a basic rubric (pun intended) to consider depending on what you’re building:

  • Triage agent: priority accuracy
  • Support agent: answer groundedness
  • Research agent: citation correctness
  • Coding agent: requirement satisfaction

That single cycle teaches more than a generic eval framework because it is grounded in your agent’s actual behavior.

What changes when this moves into production

Production agents use the same loop with stricter review boundaries.

The agent should still be instrumented, evals should still run on real traces, and failures should still feed improvement workflows. The difference is that production systems need a clear separation between analysis and action.

A safe production loop can automatically detect failed interactions, retrieve the relevant trace, run evals, cluster root causes, and draft a proposed fix. Shipping that fix should depend on the risk of the change.

For example, a thumbs-down on an in-product agent response could trigger a debug workflow that retrieves the trace, checks whether the eval or agent was wrong, and proposes a fix. That proposal should still go through review before it changes production behavior.

Low-risk changes might update documentation, add examples to an eval dataset, or create a ticket. Higher-risk changes should require review before modifying a prompt, tool policy, routing rule, or production agent behavior. Eval changes deserve special care because they alter the definition of quality.
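
One way to encode that boundary is a small routing function. This is a sketch under assumptions: the change-type names are hypothetical, and it assumes low-risk changes may be applied automatically while everything else, including unknown change types, waits for a human.

```python
# Hypothetical change types; adapt the set to your own workflow.
LOW_RISK_CHANGES = {"update_docs", "add_eval_example", "create_ticket"}


def route_change(change_type: str) -> str:
    """Decide whether a proposed change can be applied without review."""
    if change_type in LOW_RISK_CHANGES:
        return "auto_apply"
    # Prompts, tool policies, routing rules, evals, and anything unrecognized
    # default to review, so new change types fail safe.
    return "needs_review"
```

Defaulting unknown types to review keeps the gate conservative when the improvement loop starts proposing kinds of changes nobody anticipated.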

The goal is shortening the distance between production failure and safe improvement.

Make agent improvement a repeatable system

The best teams are moving from one-off debugging to repeatable improvement workflows. Every run should produce evidence you can use: traces that show what happened, evals that identify where behavior broke down, and failed spans that point to the next change.

That’s the role of the harness. It gives the agent the right context, records the path it took, runs checks against the behavior you care about, and creates a safe place to propose changes before they reach production.

This is how agent systems become more reliable over time: not through a single better prompt, but through a workflow that makes failures visible, reviewable, and actionable.

For developers and AI engineers, here’s a practical takeaway: don’t wait until your agent is “finished” to add observability and evals. The first useful version of an agent is exactly when you should start collecting traces. Those traces will show you what to evaluate. The evals will show you what failed. The failures will tell you what to improve.

Build the agent, trace the run, create one eval from real behavior, criticize the eval, inspect failed spans, improve either the agent or the evaluator, and run the loop again.

That’s how you move from a prototype that works once to an agent system you can keep improving.

From production traces to better AI agents: Automating the LLMOps feedback loop
https://arize.com/blog/from-production-traces-to-better-ai-agents-automating-the-llmops-feedback-loop/
Wed, 27 May 2026 14:17:59 +0000

Production AI traces are the raw material for better evals, prompts, datasets, and fine-tuned models. This post shows how the Arize AX Airflow Provider turns that feedback loop into scheduled, monitored LLMOps pipelines.

The post From production traces to better AI agents: Automating the LLMOps feedback loop appeared first on Arize AI.

If you’ve ever shipped an LLM-powered application to production, you know the real work starts after deployment.

Evaluations need to run on fresh data, prompts need to be tested against baselines before promotion, and your datasets need to stay current with production traffic. When a model changes, retrieval quality might drop or a prompt regression may slip in. And that means you need your evaluation workflow to run automatically before the issue reaches users.

Most teams cobble this together with cron jobs, custom scripts, and Slack reminders. That works for a prototype, but breaks down once evals become part of your release process.

Today, we’re releasing the Arize AX Airflow Provider, an open-source Apache Airflow provider that brings Arize AX into your orchestration layer.

The provider includes 95+ operators, 8+ sensors, and 19+ example DAGs for common workflows including span export, dataset refresh, experiment comparison, prompt promotion, drift detection, annotation queues, and CI/CD gates.

Instead of building custom glue code between your evaluation platform and orchestration system, you can define these workflows directly in Airflow.

Let’s jump in.

Why production AI (and self-learning agents) need feedback loops

The industry is moving toward AI agents that improve themselves through production experience. Not in a hand-wavy “AGI will figure it out” way, but in a concrete, engineering-driven way where every production trace becomes a learning opportunity.

Yet even as we build towards this goal, we also know that today’s production AI systems improve when teams can turn real failures into eval coverage. That requires a concrete loop: observe behavior in traces, evaluate it, identify failures, add representative examples to datasets, test candidate changes, and gate deployment.

Here is the feedback loop this provider is designed to automate (and that makes self-learning agents practical):

  1. Observe: Agents run in production, generating traces across every tool call, reasoning step, and LLM interaction. Arize AX captures these as structured spans with full context.
  2. Evaluate: Automated evaluators score each trace on dimensions that matter: accuracy, tool-calling correctness, goal achievement, hallucination risk. The Eval Hub in Arize AX runs these evaluations continuously instead of just at deploy time.
  3. Identify failures: When evaluations surface regressions (a retriever returning irrelevant context, a planner choosing the wrong tool, a response contradicting source material), those failures become test cases.
  4. Curate datasets: Failed traces and edge cases get promoted into evaluation datasets. The ArizeAxSmartDatasetRefreshOperator automates this, pulling diverse production examples into your golden dataset so it stays representative of real traffic.
  5. Improve: Updated prompts, fine-tuned models, or revised tool configurations are tested against the enriched dataset before deployment. The comparison operators verify that the fix actually works and doesn’t regress other dimensions.
  6. Deploy with gates: Only changes that pass evaluation gates make it to production. The CI/CD gate operators enforce this programmatically with fewer manual handoffs and no unvalidated changes reaching production.

[Figure: Arize AX feedback loop from production traces to evals, prompts, datasets, and model improvements]

This isn’t a one-time workflow; it’s a continuous loop. And the key insight is that you can’t run this loop manually at scale.

When you have dozens of agents, hundreds of prompts, and millions of daily traces, the evaluation-improvement cycle needs to be automated. That’s where Airflow comes in.

Production AI traces are the raw material for better evals

Traditional Application Performance Monitoring (APM) systems surface infrastructure signals like 500 errors, latency spikes, and timeouts. They tell you when something is slow or broken, but they don’t tell you whether the system is actually doing its job.

LLM observability data is different. A trace can capture the user request, retrieved context, model response, intermediate tool calls, latency, token usage, evaluator scores, and final outcome. For AI systems, that trace is more than a health signal. It is evidence of how the system behaved.

There’s also a structural reason this data matters more for agents than for traditional software. With code, the source of truth is the code itself: every decision point is written down, every path is visible, and you can predict behavior by reading the file. Agents don’t work that way.

At runtime, the agent decides which tool to call, whether to retry, when to hand off, and how to interpret what came back. The same question can take a different path on Tuesday than it did on Monday. You cannot read an agent the way you read a program; its real behavior only shows up after the fact in the trace. The trace is the new source of truth, and the only place the actual system behavior is visible.

That changes what a “bug report” looks like. When an evaluator flags a hallucination, you don’t just get an alert, you get the exact input, the insufficient context that was retrieved, the flawed output, and an assessment of why it went wrong. Feed that failure into your evaluation dataset and the next version of your agent is tested against the precise error that hit production.
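
As a rough illustration, a failed span can be turned into a regression example with a small mapping function. The `FailedSpan` fields and the output keys are hypothetical; a real trace schema will have its own attribute names.

```python
from dataclasses import dataclass


@dataclass
class FailedSpan:
    span_id: str
    user_input: str
    retrieved_context: list[str]
    output: str
    eval_label: str          # e.g. "hallucinated"
    eval_explanation: str    # the evaluator's reason for the label


def to_dataset_example(span: FailedSpan) -> dict:
    """Keep everything needed to replay and re-judge the failure later."""
    return {
        "input": span.user_input,
        "context": span.retrieved_context,
        "bad_output": span.output,
        "failure_label": span.eval_label,
        "failure_reason": span.eval_explanation,
        "source_span_id": span.span_id,  # link back to the original trace
    }


example = to_dataset_example(
    FailedSpan(
        span_id="span-123",
        user_input="What is the refund window?",
        retrieved_context=["Shipping takes 3 to 5 business days."],
        output="Refunds are accepted within 90 days.",
        eval_label="hallucinated",
        eval_explanation="The retrieved context does not mention refunds.",
    )
)
```

Keeping the source span ID on the example means a reviewer can always open the original trace when a later run fails on the same case.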

Repeat this cycle enough times and your system genuinely learns from its mistakes, not through any clever self-improving architecture, but because you’re treating observability data for what it really is: the most specific training signal your team will ever access. APM tells you the house is on fire; LLM observability data can tell you how to build a fireproof house.

Why Airflow for LLMOps?

The same properties that make Apache Airflow useful for data pipelines (dependency management, scheduling, retries, alerting, and audit trails) are exactly what LLMOps workflows need.

For instance, consider this: your nightly evaluation pipeline is a DAG. Your prompt promotion workflow and drift detection loop are DAGs, too. And the agent self-improvement cycle described above? That’s a DAG too.

The Arize AX Airflow Provider closes the gap between your evaluation platform and your orchestration layer. Instead of building custom integrations, you get operators that plug directly into your existing Airflow infrastructure and speak natively to the Arize AX platform. The feedback loop that makes agents self-improving becomes a scheduled, retried, monitored, auditable pipeline, not a hope.

There’s a deeper reason this combination works, and it’s about open source. Open source wins. It always has, and the projects that survive (Linux, Kubernetes, PostgreSQL, Kafka) become the substrate everything else gets built on. Apache Airflow is that substrate for orchestration, maintained by the Apache Software Foundation with thousands of contributors and a battle-tested core.

This aligns directly with how Arize thinks about the AI stack more broadly.

  • Arize Phoenix, Arize’s open-source LLM observability platform, drives the developer experience of tracing and evaluation across the OSS community.
  • OpenInference, the open instrumentation spec for capturing agent and LLM traces, is built on OpenTelemetry and adopted across the AI tooling ecosystem.
  • The Arize AX Airflow Provider extends that same philosophy into the orchestration layer.

Instrumentation, observability, evaluation, and now orchestration: the full LLMOps stack is open by default. You can adopt any piece independently or wire the whole thing together. Either way, nothing about how you operate your pipeline is locked behind a proprietary box.

Why Arize AX for LLMOps?

If Airflow runs the workflow, Arize AX defines and manages the evaluation system: what gets evaluated, how it is scored, where results are reviewed, and which actions happen next.

This topic is covered in more depth in the post “What Is An Evaluation Harness”: an evaluation harness is the standardized infrastructure that turns evaluation from one-off scripts into a repeatable operational system. It’s the difference between a team eyeballing model outputs in a notebook and a team that systematically catches regressions before they ship.

Arize AX is built around exactly that pattern with three stages that map directly to operators in this provider.

  • Inputs flow in from production traces (captured via OpenInference), dataset examples, and experiment runs, available through operators like ArizeAxListSpansOperator, ArizeAxExportSpansToDataframeOperator, and the full dataset suite.
  • Execution happens through LLM-as-judge evaluators in Eval Hub, code-based checks, or custom evaluation functions, triggered via ArizeAxTriggerTaskRunOperator and managed through the evaluator and task operators.
  • Actions route to annotation queues for humans, threshold alerts, CI/CD gates, or downstream experiments, expressed through built-in flags like fail_on_regression=True and min_score=0.7, plus operators like ArizeAxPromotePromptOperator and the annotation queue operators.

Airflow doesn’t replace this. It operationalizes it. The eval harness defines what “good” looks like and how to measure it. Airflow makes that measurement run on a schedule, retry on transient failures, gate releases, and feed results back into the next iteration.

Arize AX is the harness and the provider is the connective tissue. Together they turn evaluation from a development-time exercise into a production system that runs whether you’re paying attention or not.

[Figure: Airflow orchestrating Arize AX LLMOps workflows for production AI systems]

A real example: the LLM CI/CD gate

Let’s walk through a concrete workflow that most production LLM teams need: a CI/CD gate that blocks deployment if evaluation scores regress.

The workflow is simple: before a prompt version receives the production label, run an experiment against the evaluation dataset and compare it with the current baseline. If the candidate regresses on required metrics, the Airflow task fails and the promotion task never runs.

With the Arize AX Airflow Provider, this entire workflow is a handful of operators:


from datetime import datetime
from typing import Any

from airflow import DAG
from airflow.models import Variable
from airflow.providers.standard.operators.python import PythonOperator
from airflow.providers.arize_ax.operators.experiments import (
    ArizeAxCompareExperimentsOperator,
)
from airflow.providers.arize_ax.operators.prompts import (
    ArizeAxGetPromptOperator,
    ArizeAxPromotePromptOperator,
)
from airflow.providers.arize_ax.operators.tasks import (
    ArizeAxCreateRunExperimentTaskOperator,
    ArizeAxGetTaskRunOperator,
    ArizeAxTriggerTaskRunOperator,
)
from airflow.providers.arize_ax.sensors.arize_ax import ArizeAxTaskRunSensor


def build_run_config_from_prompt(**ctx) -> dict[str, Any]:
    """Materialize a Prompt Hub prompt version into a server-side run config."""
    prompt = ctx["ti"].xcom_pull(task_ids="fetch_candidate_prompt")
    version = prompt["version"]
    return {
        "experiment_type": "llm_generation",
        "ai_integration_id": Variable.get("arize_ax_ai_integration_id"),
        "model_name": version["model"],
        "messages": version["messages"],
        "input_variable_format": version["input_variable_format"],
        "invocation_parameters": version.get("invocation_params") or {},
        "provider_parameters": version.get("provider_params") or {},
    }


with DAG(
    dag_id="llm_cicd_gate",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
    render_template_as_native_obj=True,  # required: dict XCom must stay a dict
) as dag:

    # 1. Fetch the candidate prompt version from Prompt Hub.
    fetch_candidate_prompt = ArizeAxGetPromptOperator(
        task_id="fetch_candidate_prompt",
        prompt_name="{{ var.value.arize_ax_prompt_name }}",
        version_label="staging",
    )

    # 2. Translate the prompt into a server-side run config.
    build_run_config = PythonOperator(
        task_id="build_run_config",
        python_callable=build_run_config_from_prompt,
    )

    # 3. Create the Eval Hub task; Arize executes the LLM, not the worker.
    create_task = ArizeAxCreateRunExperimentTaskOperator(
        task_id="create_candidate_task",
        name="candidate-{{ ds_nodash }}",
        dataset_id="{{ var.value.arize_ax_eval_dataset_id }}",
        run_configuration="{{ ti.xcom_pull(task_ids='build_run_config') }}",
        if_exists="skip",
    )

    # 4. Trigger the run.
    trigger_run = ArizeAxTriggerTaskRunOperator(
        task_id="trigger_candidate_run",
        task_id_param="{{ ti.xcom_pull(task_ids='create_candidate_task') }}",
        experiment_name="candidate-{{ ds_nodash }}",
    )

    # 5. Wait until the run completes.
    wait_for_run = ArizeAxTaskRunSensor(
        task_id="wait_for_candidate_run",
        run_id="{{ ti.xcom_pull(task_ids='trigger_candidate_run') }}",
        poke_interval=15,
        timeout=900,
        mode="reschedule",
    )

    # 6. Fetch the resulting experiment_id.
    get_result = ArizeAxGetTaskRunOperator(
        task_id="get_candidate_result",
        run_id="{{ ti.xcom_pull(task_ids='trigger_candidate_run') }}",
    )

    # 7. Gate: compare against the production baseline.
    compare = ArizeAxCompareExperimentsOperator(
        task_id="compare_to_baseline",
        baseline_experiment_id="{{ var.value.arize_ax_baseline_experiment_id }}",
        candidate_experiment_id="{{ ti.xcom_pull(task_ids='get_candidate_result')['experiment_id'] }}",
        metric_names=["accuracy"],
        pass_threshold=0.0,
        fail_on_regression=True,
    )

    # 8. Promote the same prompt to production (only on gate-pass).
    promote = ArizeAxPromotePromptOperator(
        task_id="promote_to_production",
        prompt_name="{{ var.value.arize_ax_prompt_name }}",
        label="production",
    )

    fetch_candidate_prompt >> build_run_config >> create_task >> trigger_run
    trigger_run >> wait_for_run >> get_result >> compare >> promote
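
The gating idea in step 7 can be illustrated in plain Python. This is not the provider's implementation, which is not shown in this post. `gate_on_regression` and `RegressionError` are hypothetical, and the sketch assumes `pass_threshold` means the minimum allowed change (candidate minus baseline) for each metric.

```python
class RegressionError(Exception):
    """Raised when a candidate scores below the allowed delta on a metric."""


def gate_on_regression(
    baseline: dict[str, float],
    candidate: dict[str, float],
    metric_names: list[str],
    pass_threshold: float = 0.0,
    fail_on_regression: bool = True,
) -> dict[str, float]:
    """Return per-metric deltas; raise if any metric regressed past the threshold."""
    deltas = {name: candidate[name] - baseline[name] for name in metric_names}
    regressed = [name for name, delta in deltas.items() if delta < pass_threshold]
    if regressed and fail_on_regression:
        raise RegressionError(f"Regressed metrics: {', '.join(regressed)}")
    return deltas


# A candidate that improves accuracy passes the gate.
deltas = gate_on_regression({"accuracy": 0.5}, {"accuracy": 0.75}, ["accuracy"])
```

In Airflow, an exception raised inside a task marks that task as failed, and downstream tasks using the default trigger rule do not run. That is why a failed comparison keeps the promotion step from executing.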

Beyond CI/CD: what the example DAGs cover

The Arize team provided 19+ example DAGs that cover the most common LLMOps patterns we see teams building. Here are a few worth highlighting:

Drift detection with auto-rollback: Run a daily evaluation experiment, compare metrics against a stable baseline, and if drift is detected, automatically roll the prompt back to the last known good version. The ArizeAxDetectEvalDriftOperator computes per-metric drift and the fail_on_drift=True flag triggers the rollback branch.

Prompt lifecycle management: Version your prompts through a proper staging-to-production pipeline. The prompt lifecycle DAG runs evaluation experiments at each stage, compares candidate scores against the baseline, and only promotes when the gate passes. No manual approvals, just data-driven decisions enforced by ArizeAxCompareExperimentsOperator with fail_on_regression=True.

Behavioral regression testing: Mean scores can hide problems. Two models might have the same average accuracy, but one refuses 30% of questions while the other gives verbose, off-topic answers. The ArizeAxBehavioralRegressionOperator compares output length distributions, refusal rates, format compliance, and sentence counts between a candidate and baseline, catching regressions that aggregate metrics miss.
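To make the idea concrete, here is a minimal plain-Python sketch of a behavioral comparison. It is not the implementation of ArizeAxBehavioralRegressionOperator; the refusal markers, function names, and thresholds are all assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical phrases that mark a refusal; tune these for your own application.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def refusal_rate(outputs):
    """Fraction of outputs that open with a refusal phrase."""
    refusals = sum(o.strip().lower().startswith(REFUSAL_MARKERS) for o in outputs)
    return refusals / len(outputs)


def mean_words(outputs):
    """Average output length in whitespace-separated words."""
    return mean(len(o.split()) for o in outputs)


def behavioral_regressions(baseline, candidate,
                           max_refusal_increase=0.05, max_length_ratio=1.5):
    """Name the behaviors where the candidate moved too far from the baseline."""
    problems = []
    if refusal_rate(candidate) - refusal_rate(baseline) > max_refusal_increase:
        problems.append("refusal_rate")
    if mean_words(candidate) > max_length_ratio * mean_words(baseline):
        problems.append("output_length")
    return problems


baseline_outputs = ["Paris is the capital.", "Use pip install."]
candidate_outputs = ["I can't help with that.", "Use pip install."]
problems = behavioral_regressions(baseline_outputs, candidate_outputs)
```

Both sets of outputs could earn the same average score from a lenient evaluator, yet the candidate refuses half of its inputs, which is the kind of shift a mean would hide.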

RAG evaluation pipeline: Export retriever and generator spans from a production RAG application, build a focused evaluation dataset, then run faithfulness and relevance evaluations with LLM-as-judge evaluators. The RAG DAG chains span export, dataset creation, dual experiment runs, and score aggregation into a single scheduled pipeline.

Automated dataset curation: Evaluation datasets go stale as production traffic evolves. The dataset curation DAG filters, deduplicates, and appends high-quality production spans to your evaluation dataset daily, keeping it representative without manual effort.

Fine-tuning data preparation: Export high-quality spans as OpenAI-compatible JSONL, validate the file structure, and stage the results in an Arize dataset ready for fine-tuning. The ArizeAxExportSpansToFineTuningOperator handles the format conversion; the DAG handles the pipeline.

Evaluation tasks with continuous scoring: The tasks DAG demonstrates attaching LLM-as-judge evaluators to live production projects: create an evaluator in Eval Hub, wire it to a project via an evaluation task, trigger an on-demand run, wait for completion, and gate deployment on the scores. The example sets override_evaluations=True so previously scored spans get re-evaluated when the evaluator improves.

Each example DAG is self-contained and ready to run. Copy it into your Airflow dags/ folder, set a few Airflow Variables, and you’re running production LLMOps pipelines.

Putting it all together: from production traces to a fine-tuned SLM

The real value of this provider isn’t any single operator. It’s what happens when you chain them across the full LLMOps lifecycle from the first production trace to a fine-tuned model running cheaper, faster, and better on your domain than the frontier LLM you started with.

Here’s what that looks like as a single, end-to-end Airflow pipeline:

End-to-end Airflow pipeline for an LLM CI/CD evaluation gate

The flow has four phases. Each one chains operators we’ve already talked about, but seeing them together is where the picture clicks.

Phase 1 – Signal: Production traffic flows in at millions of spans a day, maybe more. You can’t evaluate all of it, and you don’t need to. ArizeAxTriggerTaskRunOperator runs continuous evaluation on live spans through Eval Hub. ArizeAxDetectEvalDriftOperator surfaces metric regressions before users notice them. ArizeAxAdaptiveSamplingOperator picks the priority spans worth deeper review: the ones with high uncertainty, novel inputs, or anomalous outputs. The output of this phase isn’t a dataset. It’s a signal: here are the spans worth your attention.

Phase 2 – Human-in-the-loop: This is where most LLM evaluation pipelines either skip a step or build their own brittle version of it. The provider treats it as a first-class phase. ArizeAxCreateAnnotationQueueOperator routes flagged spans to your SMEs. ArizeAxAssignQueueRecordOperator distributes review work across your team. And critically, ArizeAxEvaluatorCalibrationOperator measures how well your LLM-as-judge evaluators agree with the humans closing the loop. If the LLM judge is drifting away from human judgment, you find out and recalibrate before it poisons everything downstream. In practice, only about 5-10% of spans need human review; the LLM judges handle the rest with measurable accuracy.
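As a rough illustration of what measuring agreement can mean, the sketch below computes Cohen's kappa between human labels and judge labels in plain Python. Kappa is a standard agreement statistic picked here as an assumption; the source does not say which metric ArizeAxEvaluatorCalibrationOperator uses, and the labels are made up.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels for four reviewed spans.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
```

Here the raw agreement is 75%, but kappa is only 0.5 once chance agreement is removed, so a judge that looks mostly right can still need recalibration.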

Phase 3 – Data: Now you have something that matters: a stream of high-quality, human-validated examples. ArizeAxCurateSpansToDatasetOperator filters and deduplicates. ArizeAxAppendDatasetExamplesOperator merges in the HITL annotations. ArizeAxSmartDatasetRefreshOperator keeps the dataset diverse and current as production evolves. Before training, ArizeAxEvalDatasetHealthOperator is a gate that checks freshness, diversity, and coverage. If your dataset is stale or skewed, training stops here, not after burning the compute. Finally, ArizeAxExportSpansToFineTuningOperator produces an OpenAI-format JSONL file ready for fine-tuning.

Phase 4 – Model: Fine-tuning itself happens outside Airflow. TriggerDagRunOperator kicks off your training pipeline, whether that’s the OpenAI fine-tuning API, Together AI, or your own self-hosted vLLM cluster. When training finishes, the rest of the loop is built-in operators again. ArizeAxRunExperimentOperator evaluates the new SLM on your curated dataset. ArizeAxCompareExperimentsOperator with fail_on_regression=True is the gate: does the SLM match or beat the baseline GPT-4? ArizeAxBehavioralRegressionOperator checks that the output distribution didn’t shift in unexpected ways. And ArizeAxPromotePromptOperator only flips the production label if every gate passes. No ungated SLM ever reaches users.

None of this requires a custom platform. It’s an Airflow DAG. It runs on the orchestration infrastructure your team already operates, with the retry logic, observability, and audit trails you already trust.

The Arize AX Airflow provider just makes the operators feel native so the DAG that drives a self-improving SLM looks like the DAGs your team is already writing.

The bigger picture: tighter production agent feedback loops

We built these providers because we believe the next phase of LLM adoption isn’t about getting agents to work, but getting them to keep working and keep getting better.

The teams that win will be the ones with the tightest feedback loop between production behavior and system improvement. They’ll be the ones who treat every trace as training data, every evaluation as a test case, and every regression as an opportunity to make the system more robust.

These feedback loops can’t be manual. They’ll need infrastructure and heavy automation.

In our model:

  • Arize AX provides the observability, evaluation, and dataset management.
  • Airflow provides the scheduling, orchestration, and operational guarantees.
  • The provider connects the two so you can build agents that genuinely learn from their own production experience, on a schedule, with retries, with audit trails, and with gates that prevent regressions from reaching users.
Arize AX and Airflow architecture for a self-improving small language model loop

Getting started

Install the Arize AX Airflow provider:

pip install arize-ax-airflow-provider

Configure your Arize connection in Airflow:

  1. Go to Admin > Connections > Add
  2. Set Connection Id to arize_ax_default
  3. Set Password to your Arize API key
  4. Set Extra to {"space_id": "your-space-id"}
Airflow connection configuration for the Arize AX provider

Set your Airflow Variables:


airflow variables set arize_ax_space_id "your-space-id"
airflow variables set arize_ax_project_id "your-project-id"

Copy any example DAG from the provider into your DAGs folder and trigger it. The E2E test DAG (example_arize_ax_e2e_dag) is a good place to start: it exercises every major operator and sensor in a single self-contained run.

Example Arize AX Airflow provider DAG running in Airflow

What’s next

This is the 1.4.x Pre-GA release. We’re actively developing and welcome feedback on the operator APIs, example DAGs, and any LLMOps patterns you’d like to see covered.

If you’re new to Arize AX, the documentation covers the platform features. The provider’s reference for its operators, sensors, and hooks, together with the example DAGs, is a practical way to see what’s possible when you bring LLMOps into your orchestration layer.

The Arize AX Airflow Provider requires Python 3.10+, Apache Airflow 2.4+, and Arize SDK v8+.

The post From production traces to better AI agents: Automating the LLMOps feedback loop appeared first on Arize AI.

]]>
How to ship a local LLM that matches frontier LLMs with evals and prompt engineering https://arize.com/blog/how-to-ditch-your-frontier-model-for-an-slm/ Tue, 26 May 2026 14:00:24 +0000 https://arize.com/?p=28627 Most production AI features don't need a frontier model. Here's how capability evals and prompt engineering can help ship a local SLM that matches frontier-model quality with lower latency and cost.

The post How to ship a local LLM that matches frontier LLMs with evals and prompt engineering appeared first on Arize AI.

]]>
Most production AI features don’t need a frontier model. Here’s how I used capability evals and prompt engineering to ship a local 3B model that matches Claude Sonnet on quality, runs twice as fast, and costs nothing per call.

I’ve been building Mima, a social and news app that uses AI to summarize conversations, detect toxicity, and add other touches that make navigating the connected web smoother. Of course, I built it using my favorite Large Language Model (LLM), Claude. But now two things were blocking the beta:

  • Keeping the user’s Personally Identifiable Information (PII) on their device and off of third-party servers. This is a skunkworks app, not a funded business with money to throw at GDPR compliance!
  • Keeping costs low. Every call to an Anthropic server is money I could be spending on other things, like a designer or Amazon gift cards for product testers.

In London’s startup scene, I’ve watched many AI-heavy products eat their founders out of house and home on inference costs alone. And Gartner expects total inference spend to keep rising even as per-token prices fall, because agentic workloads consume tokens faster than prices drop. Anthropic itself introduced new rate limits in 2025 after acknowledging that Claude Code usage was growing faster than expected. Today’s prices are subsidized by VC, not unit economics, and when the subsidy ends, every cloud LLM call in your stack becomes a cost center you can’t control.

So I went looking for a way to do most of this work locally. Most production AI features do one narrow task (classify, summarize, extract, translate), and that’s a fraction of what an LLM is capable of. You’re paying for the rest in latency, tokens, and dependency on a service you don’t control.

But small language models (SLMs) take up between 2 and 16 GB on disk, run on the user’s device, don’t go down when the Wi-Fi does, and cost nothing per call. Foundation models are still best for long-context reasoning or open-ended creative work. But for summarization, extraction, classification, and most of the actual production AI surface, today’s SLMs are more than enough.

Which raises the question: if SLMs are this capable, why isn’t every product using them?

Because picking the right one, and proving it’s the right one, takes a skill that until recently was reserved for ML engineers: evals.

Evals are a skill every AI engineer worth their salt needs to learn, and this is how to do it.

Just enough inference with evals

No matter their size, different models are better and worse at different tasks, as we can see from any benchmark comparison. There’s no perfect model, only models of varying capability for your specific task. But most of us look to benchmarks or ask our friends, “What’s the best new model?”

What we really should be asking is “which model is good enough to accomplish my task quickly, accurately, and cheaply?” We need to measure their differing capabilities so we can make an educated trade-off, such as opting for a slower model that offers more accurate results, or vice versa.

To measure a model’s capabilities, you’ll need evals.

Evals are to models what tests are to code. Well, not quite. With code, we’re testing for specific outcomes. 2 + 2 = 4, always. With evals, we’re testing acceptable outcomes. The eval for “What’s the capital of France?” would accept “Paris,” “The capital of France is Paris,” “It’s Paris!” and possibly even geographic coordinates! This makes evals more appropriate for non-deterministic code. You’re asking, “Across a representative set of inputs, does this model produce outputs that meet our bar often enough to ship?”
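Here is a minimal sketch of that difference in Python. The check accepts any phrasing that names Paris, then reports a pass rate across a set of sample outputs. The function names and samples are illustrative assumptions, and a check this narrow would still reject an answer given as coordinates.

```python
import re


def accepts_paris(output: str) -> bool:
    """Pass any answer that names Paris as a whole word, ignoring case."""
    return re.search(r"\bparis\b", output, flags=re.IGNORECASE) is not None


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs that meet the bar."""
    return sum(accepts_paris(o) for o in outputs) / len(outputs)


samples = [
    "Paris",
    "The capital of France is Paris.",
    "It's Paris!",
    "Lyon",
]
rate = pass_rate(samples)  # three of the four samples pass
```

A unit test would assert one exact string. The eval asserts membership in a set of acceptable answers, then asks whether the pass rate is high enough to ship.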

Finding a SAGE (Small And Good Enough) model

In the “prototype big, ship small” framework, you prototype any AI feature or product with a SOTA (state of the art) model, just to make sure what you’re trying to do is physically possible. It will also give you the results with the least effort. In four steps, you’ll be able to select the smallest model capable of performing within the larger model’s range of expected outcomes:

  1. Prove it’s possible. Use the best model you can to prototype the outcome you are looking for (like Gemini for translating French comic scans because it’s multimodal).
  2. Set success criteria. Collect a set of inputs and ideal outputs (the comic scripts in French and their correct translations in English, for instance).
  3. Test from small to large. Compare the outputs of smaller models against your test criteria. Work your way up from the smallest model until you get “close enough” to your baseline LLM. (What counts as “close enough” depends on your use case.)
  4. Select the smallest model that gives acceptable responses for your use case.

This is your SAGE model: Small and Good Enough.

Each step matters and skipping any of them is how you end up with a model that “kind of works” or falls apart in an edge case you didn’t consider.
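Steps 3 and 4 can be sketched as a small selection loop. Everything below is hypothetical: the candidate names, sizes, and scores are invented for illustration, and "score" stands in for whatever acceptance rate your own evaluators produce.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    size_gb: float
    score: float  # fraction of golden-dataset examples judged acceptable


def pick_sage(candidates, baseline_score, tolerance):
    """Return the smallest candidate scoring within `tolerance` of the baseline."""
    for candidate in sorted(candidates, key=lambda c: c.size_gb):
        if candidate.score >= baseline_score - tolerance:
            return candidate
    return None  # nothing is good enough yet; keep the baseline model


# Made-up numbers, not measurements from this post.
candidates = [
    Candidate("medium", 5.0, 0.96),
    Candidate("tiny", 1.0, 0.70),
    Candidate("small", 2.0, 0.93),
]
sage = pick_sage(candidates, baseline_score=0.97, tolerance=0.05)
```

The tolerance is where "close enough" lives. Widening it lets smaller models through, and tightening it pushes you back toward the baseline.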

Step 1: Proving the feature with Claude

I had already built two conversation summarization features that call Claude Sonnet, and I was satisfied with the results. These were my baseline, the measuring stick that all other models needed to measure up to.

Sonnet’s summarization was impeccable, but the cost was high: 28 summaries cost about $0.44 USD. Manageable for testing, but untenable for scaling. This performance formed the baseline for my golden dataset.

Step 2: Building the rubric and creating the golden dataset

A “golden dataset” is a set of ideal outcomes to measure your model’s generated outputs against. Without one, you don’t have a measuring stick to compare different outputs against. You’ll just be going on vibes, which don’t seem problematic when you’re prototyping, but become troublesome when you can’t hand-test every impacted surface later on in the product cycle, after upgrading a model, or changing a prompt.

I curated my golden dataset from 14 real, public conversations and their Sonnet-generated summaries. Each input (a conversation thread) is paired with two output summaries: one for a list view, and another for recapping long chats in a thread.

I chose Arize Phoenix for my eval harness. It’s open-source, local-first, and OpenAI-compatible. It’s maintained by the core engineers at Arize, who I just so happen to work with as well!

To kick things off, I made a baseline trace recording these metrics using Claude and the golden dataset. A trace is a log of everything that happened during one model call: the input prompt, the output, intermediate steps (if the model used tools or made sub-calls), timing, token counts, and any errors. It’s a complete log of one execution that you can replay, inspect, and reason about after the fact.

I chose the following metrics to weigh:

  • JSON validity (code): Does the output parse?
  • Reference structural validity (code): Do citations point to real messages?
  • Factual consistency (LLM-as-judge): Does the summary stay faithful to the thread?
  • Length compliance (code): Does it stay in the target word range?
  • p50 latency (code): typical case
  • p95 latency (code): worst case
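The code-based checks in that list are small functions. The sketch below assumes summaries cite messages with [ref:N] markers and that messages are numbered from 1; the function names and the word range used in the example are illustrative assumptions rather than the app's actual code.

```python
import json
import re

REF_PATTERN = re.compile(r"\[ref:(\d+)\]")


def json_valid(raw: str) -> bool:
    """JSON validity: does the raw model output parse?"""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


def refs_valid(summary: str, message_count: int) -> bool:
    """Reference structural validity: every [ref:N] is in 1..message_count."""
    return all(1 <= int(n) <= message_count for n in REF_PATTERN.findall(summary))


def length_ok(summary: str, min_words: int, max_words: int) -> bool:
    """Length compliance: word count, ignoring citation markers, is in range."""
    words = REF_PATTERN.sub("", summary).split()
    return min_words <= len(words) <= max_words


summary = "Ana proposed a launch date [ref:2] and Bo agreed [ref:3]."
checks = {
    "json": json_valid('{"summary": "ok"}'),
    "refs": refs_valid(summary, message_count=3),
    "length": length_ok(summary, min_words=5, max_words=40),
}
```

Each function returns a boolean per output, so the metric for a model is the share of outputs that return True.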

To decide whether an output is good or not, you’ll need an evaluator. There are three kinds of evaluators:

  • Human: the oldest kind—humans have been evaluating code outputs since the beginning of AI research! (Also the most expensive evaluator.)
  • Code-based: Deterministic, fast, free, reproducible. You use these in unit testing all the time. Was the output formatted correctly? Was it the right type? Did foo === foo ?? The cheapest evaluator.
  • LLM-as-judge: Good for subjective qualities a regex can’t capture (tone policing, faithfulness). You give a (usually larger) model the input, the output, and a rubric, and ask it to score. LLM-as-judge is slower and more expensive, so look for ways to measure “good enough” with code.

Notice that most of these metrics can be validated with code alone. But for equivalence, I needed an LLM-as-judge to compare outputs to the baseline traces.

To find the best model for the job, you’ll need to collect traces from experiments with other models and the golden dataset.

Step 3: Testing all the models

SLM capability evals and prompt engineering figure

My first instinct was to ask the ML engineers I respect and admire if there were any smaller models they thought might be a good starting place. Almost all recommended Gemma 4, a more than capable small model that’s been getting a lot of praise. And if I didn’t have evals, I might have chosen Gemma 4 and saddled my users with a less-than-ideal experience. This is why it’s important to run experiments on a range of models.

I chose Gemma 4 E4B-it with 4-bit quantization, weighing in at a hefty 5 GB on disk. This was the upper end of what I could expect a user to voluntarily download on a desktop. To round out the scale from smallest to largest and add vendor diversity, I chose the following models to compete:

  • Qwen 2.5 1.5B was already shipping in the app as a backup when Anthropic was offline.
  • Qwen 3 1.7B is in the same family, same footprint, no architecture change, but an upgrade over the incumbent.
  • Llama 3.2 3B is the most battle-tested model in node-llama-cpp, so it tells you what “fully baked, definitely works” looks like at this size class.

In Phoenix, I set up each model as its own experiment to test its capability. This is called a “capability eval,” and you usually run these at the start of a project or when you’re otherwise determining which prompt or model to use for a feature.

I ran the evals three times for every input and model combination to help iron out any outliers, so each model collected 84 evals (3*28 summaries). Each experiment used the same golden dataset and the same evaluators. The only variable was the model.

SLM capability evals and prompt engineering figure

Step 4: Choose the SAGE (Small and Good Enough)

One of the challenges with measuring models is that there are rarely clear winners. Often, you end up trading accuracy vs latency.

This chart is called a Pareto scatter. Each dot is a model, plotted on two axes: accuracy and latency. The Pareto frontier is the curve traced by the models that no other model beats on both axes at once: for each model on the frontier, anything faster is less accurate and anything more accurate is slower. Anything below the frontier is irrelevant because there’s a better option available. Anything on the frontier represents a real tradeoff. There’s no “best” model on the frontier without first specifying what you’re willing to trade, which is exactly what setting success criteria in Step 2 forces you to do.
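Finding the frontier takes only a few lines of code. This sketch treats a model as dominated when another model is at least as accurate and at least as fast, and strictly better on one of the two. The model names and numbers are placeholders, not the measurements from this post.

```python
def pareto_frontier(models):
    """models maps name -> (accuracy, latency_ms); higher accuracy and lower latency win."""
    frontier = []
    for name, (accuracy, latency) in models.items():
        dominated = any(
            other_acc >= accuracy
            and other_lat <= latency
            and (other_acc > accuracy or other_lat < latency)
            for other, (other_acc, other_lat) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values for illustration only.
models = {
    "A": (0.99, 3000),  # most accurate, mid latency
    "B": (0.90, 1200),  # fastest, less accurate
    "C": (0.85, 1500),  # slower and less accurate than B
    "D": (0.95, 7000),  # slower and less accurate than A
}
frontier = pareto_frontier(models)
```

Models A and B survive because each wins on one axis. C and D drop out because another model beats them on both.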

SLM capability evals and prompt engineering figure

Looking at this chart, only Sonnet, Llama 3.2, and Gemma 4 are worth comparing. The two Qwens were soundly surpassed.

Even though Qwen 2.5 was the fastest at p50 (the median or 50th percentile), it hallucinated references to nonexistent messages 27% of the time, vs. Llama’s 11%. Speed was important, but a fast feature that doesn’t work correctly is just a fast bug.

One way to mitigate this would be to run the inference several times and pick the accurate output, but that would eliminate the speed advantage, as comparison adds latency to the equation.

Gemma 4 was the quality outlier (95% reference accuracy), but it was disqualified due to latency at 7+ seconds. It was slower than Sonnet by multiple seconds, a delay users are very sensitive to.

That left Llama 3.2 3B as the best “good enough” alternative to Claude Sonnet 4.6. Without evals, comparing these models would have been impossible. I would likely have chosen Gemma 4 because of its popularity and reputation. The lesson learned: Don’t trust. Evaluate.

Close the gap between SLMs and LLMs with prompt engineering

Llama 3.2 was almost my SAGE model, but that 11% hallucination rate had to be snuffed out. This is where prompt engineering comes in.

Remember when everyone thought we were going to be prompt engineers? Well, prompt engineering, like evals, is one of a set of skills you need to wrangle models.

If fine-tuning really is dead, as per Anthropic’s Emmanuel Ameisen, prompt engineering has taken its place. Fine-tuning changes what the model knows by updating the model’s weights through retraining, creating a more specialized model. Prompt engineering changes what the model does with what it knows by changing only the inputs (data, prompts) you give the model.

The techniques that work also depend on the model class. Reasoning models like OpenAI o1 and Claude with extended thinking now handle the chain-of-thought work internally, which has retired a lot of the in-context-learning tricks people used in 2022-2024. But on a 3B local model, those tricks still have impact. A small model needs the help structuring its output that a reasoning model gives itself.

Revisit “what is good enough”

At this point, you’ve narrowed your competition to two models, and you should have a sense of which metrics are deal breakers and which are nice-to-haves. For me, I learned that smaller models consistently failed to conform to word counts, so I accepted that I’d have to use truncation on the UI side for some outputs.

You should also have an idea of what the bar is for metrics you’re still tracking:

Metric | Bar | Why it matters
JSON and reference structural validity | ≥99% | Outputs must be parseable, or they introduce bugs into the system.
Factual consistency | ≥95% | Anti-hallucination bar; the 5% slack accounts for genuine ambiguity and reasonable inference rather than outright invention.
p50 latency | ≤1500ms | Feels “instant enough” on an M-series Mac.
p95 latency | ≤3500ms | Comes in under the 4s mark in a worst-case scenario.
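Bars like these are easy to encode as data and check mechanically. The sketch below uses a nearest-rank percentile for latency and a table of thresholds that mirrors the bars above; the metric keys and helper names are assumptions. The example values are the few-shot Llama 3.2 3B figures reported later in this post.

```python
import math

# Direction and threshold for each metric, mirroring the bars above.
BARS = {
    "json_validity": (">=", 0.99),
    "reference_validity": (">=", 0.99),
    "factual_consistency": (">=", 0.95),
    "p50_latency_ms": ("<=", 1500),
    "p95_latency_ms": ("<=", 3500),
}


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def check_bars(metrics):
    """Map each metric name to True (meets the bar) or False (misses it)."""
    results = {}
    for name, (direction, bar) in BARS.items():
        value = metrics[name]
        results[name] = value >= bar if direction == ">=" else value <= bar
    return results


llama_few_shot = {
    "json_validity": 1.0,
    "reference_validity": 0.917,
    "factual_consistency": 0.929,
    "p50_latency_ms": 1296,
    "p95_latency_ms": 3998,
}
results = check_bars(llama_few_shot)
misses = [name for name, passed in results.items() if not passed]
```

Keeping the bars in one table means the same check can run against every model and prompt variant without rewriting the comparison.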

One variable per variant

Rather than generating a bunch of different prompts and hoping for the best, come up with some theories about what might drive the outputs in the right direction. I needed to reduce the references to conversations that didn’t exist.

I could do this by reformatting the input or showing the model “how it’s done” with examples. I could tell it what not to do. I could make it think long and hard before giving a response. Then I created four variants plus a control to run as experiments with Phoenix:

Variant | Lever pulled | What changed
Baseline (control) | None | Minimal instruction. Establishes the floor.
Reformatted input | Format | Same instructions, but the thread was reformatted from a JSON array to natural-language numbered messages.
Few shot | Demonstration | Same instructions, plus three worked input/output examples embedded in the prompt.
Explicit rules | Constraint | Same instructions, plus literal prohibitions (“no preamble,” “count words before responding,” “never invent messages”).
Chain of Thought | Process | Same instructions, restructured so the model identified key moments before writing the summary.
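One way to keep the experiment honest is to build every variant from the same base prompt and let each one override a single argument. The sketch below is an assumption about how that could look, not the prompts used in this post: the instruction text, example text, and thread fields are all placeholders.

```python
import json

# Placeholder prompt pieces; the real wording is not shown in this post.
BASE_INSTRUCTIONS = "Summarize this conversation thread."
FEW_SHOT = "Example thread: (worked input)\nExample summary: (worked output)"
RULES = "No preamble. Count words before responding. Never invent messages."
PROCESS = "First list the key moments, then write the summary."


def format_json(thread):
    """Baseline input format: the thread as a JSON array."""
    return json.dumps(thread)


def format_numbered(thread):
    """Reformatted input: natural-language numbered messages."""
    return "\n".join(
        f"{i}. {m['author']}: {m['text']}" for i, m in enumerate(thread, start=1)
    )


def build_prompt(thread, *, formatter=format_json, examples="", rules="", process=""):
    """Assemble the prompt, skipping any piece a variant leaves empty."""
    parts = [BASE_INSTRUCTIONS, rules, process, examples, formatter(thread)]
    return "\n\n".join(part for part in parts if part)


# Each variant overrides at most one lever relative to the baseline.
VARIANTS = {
    "baseline": {},
    "reformatted_input": {"formatter": format_numbered},
    "few_shot": {"examples": FEW_SHOT},
    "explicit_rules": {"rules": RULES},
    "chain_of_thought": {"process": PROCESS},
}

thread = [{"author": "ana", "text": "hi"}, {"author": "bo", "text": "hello"}]
prompts = {name: build_prompt(thread, **overrides) for name, overrides in VARIANTS.items()}
```

Because each entry changes one argument, any movement in a metric can be attributed to that lever rather than to an accidental second change.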

This isolation allowed me to measure how each prompt impacted each “definition of good.” Phoenix’s compare view lets you compare the same dataset, same evaluators, with the prompt as the variable.

Table of how five prompt variants performed, with deltas measured against the baseline. Baseline scored 77.4% length, 91.2% reference accuracy, 87.1% factual consistency, 1055ms latency. Reformatted input barely moved quality (+1.2 length, −1.1 ref, +0.6 factual) and added 606ms latency. Few-shot improved every quality metric (+10.0 length, +8.3 ref, +5.8 factual) for only +241ms. Explicit rules regressed across the board: −4.8 length, −6.6 ref, −3.4 factual, latency roughly flat. Chain of thought improved length by +5.9 but regressed reference accuracy by −5.3 and factual consistency by −1.9, while adding 638ms latency. Few-shot was the only variant whose quality gains were all positive.

All but one of the prompts were noise or actively harmful. If you were going on pure vibes, you might try to “improve” your prompt by explicitly telling the model what not to do without realizing how much it was degrading the outputs.

Few-shot was the standout, with quality improving across every metric. Llama 3.2 3B might not be good at following instructions, but it’s pretty good at imitating examples.

The new prompt got me closer, but there was still work to do to meet the bar.

Side-by-side comparison of Claude Sonnet against Llama 3.2 3B with the few-shot prompt, scored against six shipping bars. JSON validity: both 100%, both pass the ≥99% bar. Reference structural validity: Claude passes at 100%, Llama misses at 91.7%. Factual consistency: Llama 92.9% against a ≥95% bar; Claude has no score because it's the LLM-as-judge and can't fairly score itself. Length compliance: Claude 100%, Llama 93.3%, both pass the ≥90% bar. p50 latency: Claude 3046ms misses the ≤1500ms bar, Llama passes at 1296ms. p95 latency: Claude 4750ms and Llama 3998ms both miss the ≤3500ms bar. Llama beats Claude on every latency measurement but still falls short on reference structural validity and p95 latency, gaps that engineering will close. Scores are averaged across a 28-example golden dataset with three repetitions each.

Code is cheaper than inference

Claude Sonnet was capable of meeting my bar for everything but latency. Llama 3.2 3B was 16-25% faster, likely because of the time saved by not making a round trip to a remote server. However, even with the few-shot prompt, it still fell short on structural validity and length compliance.

Since code is cheaper than inference, I looked for deterministic solutions to these problems.

  • I used CSS truncation to lop off any stray words at the end of a summary. No one will miss them in the context they’re in.
  • The few-shot approach did bloat input tokens, putting the p95 latency over budget, but I was able to claw that back using a KV cache.
  • I added a post-hoc validator to strip any [ref:N] outside the valid message range.
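The post-hoc validator is the simplest of these fixes to show. Here is a minimal sketch, assuming messages are numbered from 1 and citations look like [ref:N]; the function name is an assumption, and the app's real validator may differ.

```python
import re

# Capture the number, and absorb whitespace before the marker so that
# removing a bad reference doesn't leave a stray space behind.
REF_PATTERN = re.compile(r"\s*\[ref:(\d+)\]")


def strip_invalid_refs(summary: str, message_count: int) -> str:
    """Remove any [ref:N] whose N falls outside 1..message_count."""

    def keep_or_drop(match):
        number = int(match.group(1))
        return match.group(0) if 1 <= number <= message_count else ""

    return REF_PATTERN.sub(keep_or_drop, summary)


raw = "Ana proposed a date [ref:2] and Bo agreed [ref:9]."
cleaned = strip_invalid_refs(raw, message_count=3)
```

This makes reference structural validity a property of the pipeline rather than of the model: an out-of-range citation can no longer reach the UI, whatever the model generates.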

It’s important to check a sampling of traces yourself. I dismissed the gap between Llama’s 92.9% factual consistency and Claude’s near 100% because human review confirmed it came from an overly strict judge, not actual hallucination. The SLM phrased things differently, but not factually incorrectly.

In this way, I was able to get the model to a place where it performed as well as or better than Claude Sonnet across the board, shaving almost 2 seconds off the p50 latency and saving myself a monthly bill:

Two-column comparison titled "Claude vs the shipped local configuration" — Claude Sonnet (cloud, left) against Llama 3.2 3B with the V3 few-shot prompt plus post-hoc safety nets (local, right). JSON validity: both 100%. Reference structural validity: both 100% — Llama achieves this via a post-hoc validator that strips any [ref:N] tokens outside the valid message range. Factual consistency: Llama 92.9%; Claude has no score because it's the LLM-as-judge and can't fairly score itself. Length compliance: both 100% — Llama achieves this via post-hoc word-count truncation enforcing the length spec deterministically. p50 latency: Claude 3046ms, Llama 1296ms — Llama more than twice as fast. p95 latency: Claude 4750ms, Llama under 3500ms — achieved with KV cache reuse on the few-shot prefix; V3 alone measured 3998ms. The shipped local config matches or beats Claude on every metric, with code closing the gaps the model couldn't.

The eval tells you where a model is capable. Use engineering to close the gap on what the model can’t do.

Life after capability evals

Now that the system was working, the next steps involved setting up mechanisms to get the model onto the user’s device, building features with progressive enhancement in mind (what happens while the model is MIA?), and setting up regression evals. These are what alert you when a new user input, a prompt edit, or a model change affects the model’s output. You can add them to your CI/CD to catch these shifts before they reach your customers.

Capability evals are often run once, but regression evals live with your testing suites forever. (Let me know if you’d like to hear about that side of the story, too.)

It’s expensive out there. Take this with you.

Every time you call a SOTA model in your stack, you should ask: does this really need a frontier model, or is it a vestige of Prototyping Big? Have you been using LLMs as placeholders for smaller models in your codebase? Can you tighten and streamline your inference?

I challenge you to audit one feature in your app this week. Could it run on a local model instead of a more expensive frontier model?

Set up Arize Phoenix, then run some of your own prompts and models against lighter ones using llama.cpp. The results might surprise you.

SLM capability evals and prompt engineering figure

The post How to ship a local LLM that matches frontier LLMs with evals and prompt engineering appeared first on Arize AI.

]]>
How to build LLM-as-a-Judge evaluators that hold up in production https://arize.com/blog/how-to-build-llm-as-a-judge-evaluators-that-hold-up-in-production/ Thu, 21 May 2026 14:00:45 +0000 https://arize.com/?p=28613 Learn how to design, calibrate, and run LLM-as-a-judge evaluators with fixed labels, human agreement checks, trace context, and Phoenix Evals.

The post How to build LLM-as-a-Judge evaluators that hold up in production appeared first on Arize AI.

]]>
Manual review doesn’t scale forever. At some point, if you’re building an LLM app or agent, you need a way to evaluate more than a handful of examples at a time. That’s where LLM-as-a-Judge can help.

But an LLM judge only works if you’re clear about what it’s judging.

Here’s the failure mode teams run into: a support agent tells a user their refund was processed. The judge marks the answer as “helpful.” Your dashboard shows a passing score. But when you open the trace, the agent never called the refund tool, never checked the customer account, and never verified the refund policy.

The answer sounded fine, but the system still failed. That’s the gap you’re trying to close.

In this guide, we will walk through how to build LLM-as-a-Judge evaluators that are useful in production: when to use code instead of a judge, how to define evaluation criteria, why fixed labels are often better than open-ended scores, how to compare judge results against human review, how to evaluate agent trajectories, and how to keep cost and latency under control.

TL;DR

  • Use code evaluators for deterministic checks: schema validity, exact match, regex match, latency, token count, tool name, and required fields.
  • Use LLM judges for semantic checks: correctness, faithfulness, helpfulness, safety, tone, user frustration, task completion, and tool-call appropriateness.
  • Treat the evals as the product. The judge model only applies the evals.
  • Prefer boolean or categorical labels when the decision is discrete. If the judge may not have enough evidence to decide, add a third category, such as insufficient_evidence or needs_review, rather than forcing a binary label.
  • Validate the judge against human labels before using it for gates, dashboards, or automated routing.
  • For agents, evaluate both the final answer and the trajectory: tool choice, tool arguments, redundant steps, recovery behavior, and session outcome.
  • Keep eval results close to traces, spans, sessions, datasets, and experiments. Phoenix Evals is designed for that workflow.
  • Track judge behavior over time. A judge can drift just like the application it evaluates.

What is LLM-as-a-Judge?

LLM-as-a-Judge is an evaluation pattern where a language model grades another model’s output, trace, or session against a set of criteria.

The judge might return:

  • a boolean label, like correct or incorrect
  • a category, like resolved, partially_resolved, or unresolved
  • an ordinal rating, like A through E or 1 through 5 with anchored definitions
  • an explanation
  • structured evidence

If you want the primer version before going deeper, start with Arize’s LLM-as-a-Judge overview.

A judge that can produce a grade is useful. But the real value comes when you can apply the same evaluation criteria across many examples, compare results over time, and route failures back into your engineering loop.

That’s where observability and evaluation work together.

Observability shows you what happened while evaluation tells you whether what happened was good enough. The improvement loop connects the two: observe the behavior, measure the failure, change the system, and verify that the fix worked.

That’s the bar. A judge that produces labels without enough context to inspect, calibrate, or act on them is not doing the job.

Use code when the check is deterministic

Okay, let’s start with the obvious question: do you need an LLM judge at all?

Because a lot of evals don’t.

If the output must be valid JSON, use a parser. If the answer must include a known ID, check the ID. If the agent must call lookup_customer_profile before refund_order, programmatically inspect the trace. If latency is above three seconds, no model needs to reason about that.

Code is cheaper, faster, and more predictable.
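
Here’s a minimal sketch of the trace checks described above. It assumes the tool calls have already been pulled out of the trace into dicts with hypothetical `name` and `latency_s` keys; real span schemas will differ, and the three-second limit is just the example threshold from the text.

```python
from typing import Sequence


def called_before(tool_calls: Sequence[dict], first: str, then: str) -> bool:
    """True if every `then` call is preceded by at least one `first` call.

    Vacuously True when `then` was never called.
    """
    seen_first = False
    for call in tool_calls:
        if call["name"] == first:
            seen_first = True
        elif call["name"] == then and not seen_first:
            return False
    return True


def within_latency(tool_calls: Sequence[dict], limit_s: float = 3.0) -> bool:
    """True if the summed latency of all calls is at or below the limit."""
    return sum(call["latency_s"] for call in tool_calls) <= limit_s


# Hypothetical, already-extracted tool calls for one run.
calls = [
    {"name": "lookup_customer_profile", "latency_s": 0.4},
    {"name": "refund_order", "latency_s": 1.1},
]
ordering_ok = called_before(calls, "lookup_customer_profile", "refund_order")
latency_ok = within_latency(calls)
```

Both checks return plain booleans, so they can be logged next to the trace like any other eval result.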

LLM judges are useful when the evaluation depends on meaning:

  • Did the answer actually address the user’s question?
  • Is the answer grounded in the retrieved context?
  • Did the model cite a source that supports the claim?
  • Is the response safe without being needlessly evasive?
  • Did the agent choose a reasonable tool sequence for the task?
  • Did the conversation end with the user getting what they needed?

Those are hard to express as rules. They’re also the questions that decide whether your system works.

Decision tree showing when to use deterministic code evaluators versus an LLM judge, based on whether the evaluation can be checked programmatically or depends on semantic meaning.

A common mistake is treating code evals and judge evals like they’re competing options when they’re not.

A support agent might use code evaluators for JSON validity, tool-call schema, latency, and escalation policy. It might also use LLM judges for answer correctness, tone, frustration, and task completion.

A RAG system might use code to check citation format, then use a judge to decide whether the cited source actually supports the answer.

Here’s a practical rule: if the answer can be checked without interpretation, use code. If the answer depends on meaning, use a judge. If the result will drive an automated action, attach the judge output to trace context so someone can inspect why it fired.

Here’s the kind of evaluator that should stay in code:


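import json

from jsonschema import ValidationError, validate

# A tool call must be a JSON object with a string name and an arguments object.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool_name", "arguments"],
    "properties": {
        "tool_name": {"type": "string"},
        "arguments": {"type": "object"},
    },
}

# Only these tools are permitted for this agent.
ALLOWED_TOOLS = {"lookup_customer_profile", "refund_order"}


def valid_tool_call(output: str) -> bool:
    """Return True if output is schema-valid JSON that names an allowed tool."""
    try:
        payload = json.loads(output)
        validate(payload, TOOL_CALL_SCHEMA)
        return payload["tool_name"] in ALLOWED_TOOLS
    except (json.JSONDecodeError, ValidationError, KeyError):
        # Malformed JSON or a schema violation counts as an invalid call.
        return False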

This is a parser and schema problem, so there’s no need for a judge. Instead, you can save the LLM judge for the question the parser can’t answer: did the agent call the right tool for the user’s actual request?

Design the evaluation criteria before you write the prompt

Most bad judges fail before the model is called.

The prompt says “rate helpfulness from 1 to 5” or “determine whether the answer is good.” The model returns a confident score, and everyone moves on.

That score is not meaningful because “good,” “helpful,” and “correct” aren’t evaluation criteria until you define what evidence counts, which edge cases matter, and how to resolve ambiguity.

(This is why Arize’s guide to building a custom LLM evaluator with a benchmark dataset starts with label definitions and annotated examples instead of model choice.)

Strong evaluation criteria usually include five pieces:

  1. The evaluation target: what quality are you measuring?
  2. Inputs: what evidence can the judge use?
  3. Labels or scores: what outputs are allowed?
  4. Decision rules: how should the judge handle edge cases?
  5. Examples: what does each label look like in practice?

Comparison between a weak evaluation prompt and a strong judge rubric, highlighting required components like evaluation target, inputs, labels, decision rules, and examples.

For a customer support agent, task completion criteria might look like this:


Evaluation target: Did the agent resolve the user's support request?

Allowed labels:

- resolved: The user received a correct, actionable answer or the requested action was completed with the required supporting evidence.
- partially_resolved: The agent made progress but left a required step incomplete.
- unresolved: The agent failed to answer, gave incorrect guidance, skipped required evidence, or created a loop.
- insufficient_evidence: The trace does not contain enough evidence to score task completion.

Decision rules:
- Do not mark "resolved" if the user had to repeat the same request.
- Do not mark "resolved" unless required tool evidence is present.
- Do not mark "unresolved" just because the agent escalated; evaluate escalation quality separately.
- If the final answer is correct but the agent used an unnecessary tool, mark task completion separately from efficiency.
- If the answer is plausible but unsupported by tool results, mark unresolved.
- If required tool results or session history are missing, mark insufficient_evidence rather than guessing.

The key is that these evaluation criteria measure one thing: task completion.

Escalation quality should be a separate evaluator. A correct escalation can be the right product behavior, but it answers a different question from whether the agent itself resolved the request. Likewise, insufficient_evidence is a data-quality or review-routing outcome, not a task-completion score.

Good eval design keeps those dimensions separate: task completion, escalation quality, evidence availability, and efficiency may all matter, but they should not be averaged into one ambiguous score.
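
Fixed labels only help if the code downstream enforces them. Here’s a minimal sketch of one way to do that: validate whatever the judge returns against the allowed set and route anything else to review. The `parse_label` helper and the `needs_review` fallback are illustrative choices, not part of any library.

```python
# Labels copied from the task completion criteria above.
ALLOWED_LABELS = {
    "resolved",
    "partially_resolved",
    "unresolved",
    "insufficient_evidence",
}


def parse_label(raw: str, fallback: str = "needs_review") -> str:
    """Normalize a judge's raw label; return the fallback if it is not allowed."""
    label = raw.strip().lower().replace(" ", "_").strip("\"'.")
    return label if label in ALLOWED_LABELS else fallback


clean = parse_label(" Partially resolved. ")
rejected = parse_label("mostly fine")
```

Anything the judge invents outside the label set ends up in a review queue instead of silently skewing the metric.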

Start with the decision the eval will drive

The output format is more than an implementation detail: it changes the reliability of the evaluator.

In practice, teams tend to use four output types:

  • Boolean: true or false
  • Categorical: failure_type, escalation_reason, or support_intent
  • Ordinal categorical: resolved, partially_resolved, unresolved
  • Numeric: a continuous score such as 0.0 to 1.0 or 1 to 10

Spectrum of LLM judge output types from boolean and categorical to ordinal and numeric, showing tradeoffs between stability and calibration complexity.

Use the simplest output type that matches the decision.

Boolean labels work well for policy checks and gates: hallucinated or factual, valid or invalid, in scope or out of scope, user frustrated or not frustrated. They’re easier to calibrate, easier to aggregate, and easier to turn into deployment gates.

Boolean labels don’t work when the judge might lack enough evidence to decide. In that case, don’t force true or false. Add uncertain, insufficient_evidence, or needs_review. (Forced binary labels make dashboards look cleaner while making the measurement worse.)

Categorical labels work when there are a few distinct states. They’re useful for failure analysis because they preserve more information than pass/fail without pretending to be continuous.

Ordinal labels work when gradation matters, but each level needs an anchor. resolved, partially_resolved, and unresolved can be treated as ordered labels if they all describe the same underlying dimension. Do not add a label like escalated to that same scale unless escalation is explicitly defined as a resolution state in your product logic.

Open numeric scores are the most tempting and the easiest to misuse.

In our own testing at Arize, numeric scores produced plateaus, discontinuous jumps, and model-specific scale drift. A judge could separate clean text from badly corrupted text, but fail to distinguish medium levels of corruption. Changing the scale from 1-to-10 to 0-to-1 changed the distribution without improving the measurement. Reasoning models reduced variance in some cases, but they did not magically turn token outputs into calibrated instruments. (We wrote more about this in Testing binary vs score evals on the latest models.)

That doesn’t mean numeric scores are always wrong. Instead, it means they need more discipline. Use them when you have a clear underlying continuum, a calibrated validation set, and a reason to preserve fine-grained differences. Otherwise, boolean and categorical labels are usually more stable.

Run the evaluator where developers can inspect the trace

A judge label is only useful if the team can inspect the execution record behind it.

If a judge says an answer was unsupported, you need to see the retrieved documents, prompt version, model output, tool calls, intermediate steps, and final response. If a judge says an agent failed to complete a task, you need to see whether the issue came from planning, retrieval, tool selection, tool arguments, tool results, or final response generation.

That’s why eval results should live near traces, spans, sessions, datasets, and experiments.

Our open source AI observability project Phoenix makes this workflow concrete. Phoenix Evals gives teams a starting point without building every evaluator from scratch, and the Phoenix Evals documentation covers the core primitives. For common tasks, you can use built-in evaluators such as faithfulness, correctness, document relevance, refusal, tool invocation, tool selection, and tool response handling.

For application-specific behavior, try building a custom evaluator from real traces:

  1. Pull a representative set of examples from production or pre-production traces.
  2. Annotate those examples with the labels your team actually uses.
  3. Write evaluation criteria with fixed labels and decision rules.
  4. Run the judge on the labeled set.
  5. Inspect disagreements in Phoenix.
  6. Tighten the evaluation criteria or add examples.
  7. Log eval results back to traces, spans, sessions, datasets, or experiments.

In production, the evaluator should run against examples pulled from traces, not hand-written examples in a notebook.

Once the eval is running, the workflow isn’t “look at the average score.” Instead, it’s:

  • Filter failed examples.
  • Inspect the trace and judge explanation.
  • Group failures by cause.
  • Add representative failures to a dataset.
  • Re-run the dataset before prompt, model, retrieval, or tool changes.
  • Track whether the fix improved the target failure without introducing a new one.

That’s how an LLM judge becomes part of the engineering loop.
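
The filter-and-group steps above can be sketched in a few lines. This assumes eval results have been exported as dicts with hypothetical `trace_id`, `label`, and reviewer-assigned `cause` fields; those names are placeholders, not a Phoenix schema.

```python
from collections import Counter, defaultdict


def group_failures(results, passing=frozenset({"resolved"})):
    """Group failing eval results by cause.

    Returns (counts per cause, trace IDs per cause).
    """
    by_cause = defaultdict(list)
    for result in results:
        if result["label"] not in passing:
            cause = result.get("cause") or "unlabeled"
            by_cause[cause].append(result["trace_id"])
    counts = Counter({cause: len(ids) for cause, ids in by_cause.items()})
    return counts, dict(by_cause)


# Hypothetical exported results.
results = [
    {"trace_id": "t1", "label": "unresolved", "cause": "missing_tool_call"},
    {"trace_id": "t2", "label": "resolved", "cause": None},
    {"trace_id": "t3", "label": "partially_resolved", "cause": "missing_tool_call"},
    {"trace_id": "t4", "label": "unresolved", "cause": "wrong_tool_arguments"},
]
counts, traces_by_cause = group_failures(results)
```

The trace IDs under the most common cause are natural candidates for the regression dataset.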

Choose the judge model after the evaluation criteria are stable

Here’s something you may find counterintuitive: the strongest model isn’t always the best judge.

  • A frontier model may improve agreement on complex reasoning tasks, but it can be too slow or expensive for broad online monitoring.
  • A smaller model may be enough for binary labels with clear evidence.
  • An open model may be necessary when data cannot leave a controlled environment.
  • A cross-family judge may reduce self-preference when you are comparing outputs from one model provider.
  • Open evaluator models such as Prometheus 2 are useful to know because they show a separate path: train or choose a model specialized for evaluation rather than using a general assistant model for everything.

You should validate your model choice the same way you validate the prompt:

  • Run each candidate judge on the same labeled validation set.
  • Compare agreement with human labels, not just average score.
  • Inspect disagreements by failure type.
  • Measure latency, token usage, and cost per evaluated example.
  • Re-run a fixed canary set when the judge model changes.

If a stronger judge improves agreement by one point but doubles latency and cost, it may still be the wrong choice. If it catches the exact failure that would block a release, it may be worth it. The decision depends on the action tied to the eval.
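
As a rough sketch of that comparison, the helper below summarizes one candidate judge on a shared labeled set. All labels, latencies, and costs here are made-up placeholders; in practice they would come from your own validation runs.

```python
def summarize_judge(name, human_labels, judge_labels, latencies_s, total_cost_usd):
    """Summarize agreement, latency, and cost for one candidate judge."""
    n = len(human_labels)
    if n == 0 or len(judge_labels) != n or len(latencies_s) != n:
        raise ValueError("inputs must be non-empty and the same length")
    agreements = sum(h == j for h, j in zip(human_labels, judge_labels))
    return {
        "judge": name,
        "agreement": agreements / n,
        "avg_latency_s": sum(latencies_s) / n,
        "cost_per_example_usd": total_cost_usd / n,
    }


# Hypothetical validation data for two candidate judges.
human = ["resolved", "unresolved", "resolved", "unresolved"]
small = summarize_judge(
    "small-judge", human,
    ["resolved", "unresolved", "resolved", "resolved"],
    [0.5, 0.5, 0.5, 0.5], 0.004,
)
large = summarize_judge(
    "large-judge", human,
    ["resolved", "unresolved", "resolved", "unresolved"],
    [2.0, 2.0, 2.0, 2.0], 0.04,
)
```

Putting the three numbers side by side makes the tradeoff explicit. Whether the extra agreement is worth the extra latency and cost still depends on the action the eval drives.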

Ask for explanations, but don’t confuse them with truth

A judge’s explanation is useful for debugging, but it’s not proof.

Many LLM-as-a-Judge workflows ask the model to return both a label and an explanation. That is a good default. Explanations help reviewers see why a judge failed, identify ambiguous evaluation criteria language, and build better calibration examples.

They also make it easier to inspect patterns across failures. The judge may be overweighting tone, ignoring context, or treating missing citations as hallucinations.

But explanations can rationalize a bad label. The model may produce a plausible reason after making the wrong call. That’s why the output contract should separate fields:


{
  "label": "unsupported",
  "explanation": "The answer says the refund will arrive in 24 hours, but the policy context only says refunds are usually processed within 5 business days.",
  "evidence": ["refunds are usually processed within 5 business days"]
}

For simple checks, brief reasoning is enough. But for complex checks, you should ask for evidence. If the judge claims an answer is unsupported, have it identify the unsupported claim and the relevant source text. If it says a tool call was wrong, have it name the tool that should have been used and why.
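
Requiring evidence also gives you something code can verify. Here’s a minimal sketch, assuming the judge returns the JSON contract shown above and that quoted evidence should appear in the source context. The `check_judge_output` helper is hypothetical, and the case- and whitespace-insensitive substring match is a deliberately simple stand-in for whatever matching your data needs.

```python
import json
from typing import List

REQUIRED_FIELDS = ("label", "explanation", "evidence")


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so small formatting differences match."""
    return " ".join(text.lower().split())


def check_judge_output(raw: str, context: str) -> List[str]:
    """Return a list of contract problems; an empty list means the output passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    evidence = payload.get("evidence", [])
    if not isinstance(evidence, list):
        return problems + ["evidence is not a list"]
    haystack = _normalize(context)
    for quote in evidence:
        if not isinstance(quote, str) or _normalize(quote) not in haystack:
            problems.append(f"evidence not found in context: {quote!r}")
    return problems


# Hypothetical policy context and judge output.
context = "Refunds are usually processed within 5 business days of approval."
raw = json.dumps({
    "label": "unsupported",
    "explanation": "The answer promises 24 hours; the policy says 5 business days.",
    "evidence": ["refunds are usually processed within 5 business days"],
})
problems = check_judge_output(raw, context)
```

A judge whose quotes cannot be found in the context is a good candidate for human review, whatever label it returned.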

Research such as G-Eval and the MT-Bench and Chatbot Arena judge paper shows that structured evaluation steps can improve alignment with human judgments in some settings. But the lesson isn’t “always ask the judge to think step by step.” Explicit chain-of-thought increases tokens, latency, and complexity. It can help when the judge must reason through multiple dependent criteria, but it’s often unnecessary for simple classification tasks.

Use reasoning because it improves auditability, but measure whether it also improves agreement with human labels.

Calibrate against humans before scaling

The first version of a judge should always be treated as a hypothesis.

In practice, the first judge rarely fails because the model is weak. More often, it fails because the evaluation criteria are underspecified. Reviewers disagree on edge cases, the judge rewards fluent unsupported answers, or a single aggregate score hides the distinction between correctness, grounding, and task completion.

Build a small validation set from real examples. Include obvious passes, obvious failures, and the cases that triggered disagreements among your team. Label them with human reviewers. If the task is ambiguous, collect more than one human label per example and keep the disagreements.

Human disagreement is often the signal that your evaluation criteria are underspecified. Phoenix’s annotation workflow is built around this idea: use human labels where they matter most, then turn them into reusable evaluator and experiment data.

After that, you should run the judge and compare:

  • Accuracy for boolean or categorical labels
  • Precision, recall, and F-score for failure detection
  • Cohen’s kappa or weighted kappa when measuring agreement beyond chance
  • Confusion matrices to see which labels collapse into each other
  • Rank correlation when the output is ordinal
  • Disagreement slices by domain, prompt version, model, user segment, and trace type

(Arize’s paper reading on Judging the Judges is a good companion here because it walks through percent agreement and Cohen’s kappa for LLM judges.)
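
For reference, here’s a small dependency-free sketch of two of the metrics above: Cohen’s kappa and a confusion matrix. The labels are toy data for illustration. For real work, a tested library implementation is the safer choice.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def confusion_matrix(human, judge):
    """Count (human_label, judge_label) pairs."""
    return Counter(zip(human, judge))


# Toy data: 10 examples, 8 agreements.
human = ["resolved"] * 6 + ["unresolved"] * 4
judge = ["resolved"] * 5 + ["unresolved"] + ["resolved"] + ["unresolved"] * 3
kappa = cohens_kappa(human, judge)
matrix = confusion_matrix(human, judge)
```

On this toy data, raw agreement is 80% but kappa is about 0.58, because two raters who each say “resolved” 60% of the time would agree on 52% of examples by chance alone.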

A judge with 85% overall agreement may still be unusable if it misses the failures you care about. A hallucination judge that catches obvious unsupported answers but misses overconfident extrapolations will look better than it is. A safety judge that over-flags harmless edge cases may be acceptable for review routing but too noisy for deployment gating.

Suppose you label 100 support-agent sessions for task completion:

  • 55 are resolved
  • 20 are partially_resolved
  • 15 are unresolved
  • 10 do not contain enough trace evidence to judge

The first judge agrees with humans on 82 examples. That sounds good until you inspect the 18 disagreements. It marked 9 partially_resolved sessions as resolved because the final answer sounded helpful, even though the agent never completed the required account lookup. It marked 4 incomplete traces as unresolved because the evaluation criteria did not explain when escalation is the right outcome. And it marked 5 examples differently because the trace was missing a tool result.

Those are three different fixes:

  • Add a decision rule that a request is not resolved unless required tool evidence is present.
  • Treat missing trace evidence as insufficient_evidence, not as an agent failure.
  • Add examples where the final answer sounds plausible but is unsupported by the trace.
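
The same counts show why overall agreement can mislead. The quick calculation below uses only the numbers from this example. Because at least 9 of the 20 partially_resolved sessions were labeled resolved, the judge’s recall on that label can be no higher than 11 out of 20.

```python
# Human label counts and judge results from the worked example above.
human_counts = {
    "resolved": 55,
    "partially_resolved": 20,
    "unresolved": 15,
    "insufficient_evidence": 10,
}
agreements = 82
partial_marked_resolved = 9

total = sum(human_counts.values())
overall_agreement = agreements / total

# Upper bound: other disagreements may also involve this label.
partial_total = human_counts["partially_resolved"]
partial_recall_upper_bound = (partial_total - partial_marked_resolved) / partial_total
```

Overall agreement is 82%, but recall on partially_resolved is at most 55%. That gap is what the confusion matrix surfaces and a single accuracy number hides.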

The point of calibration is to learn how the judge fails before you use it to gate a release or monitor production.

Ideally, calibration should be an iteration loop that consists of the following:

  1. Label a representative set.
  2. Run the judge.
  3. Review disagreements.
  4. Update the evaluation criteria, examples, or output labels.
  5. Re-run on the same set and a holdout set.
  6. Track agreement over time.

And remember: some labels are genuinely indeterminate. A 2025 paper on validating LLM-as-a-Judge systems under rating indeterminacy shows why forced-choice labels can make judge validation look cleaner than it is. In practice, that is another argument for needs_review, multi-label annotations, and keeping human disagreements visible.

That’s the difference between an eval and an improvement loop.

Design for known judge biases

LLM judges have failure modes, and ignoring them doesn’t make them go away.

Here are some common judge failure modes:

  • Position bias: in pairwise comparisons, the judge may prefer the first or second answer because of placement.
  • Verbosity bias: the judge may reward longer answers even when the extra text is redundant or wrong.
  • Self-preference bias: a judge may prefer outputs that resemble its own model family or style.
  • Authority bias: confident language may be rewarded over calibrated uncertainty.
  • Evaluation criteria drift: the judge’s behavior changes when prompts, models, or input distributions change.
  • Hallucinated reasoning: the explanation sounds plausible but cites evidence incorrectly.

Table mapping common LLM judge biases such as position bias and verbosity bias to mitigation strategies like randomization, rubric constraints, and evidence requirements.

Treat the judge like production code and production measurement infrastructure: version it, test it, and monitor it.
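
Position bias is one of the easier ones to test for. A common mitigation is to run each pairwise comparison in both orders and only accept verdicts that survive the swap. Here’s a minimal sketch, where `judge` is a hypothetical callable that wraps your model call and returns "first" or "second":

```python
from typing import Callable, Optional


def debiased_preference(
    judge: Callable[[str, str], str], answer_a: str, answer_b: str
) -> Optional[str]:
    """Run a pairwise judge in both orders.

    Returns "a" or "b" when both orders agree, or None when the verdict
    flips with position (a sign of position bias or a genuine toss-up).
    """
    forward = judge(answer_a, answer_b)  # answer_a shown first
    reverse = judge(answer_b, answer_a)  # answer_b shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_reverse = "b" if reverse == "first" else "a"
    return winner_forward if winner_forward == winner_reverse else None


# Stub judges standing in for real model calls.
def always_first(first: str, second: str) -> str:
    return "first"


def prefers_shorter(first: str, second: str) -> str:
    return "first" if len(first) <= len(second) else "second"


biased_result = debiased_preference(always_first, "short", "a much longer answer")
stable_result = debiased_preference(prefers_shorter, "short", "a much longer answer")
```

The share of comparisons that return None is itself a useful metric to track for the judge.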

Evaluate agent trajectories, not just final answers

Agents fail in ways single-turn LLM apps do not.

A chatbot can be judged mostly on the final answer, but an agent has a path. It plans, calls tools, observes results, updates state, retries, escalates, or loops. The final response may look fine while the trajectory is wasteful, risky, or unsupported.

That’s why Arize has spent so much time on tracing and evaluating agents, not just scoring final text.

In our experience, it’s best to evaluate agents at multiple levels:

Final answer quality

Did the answer solve the user’s problem? Was it correct, grounded, and complete?

Tool selection

Did the agent choose the right tool for the task? Did it call tools it did not need?

Tool arguments

Were the arguments valid, safe, and specific? Did the agent hallucinate IDs, dates, filters, or customer attributes?

Tool response handling

Did the agent correctly interpret the result? Did it ignore an error? Did it retry intelligently?

Trajectory efficiency

Did it take the shortest reasonable path, or did it loop through redundant calls?

Session outcome

Across the full conversation, did the user reach the goal?

Some of these can be code evaluators. If the reference trajectory requires search_docs and lookup_policy, you can check whether those calls occurred. Trajectory evaluators usually need several matching modes for exactly this reason: strict, unordered, subset, and superset. If order matters, use strict matching. If the same tools can be called in different orders, use unordered matching. If extra tools are acceptable, use superset. If extra tools are risky, use subset.
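
Here’s one way those four modes could be implemented over lists of tool names. This is a sketch of the semantics as described above, not the behavior of any particular library. Real trajectory evaluators may also compare arguments or treat repeated calls differently.

```python
from collections import Counter
from typing import Sequence


def trajectory_match(actual: Sequence[str], reference: Sequence[str], mode: str = "strict") -> bool:
    """Compare an agent's tool-call sequence against a reference trajectory."""
    if mode == "strict":
        # Same tools, same order.
        return list(actual) == list(reference)
    actual_counts, reference_counts = Counter(actual), Counter(reference)
    if mode == "unordered":
        # Same tools and counts, any order.
        return actual_counts == reference_counts
    if mode == "superset":
        # Every reference call is present; extra calls are allowed.
        return all(actual_counts[tool] >= n for tool, n in reference_counts.items())
    if mode == "subset":
        # No calls beyond the reference; missing calls are allowed.
        return all(reference_counts[tool] >= n for tool, n in actual_counts.items())
    raise ValueError(f"unknown mode: {mode}")


reference = ["search_docs", "lookup_policy"]
reordered = ["lookup_policy", "search_docs"]
with_extra = ["search_docs", "lookup_policy", "refund_order"]
```

With this reference, `reordered` fails strict matching but passes unordered, while `with_extra` passes superset and fails subset.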

Use an LLM judge when a trajectory can be reasonable without matching the reference exactly.

For multi-turn systems, session-level analysis matters too. Arize’s guide to agent session summaries shows how summarizing trajectories can make review and eval workflows easier to scale.

Common mistakes to avoid

Most LLM-as-a-Judge failures come from design shortcuts, not model limitations.

With that said, avoid these mistakes:

  • Using 1-to-10 scores without anchored definitions. The judge will invent its own scale, and the scale may change across models or prompts.
  • Judging final answers without trace context. You may miss the tool call, retrieval, or policy failure that caused the issue.
  • Treating judge explanations as ground truth. Explanations are debugging aids, not proof that the label is correct.
  • Skipping human calibration. Without human labels, you do not know whether the judge agrees with the standard your team actually uses.
  • Using the same judge for every job. Monitoring, gating, routing, curation, and prompt comparison have different precision and recall requirements.
  • Running expensive judges everywhere. Route by risk, uncertainty, and decision value.
  • Collapsing multiple failure modes into one score. Correctness, grounding, task completion, safety, and efficiency should often be separate evals.
  • Forcing binary decisions when evidence is missing. Add needs_review rather than making the judge guess.

Here’s an uncomplicated fix: define the decision, write the eval, use fixed labels, calibrate against humans, inspect failures in context, and keep measuring the judge over time.

Validate the judge against the job it performs

A judge is good when it supports better engineering decisions.

That sounds obvious, but it changes what you measure. You aren’t trying to prove that the judge is intelligent. You’re trying to prove that it’s reliable enough for a specific job.

For a deployment gate, tune the judge to the action it drives. If the judge automatically blocks a release, false positives slow teams down and false negatives ship regressions.

For monitoring, the judge needs stable trend detection. Individual labels can be imperfect if aggregate shifts are meaningful.

For dataset curation, the judge needs useful disagreement routing. It should surface examples your team will need to inspect.

For prompt iteration, the judge needs paired comparison reliability. It should detect whether version B is actually better than version A on the examples that matter.

Measure the judge against that job.

Here are some useful questions to consider:

  • Does the judge agree with humans on the examples we care about?
  • Where does it disagree, and are those disagreements acceptable?
  • Does agreement hold across user segments, domains, and model versions?
  • Does the judge produce stable labels across repeated runs?
  • Does changing the judge model change historical trends?
  • Does the judge catch known regressions in a canary set?
  • Does the judge explanation point to the fix?
  • Does the metric correlate with user feedback, escalation rate, retention, or another downstream signal?

The last question is easy to skip. But pro tip: don’t.

An eval can agree with human reviewers and still be irrelevant to the product. If the judge says “helpfulness improved” but users abandon the flow more often, the eval is measuring the wrong thing or weighting the wrong criteria.

This is why production evals should live near traces, sessions, experiments, and user feedback. The judge label is useful, but the context around the label is what makes it actionable.
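
Some of the questions above can be answered with very little code. As a sketch of the repeated-run stability check, assume you have run the same judge several times on each example and collected the labels. The metric below, the average share of runs that match each example’s most common label, is one simple choice among several.

```python
from collections import Counter
from typing import Sequence


def label_stability(labels_per_example: Sequence[Sequence[str]]) -> float:
    """Mean share of runs that agree with each example's most common label."""
    shares = []
    for labels in labels_per_example:
        if not labels:
            raise ValueError("each example needs at least one label")
        top_count = Counter(labels).most_common(1)[0][1]
        shares.append(top_count / len(labels))
    return sum(shares) / len(shares)


# Hypothetical labels from three runs of the same judge on two examples.
runs = [
    ["resolved", "resolved", "resolved"],
    ["resolved", "unresolved", "resolved"],
]
stability = label_stability(runs)
```

A score of 1.0 means the judge never changed its mind. Examples with low per-example shares are worth inspecting for ambiguous criteria.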

The pattern to remember

LLM-as-a-Judge works best when it’s treated as measurement infrastructure, and not a magic grader or a replacement for human judgment. (Or worse, a single score that decides whether your agent is good.)

The strongest systems combine deterministic checks, LLM judges, human calibration, and trace context. Code catches what code can catch, judges handle semantic judgment, and humans calibrate the judges and resolve ambiguity. Observability shows where the judgment came from and what to fix next.

Manual evaluation doesn’t scale. But neither does trusting an untested judge.

That makes the work of designing the judge, calibrating it, and watching it after it ships critical.

That’s how evals become more than a dashboard number. When done right, they should become part of the agent feedback loop: observe the behavior, measure the failure, fix the system, and verify that the fix held.

The post How to build LLM-as-a-Judge evaluators that hold up in production appeared first on Arize AI.

]]>
What we learned testing 7 models under the same agent harness https://arize.com/blog/what-we-learned-testing-7-models-under-the-same-agent-harness/ Wed, 20 May 2026 17:57:46 +0000 https://arize.com/?p=28528 Model swaps look like configuration changes, but they behave more like product migrations. A new model may be cheaper, faster, easier to get capacity for, or stronger on public benchmarks....

The post What we learned testing 7 models under the same agent harness appeared first on Arize AI.

]]>
Model swaps look like configuration changes, but they behave more like product migrations.

A new model may be cheaper, faster, easier to get capacity for, or stronger on public benchmarks. The API call may barely change. From the outside, swapping models can look as simple as changing a model name in a config file.

But the product question is harder: if you change only the model, does the system still behave the way users expect?

That question matters more for agents than for single-turn prompts. An agent is not just a model call. It is a model operating inside a harness: instructions, tools, schemas, state, retries, rate-limit handling, output expectations, and evals.

We ran the same pi.dev + GitHub CLI agent harness across seven model targets: Sonnet 4.0, Sonnet 4.5, Sonnet 4.6, GPT-5, GPT-5.5, Gemini 3.1 Pro, and Gemini 3 Flash. Only the model changed.

The results were mixed in the way production results usually are. Correctness stayed relatively close across models, but operational behavior moved more: latency, tool-call counts, retry behavior, and timeout risk.

That is the real lesson: a model swap is not safe just because the final answer still looks right. You need to know what the system has to do to get there.

TL;DR

We tested seven models on the same GitHub agent tasks using the same pi.dev + GitHub CLI harness and Arize evaluator setup. The goal was to see what changed when only the model changed.

Key findings:

  • Correctness stayed relatively close, but not identical. The harnessed runs clustered between 79.6% and 85.1% correctness.
  • Operational behavior moved more than final-answer quality. Models differed significantly in latency, tool calls, retries, and timeout risk.
  • Final-answer evals are not enough. Two models can both get the right answer while imposing very different cost, latency, and fragility on the system.
  • The harness appeared to reduce drift, but did not eliminate it. The result supports the stable harness hypothesis, but does not prove the harness caused all of the stability.
  • Model swaps should be treated like migrations. Before routing production traffic to a new model, evaluate both answer quality and the path the agent took to get there.

The experiment

We used a controlled GitHub-ops benchmark. The code and task harness live in acme-agent-evals, and the fixture is a fictional Acme SDK repository seeded with issues, pull requests, labels, milestones, comments, and realistic repo state.

The tasks are the kinds of things a GitHub agent should be able to do:

  • Count open bugs.
  • Find issues with zero comments.
  • Identify PRs without reviews.
  • Compare milestone contents.
  • Audit labels, linked PRs, and issue metadata.

The important part: the harness stayed fixed.

Every model got the same task text, the same fixture repo, the same gh CLI access, the same pi.dev runner, the same GitHub CLI skill, and the same Arize evaluator suite. Only the model target changed.

Main harnessed sweep showing tasks, fixture repo, Pi runner, GitHub CLI skill, model target, and Arize evals
Main harnessed sweep: tasks, fixture repo, Pi runner, GitHub CLI skill, dataset, and scoring stayed fixed while the model target changed.

We ran seven models:

  • Claude Sonnet 4.0
  • Claude Sonnet 4.5
  • Claude Sonnet 4.6
  • GPT-5
  • GPT-5.5
  • Gemini 3.1 Pro
  • Gemini 3 Flash

The main sweep covered 19 read and analysis tasks. Each model ran the full task set ten times.

19 tasks x 7 models x 10 runs = 1,330 attempted examples

The evaluators scored correctness, output quality, efficiency, latency, and tool adherence, with correctness as the primary metric. For operational behavior, we put more weight on raw measurements: average latency seconds, tool calls, tool errors, and timeouts.

In Arize, evaluators are customizable checks that return structured results such as a score, label, and explanation. For this benchmark, correctness scored answers against task-specific expected outputs. Output quality used an LLM judge for harder analysis tasks, scoring completeness, accuracy, and organization. Efficiency was a tool-call-count heuristic, latency score bucketed raw latency seconds, and tool adherence checked whether the run stayed on the expected GitHub CLI tool path.

The tables below use 10 runs per model, for 1,330 scored task attempts. Timeouts and failed attempts are included rather than removed.

Arize dashboard showing the clean sweep with seven models, ten runs per model, and 19 examples per run
The clean Arize sweep: seven models, ten runs per model, and 19 examples per run.

What we found

Result 1: correctness moved, but stayed in a relatively tight band

In the harnessed sweep, the seven model targets clustered between 79.6% and 85.1% correctness. Sonnet 4.6 was highest in this run; GPT-5 was lowest. Gemini 3 Flash landed at 82.3%, and Gemini 3.1 Pro landed at 81.1%.

That matters: under the fixed harness, behavior was relatively stable across providers, but the models were not equivalent. Model differences were measurable, yet contained enough to compare meaningfully.

Harnessed correctness by model chart with labeled bars for Sonnet, GPT, and Gemini model families
Harnessed correctness by model. Bars use a zero-based scale, are labeled by model, and colors group Sonnet, GPT, and Gemini families.

| Model | Runs | Examples | Correctness | Output quality | Efficiency | Latency score | Tool adherence | Avg latency seconds | Avg tool calls |
|---|---|---|---|---|---|---|---|---|---|
| Sonnet 4.6 | 10 | 190 | 85.1% | 93.9% | 98.2% | 100.0% | 100.0% | 9.5 | 1.45 |
| Sonnet 4.0 | 10 | 190 | 83.5% | 94.7% | 97.6% | 96.4% | 100.0% | 17.0 | 2.36 |
| Sonnet 4.5 | 10 | 190 | 83.4% | 95.0% | 98.1% | 97.8% | 100.0% | 16.1 | 1.84 |
| Gemini 3 Flash | 10 | 190 | 82.3% | 94.8% | 93.4% | 96.1% | 100.0% | 20.2 | 5.82 |
| Gemini 3.1 Pro | 10 | 190 | 81.1% | 93.2% | 95.2% | 92.3% | 100.0% | 28.7 | 2.64 |
| GPT-5.5 | 10 | 190 | 80.5% | 93.2% | 93.0% | 96.1% | 100.0% | 18.4 | 3.39 |
| GPT-5 | 10 | 190 | 79.6% | 92.7% | 93.1% | 87.1% | 100.0% | 32.2 | 3.14 |

The best and worst harnessed model slices were separated by 5.5 percentage points of correctness.

That is the first useful result: in this setup, changing only the model did not completely change answer-level behavior, but it also did not leave behavior unchanged.

That does not mean the models behaved the same. Output quality and efficiency also stayed high, but the larger practical split appeared in operational behavior.

Output quality and efficiency chart for the seven harnessed model targets
Output quality and efficiency stayed high across the harnessed sweep. These are supporting evaluator scores; correctness remains the primary quality metric.

Result 2: final-answer correctness hid larger operational drift

The bigger differences showed up in how the models got to the answer.

Sonnet 4.6 averaged 9.5 seconds per task and 1.45 tool calls. GPT-5 averaged 32.2 seconds and 3.14 tool calls. GPT-5.5 was faster than GPT-5 at 18.4 seconds, but still averaged 3.39 tool calls. Gemini 3 Flash landed at 20.2 seconds and used the most tools at 5.82 calls per task. Gemini 3.1 Pro averaged 28.7 seconds and 2.64 tool calls.

Operational behavior by model chart showing average latency seconds and average tool calls
Operational behavior still carried a model signature. Both panels use zero-based axes: average latency seconds on the left, average tool calls per task on the right.

That matters in production.

A user may see the same final answer while the system absorbs very different latency, retry risk, timeout risk, and tool traffic. If your eval only checks final correctness, you will miss that difference.

For example, two models can both answer a milestone-comparison task correctly. One may get there with a single targeted gh query. Another may issue several redundant issue and PR lookups before producing the same answer. Both runs may pass a correctness check, but they have very different cost, latency, and failure exposure.

This is the part of model drift that is easy to miss. It does not always show up as “the answer is wrong.” Sometimes it shows up as “the product is slower, more expensive, more fragile, or more dependent on retries.”

Result 3: tool adherence saturated

Tool adherence saturated across the clean model-behavior rows. That is good news. It means the models generally stayed on the allowed tool path in the harnessed run. It is also a warning.

Once a metric nearly saturates, a binary “did it use tools?” check stops telling you much.

The next question is tool discipline:

  • Did the model choose the smallest useful gh query?
  • Did it avoid unnecessary retries?
  • Did it handle pagination and empty states?
  • Did it transform JSON deterministically?
  • Did it use the tool when it should, instead of asking the user for help?

Tool adherence is a guardrail. It is not the whole eval.

Then we removed the harness

To check whether the harness itself was doing useful work, we ran the same 19 tasks through a raw baseline: direct model API calls with a small JSON protocol for requesting safe read-only gh commands.

This was not meant to compare pi.dev against raw API calls as products. It was a stress test: how much behavior changes when the task stays fixed but the agent scaffold is reduced?

This comparison is intentionally imperfect. The raw runner still had access to GitHub, but it did not get the full agent harness and received less scaffolding around how to ask for tools. That makes it useful as a stress test, not a final verdict on raw model calls.

Model | Pi correctness | Raw correctness | Pi tool adherence | Raw tool adherence | Raw runner errors
Sonnet 4.6 | 85.1% | 77.2% | 100.0% | 98.9% | 0
Sonnet 4.0 | 83.5% | 80.4% | 100.0% | 96.8% | 0
Sonnet 4.5 | 83.4% | 77.7% | 100.0% | 92.6% | 0
Gemini 3 Flash | 82.3% | 73.6% | 100.0% | 98.4% | 8
Gemini 3.1 Pro | 81.1% | 71.7% | 100.0% | 90.5% | 4
GPT-5.5 | 80.5% | 74.0% | 100.0% | 94.7% | 1
GPT-5 | 79.6% | 72.5% | 100.0% | 99.5% | 0

Raw correctness was lower for every model in this run. The drop was smallest for Sonnet 4.0 (3.1 points) and largest for Gemini 3.1 Pro (9.4), Gemini 3 Flash (8.7), Sonnet 4.6 (7.9), and GPT-5 (7.1). The raw runner also exposed a different failure mode: 13 task attempts exceeded the eight-round tool limit, 12 of them on Gemini models.
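
The per-model drops can be recomputed directly from the table above. This is a small worked example using only the reported numbers; the variable names are ours.

```python
# (harnessed correctness %, raw correctness %, raw runner errors), from the table above
results = {
    "Sonnet 4.6": (85.1, 77.2, 0),
    "Sonnet 4.0": (83.5, 80.4, 0),
    "Sonnet 4.5": (83.4, 77.7, 0),
    "Gemini 3 Flash": (82.3, 73.6, 8),
    "Gemini 3.1 Pro": (81.1, 71.7, 4),
    "GPT-5.5": (80.5, 74.0, 1),
    "GPT-5": (79.6, 72.5, 0),
}

# Correctness lost when the harness is removed, in percentage points.
drops = {model: round(pi - raw, 1) for model, (pi, raw, _) in results.items()}
largest_drop = max(drops, key=drops.get)
smallest_drop = min(drops, key=drops.get)

# Raw runner errors, overall and for the Gemini models.
total_errors = sum(errors for _, _, errors in results.values())
gemini_errors = sum(
    errors for model, (_, _, errors) in results.items() if model.startswith("Gemini")
)

for model, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{model}: -{drop} points")
```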

That does not mean raw model calls are always bad, or that this harness is better for every workload. It means this benchmark behaved more reliably when the model operated inside a fuller agent harness. One likely reason is that the harness reduced the amount of protocol design each model had to rediscover on every task.

What this means

The result is consistent with the stable harness hypothesis. It does not prove that the harness caused all of the stability we observed.

The careful version: In this read-task benchmark, a fixed pi.dev plus gh skill harness kept seven model targets in a relatively tight correctness band. It did not eliminate drift. Model choice still mattered, and much of the visible spread showed up in operational behavior: latency, tool calls, timeouts, recovery, and command efficiency.

That is still useful.

For product teams, model upgrades should be treated like migrations, not vibes. You need to know whether the system still behaves acceptably after the model changes. That means measuring both final-answer quality and the path the agent took to get there.

If you measure only correctness, you may conclude the models are interchangeable. If you measure only latency, you may miss that a slower model is more complete. The useful view is both:

Did the model get the job done, and what did the system have to do to make that happen?

What to do before you swap models

The practical lesson is to treat model swaps like migrations.

  • Freeze the harness: same tools, prompts, fixture, dataset, and scoring.
  • Run enough repeats to separate signal from noise.
  • Score both final-answer quality and operational behavior.
  • Inspect task-level failures instead of stopping at the average.
  • Use the failures as the migration plan before changing production traffic.
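
A checklist like this can be turned into a simple gate. The sketch below compares a candidate model slice against a baseline on both answer quality and operational behavior. The thresholds (a 2-point correctness drop, 1.5x latency, 2x tool calls) are hypothetical product bars, not recommendations; the example numbers come from the harnessed results table earlier in this post.

```python
from dataclasses import dataclass


@dataclass
class ModelSlice:
    name: str
    correctness: float      # percent of task attempts scored correct
    avg_latency_s: float    # average seconds per task
    avg_tool_calls: float   # average tool calls per task


def migration_failures(
    candidate: ModelSlice,
    baseline: ModelSlice,
    max_correctness_drop: float = 2.0,   # hypothetical bar, in percentage points
    max_latency_ratio: float = 1.5,      # hypothetical bar
    max_tool_call_ratio: float = 2.0,    # hypothetical bar
) -> list[str]:
    """Return the reasons a candidate misses the bar; an empty list means it passes."""
    failures = []
    drop = baseline.correctness - candidate.correctness
    if drop > max_correctness_drop:
        failures.append(f"correctness dropped {drop:.1f} points")
    latency_ratio = candidate.avg_latency_s / baseline.avg_latency_s
    if latency_ratio > max_latency_ratio:
        failures.append(f"latency is {latency_ratio:.2f}x baseline")
    tool_ratio = candidate.avg_tool_calls / baseline.avg_tool_calls
    if tool_ratio > max_tool_call_ratio:
        failures.append(f"tool calls are {tool_ratio:.2f}x baseline")
    return failures


baseline = ModelSlice("Sonnet 4.6", 85.1, 9.5, 1.45)
candidate = ModelSlice("Sonnet 4.5", 83.4, 16.1, 1.84)
print(migration_failures(candidate, baseline))
```

With these assumed bars, Sonnet 4.5 passes on correctness and tool calls but fails on latency, which is exactly the kind of difference a correctness-only eval would hide.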

Stable harnesses need stable measurements. When scores move, the tempting story is “the model changed.” Sometimes the task changed. Sometimes the fixture changed. Sometimes the judge was underspecified. If the task, fixture, or scoring changes underneath the model, you are no longer measuring model drift cleanly.

Can you swap models safely? Yes, sometimes. But only when the eval says the behavior still meets the product bar. A model card can tell you what changed in the model. Your traces, evaluator results, and experiment comparisons tell you what changed in your product.

The post What we learned testing 7 models under the same agent harness appeared first on Arize AI.

]]>
Building a self-improving agent on a context graph of human disagreement https://arize.com/blog/self-improving-agent-with-context-graph Tue, 19 May 2026 15:02:12 +0000 https://arize.com/?p=28493 You can build a measurably better agent from data you already have, without retraining a thing. The data is what your experienced humans do when they correct the AI. Capture...

The post Building a self-improving agent on a context graph of human disagreement appeared first on Arize AI.

]]>
You can build a measurably better agent from data you already have, without retraining a thing. The data is what your experienced humans do when they correct the AI. Capture those domain-knowledge-based corrections as a context graph, mine them for patterns, and the agent steadily matches what the humans actually do.

Every AI agent deployed inside an enterprise sometimes quietly disagrees with the humans running the same process. The written policy says one thing. The institutional knowledge sitting in Slack threads, hallway conversations, and the heads of long-tenure employees says another.

The agent is correct against policy and wrong against reality. The reviewer is right because they hold context the system of record doesn’t. Most teams treat the gap between them as noise. It’s the most useful signal in the system.


Cartoon of a customer at a spare-parts counter being told "I don't care if you can see it on the shelf, the AI agent says we don't have any"; the word "computer" is crossed out and replaced with "AI agent"

To prove it, we built a procurement agent. Ran 130 purchase requests through it. A simulated reviewer with years of institutional knowledge disagreed with the agent on 60 of those decisions: a 53.8% baseline match rate. After four cycles of mining the disagreements and feeding the patterns back as runtime config, the agent matched the reviewer on 108 of 130, an 83.1% rate. No source-code changes. No fine-tuning. Just structured capture of human overrides and a loop that reads them.

We covered the broader case for context graphs in an earlier post. This post shows what they look like when you build one.

What we built

We built three components for this demo:

  • Procurement agent: simulates a real-world procurement system managed by an agent. Users submit purchase requests; the agent reviews them against its process documents and approved vendor list. Traces go to Arize AX.
  • Vera Fye, a human reviewer: a simulated human reviewer with institutional knowledge. She reviews the agent’s decisions and overrides where necessary, and the overrides feed back through the procurement agent so they’re traced too.
  • Mining agent: a Claude Agent SDK tool wrapping agent skills that extract traces from Arize AX and propose improvements to the procurement agent.

Self-improvement loop diagram: procurement agent decides, human reviewer overrides; both flow as Arize traces and annotations (the context graph) into a mining step that produces a report of clusters and proposed diffs, which feed back as updated instructions to the agent

The demo is available for you to explore on GitHub.

The procurement agent

Screenshot of the procurement-agent UI: a sidebar of approved requests on the left, and a detail pane on the right for a $37,726 CloudBase Inc capacity-expansion request showing the agent's "Approved" decision and an override form

The procurement agent is a Python agent built with LangChain, with a Next.js UI. When the agent launches, it loads a set of process documents that detail the rules for the procurement process, such as which vendors are already approved, price thresholds, and so on.

At runtime the agent calls three tools (check_policy, lookup_vendor, check_budget) to gather context before deciding. Each vendor record has a status (preferred / approved / suspended / not_listed), categories, notes, and extension fields (cost-overrun factor, relationship credit) that default to inert values. Approvals tier by amount: auto-approve below $5K, manager approval up to $50K, VP approval above. Requests come from one of five departments (Engineering, Marketing, Sales, Customer Success, Security).
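
As a rough illustration of that data model, here is a sketch of a vendor record and the approval tiers. The class and field names are assumptions rather than the demo's actual code, and the post does not say which tier an exact $50K request falls into, so treating it as manager approval is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class VendorStatus(str, Enum):
    PREFERRED = "preferred"
    APPROVED = "approved"
    SUSPENDED = "suspended"
    NOT_LISTED = "not_listed"


@dataclass
class VendorRecord:
    name: str
    status: VendorStatus
    categories: list[str] = field(default_factory=list)
    notes: str = ""
    # Extension fields default to inert values until a mining cycle sets them.
    cost_overrun_factor: float = 1.0
    relationship_credit: Optional[str] = None


def approval_tier(amount_usd: float) -> str:
    """Map a request amount to its approval tier."""
    if amount_usd < 5_000:
        return "auto-approve"
    if amount_usd <= 50_000:   # assumption: "up to $50K" is inclusive
        return "manager"
    return "vp"
```

Keeping the extension fields inert by default is what lets later cycles change behavior through configuration alone, without touching the agent's source.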

The agent’s output is a recommendation (approve / reject / flag-for-review) plus a confidence level (high / medium / low), and all the traces for one request group into an Arize AX session.

Once the agent and UI are running, you can manually add procurement requests to be reviewed by the agent, or you can use the seed_requests script to upload 130 sample requests.

Every step in the agent flow is traced: the policies it followed, its reasoning, its final decision.

Arize trace view for purchase request PR-130: trace tree on the left showing the process.run root span with nested LangGraph, ChatAnthropic, and check_policy / lookup_vendor / check_budget tool spans; structured output on the right showing policy_compliance, policy_details, budget_status, and vendor_status fields

Based on the knowledge the agent has, here’s a summary of how the agent processes the sample data:

Recommendation | Count | %
Approved | 57 | 43.8%
Flagged for review | 42 | 32.3%
Rejected | 31 | 23.8%
Total | 130 | 100%

Vera Fye, the reviewer

If you’ve read Gene Kim’s The Phoenix Project, you remember Brent Geller. The senior engineer who knows how every system actually works, who handles every escalation, who is the bottleneck because all the operational knowledge lives in his head. The Phoenix Project plot exists because Brent is undocumented institutional memory walking around in a person.

Brents are real. Every company has them: the long-tenure engineer, the admin who knows everyone, the operations person you ask before doing X. The artifact is the same: rules and context that aren’t written anywhere readable live in this person’s head.

For our demo, Vera Fye simulates the company’s Brent. She carries the undocumented knowledge: which vendor’s CEO plays golf with the enterprise’s CEO, which companies have a history of billing disputes, which departments tend to panic-buy. When she overrides the procurement agent’s decision, the key isn’t the override itself; it’s that we capture it in the same trace session as the original agent decision.

Screenshot of the procurement-agent UI showing Vera Fye's override on a $20K Customer Success unfamiliar-vendor request: she approves, citing the customer-success-understaffed precedent and attaching conditions (90-day trial cap, full vendor approval before any renewal)

Arize trace view of the override.run span on PR-086: trace tree on the left, agent output on the right showing recommendation = approve with reasoning that explicitly applies the customer-success-understaffed precedent and the conditions Vera attached

Vera reviewed every request the agent flagged, made the call on each, overrode some of the agent’s clear-cut decisions, and flagged a handful of clean approvals that she felt needed VP sign-off. In total, she made 60 overrides out of 130 agent decisions:

Agent decision | Vera decision | Status | Baseline
approve | approve | unchanged | 44
approve | reject | changed | 10
approve | flag-for-review | changed | 3
reject | approve | changed | 5
reject | reject | unchanged | 26
reject | flag-for-review | changed | 0
flag-for-review | approve | changed | 27
flag-for-review | reject | changed | 15
flag-for-review | flag-for-review | unchanged | 0
Total agreements (unchanged) | | | 70
Total disagreements (changed) | | | 60

This means our agent agrees with Vera on 70 of 130 decisions, a 53.8% agreement rate. Not great.

Our traces now hold the original decision and the human review side by side. The institutional knowledge that used to live only in Vera’s head is now data we can mine.

The context graph

The architectural shift here is that we don’t have a logging table; we have a structured context graph:

  • Nodes: requests, decisions, vendors, departments, precedents, dollar bands.
  • Edges: the agent recommended X citing Y, the reviewer overrode to Z citing precedent P, vendor V is preferred over W under directive D.

Concretely, the graph lives in Arize AX. Every request becomes a session, every agent run is a process.run span, every Vera review is an override.run span, and the reviewer’s reasoning, precedent tag, and conditions sit on each override span as structured annotations.

There is no separate graph database. The graph is what you get when you join sessions to their annotations and treat each span as a node and each session-membership or annotation as an edge. The mining agent queries it back the same way: export spans by session, group by precedent, look for clusters.
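
Here is a minimal sketch of that join, assuming spans have already been exported as plain dictionaries. The dictionary keys and the `PR-900` and `PR-901` sessions are hypothetical; `PR-024` mirrors the example request in the diagram below. This is not the Arize AX export format.

```python
from collections import defaultdict

# Hypothetical exported spans: one process.run and one override.run per session.
spans = [
    {"session_id": "PR-024", "name": "process.run", "recommendation": "approve"},
    {"session_id": "PR-024", "name": "override.run", "decision": "reject",
     "precedent": "datastream-cost-overrun"},
    {"session_id": "PR-900", "name": "process.run", "recommendation": "reject"},
    {"session_id": "PR-900", "name": "override.run", "decision": "reject",
     "precedent": None},
    {"session_id": "PR-901", "name": "process.run", "recommendation": "approve"},
    {"session_id": "PR-901", "name": "override.run", "decision": "reject",
     "precedent": "datastream-cost-overrun"},
]


def join_sessions(spans: list[dict]) -> dict:
    """Join agent and reviewer spans into one record per session."""
    sessions = defaultdict(dict)
    for span in spans:
        record = sessions[span["session_id"]]
        if span["name"] == "process.run":
            record["agent"] = span["recommendation"]
        elif span["name"] == "override.run":
            record["reviewer"] = span["decision"]
            record["precedent"] = span.get("precedent")
    return dict(sessions)


def cluster_by_precedent(sessions: dict) -> dict:
    """Group sessions where the reviewer changed the decision, keyed by precedent."""
    clusters = defaultdict(list)
    for session_id, record in sessions.items():
        if "reviewer" in record and record["reviewer"] != record.get("agent"):
            clusters[record["precedent"]].append(session_id)
    return dict(clusters)


clusters = cluster_by_precedent(join_sessions(spans))
```

The session ID does the work of an edge here: anything that shares it is connected, and the precedent tag is what turns individual overrides into clusters.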

Context-graph diagram for a single request (PR-024, $4,500, Engineering, DataStream Analytics). The request node sits in the centre, linked to the agent's recommendation (approve, high confidence), the reviewer's decision (reject by Vera Fye), the vendor (DataStream Analytics, cost-overrun factor 1.4×), the department (Engineering, sandbags ~30%), and the precedent the reviewer cited (datastream-cost-overrun)

A row in a database tells you Vera said “reject”. The graph tells you Vera rejects DataStream Analytics requests under $5K when the justification doesn’t name a specific deliverable, because DataStream’s quotes inflate 40% in implementation. That second sentence is the asset. It’s a piece of decision logic the company has that none of its systems of record contain.

Mining the graph

130 sessions is enough to surface patterns in the human overrides. The job of the mining agent is to find them.

In this demo, the mining agent is an agent skill with a runner that invokes it via the Claude Agent SDK (or can be run manually through a coding agent like Claude Code). The skill leverages the Arize skills to extract traces and look for patterns.

Here are some of the patterns the mining agent surfaces.

Precedent | Count | Pattern | Why
marketing-panic-buy | 14 | always reject | Marketing keeps buying single-campaign tools that get abandoned after launch.
customer-success-understaffed | 14 | always approve | CS is chronically understaffed; their urgent requests are usually genuine.
datastream-cost-overrun | 14 | mostly reject | DataStream’s quotes inflate ~40% on implementation.
vertex-march-outage-goodwill | 8 | always approve | Vertex extended emergency pricing during the March outage; Vera carries it as credit.
cloudbase-cto-relationship | 7 | never reject | The CTO has a personal relationship with CloudBase’s CEO.
cfo-vendor-consolidation | 5 | reject (3), approve (2) | Q2 directive: consolidate to one vendor per category.

The first five rows are surfacing institutional knowledge. The process documents that the procurement agent uses don’t have any detail about the customer success team being understaffed, or the CTO’s relationship with CloudBase’s CEO. By mining the overrides, we’re converting that institutional knowledge into data that can be fed back to the agent.

The last row is also interesting. This directive is already returned by the check_policy tool on every call, yet the agent has the text in front of it and still ignores it. The mining report calls this out as a meta-finding:

“the policy is present in the tool output but invisible in the agent’s output”

Vera is catching documented process issues where the agent failed to follow its rules. That’s a different shape of fix: not “teach the agent something new” but “make the agent surface what it already knows.”

Fix the agent using the context graph

The mining agent is built on skills invoked via the Claude Agent SDK. It sits in a harness that extracts traces, mines patterns, and proposes updates to the system prompt instructions for the procurement agent and to the policy documents. No new tools are added; instead, the same lookups return richer data.

The ground truth for the loop comes from Vera’s overrides. Where the agent agreed with her, that’s a known-correct answer. Where she overrode, her decision is the correct one. Every variant we run gets scored against the same 130 requests, and the score is how often it matches Vera. We iterate until the score stops climbing.
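
The score itself is simple arithmetic over the decision matrix. As a worked example, this recomputes the baseline agreement rate from the baseline table above; the function and variable names are ours.

```python
# (agent decision, Vera decision) -> count, from the baseline table above
baseline = {
    ("approve", "approve"): 44,
    ("approve", "reject"): 10,
    ("approve", "flag-for-review"): 3,
    ("reject", "approve"): 5,
    ("reject", "reject"): 26,
    ("reject", "flag-for-review"): 0,
    ("flag-for-review", "approve"): 27,
    ("flag-for-review", "reject"): 15,
    ("flag-for-review", "flag-for-review"): 0,
}


def agreement_rate(counts: dict) -> float:
    """Fraction of requests where the agent's decision matches the reviewer's."""
    total = sum(counts.values())
    agreed = sum(n for (agent, reviewer), n in counts.items() if agent == reviewer)
    return agreed / total


print(f"{agreement_rate(baseline):.1%}")
```

Because every variant is scored against the same 130 requests, the rates are directly comparable from cycle to cycle.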

Rather than change the demo app’s source, our mining agent writes its proposed updates into an experiments folder, and the procurement agent picks them up at startup via an environment variable. Each cycle gets its own Arize AX project, and its own score against Vera’s baseline.
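
A startup hook along these lines is enough to support that pattern. This is a sketch under stated assumptions: the environment variable name `PROCUREMENT_EXPERIMENT_DIR`, the `instructions.md` filename, and the default text are all hypothetical, and the demo's real names may differ.

```python
import os
from pathlib import Path

# Hypothetical default used when no experiment is configured.
DEFAULT_INSTRUCTIONS = "Follow the process documents."


def load_instructions(
    env_var: str = "PROCUREMENT_EXPERIMENT_DIR",   # hypothetical variable name
    filename: str = "instructions.md",             # hypothetical file name
) -> str:
    """Return experiment instructions if configured, else the defaults."""
    experiment_dir = os.environ.get(env_var)
    if not experiment_dir:
        return DEFAULT_INSTRUCTIONS
    override = Path(experiment_dir) / filename
    if override.is_file():
        return override.read_text(encoding="utf-8")
    # A missing file falls back to defaults rather than failing at startup.
    return DEFAULT_INSTRUCTIONS
```

Pointing the variable at a different folder per cycle keeps each experiment's config isolated and easy to roll back.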

Cycle 1

We pointed the mining agent at the 130 baseline traces and Vera’s 60 overrides. It proposed the following changes:

  • Two procedural rules added to the agent’s instructions. Cite the Vendor Consolidation Directive by name whenever the vendor comes back as not listed, and drop confidence to medium whenever the recommendation is flag-for-review.
  • Vendor policy updates. Set a 1.4× cost-overrun factor on DataStream (reject-by-default with three carve-outs), a “March-outage goodwill” credit on Vertex (approve in the $5K to $50K manager-approval band), and a “never reject, flag-don’t-reject” CTO-relationship credit on CloudBase.
  • Department policy updates. Marketing and Sales get a panic-buy reject rule: when the request is a tool, license, or plugin tied to a single campaign with no recurring-revenue case, reject regardless of amount or vendor status. Customer Success gets an understaffed-approve rule: their urgent requests are usually genuinely urgent, so approve in the manager-approval band when urgency is urgent or emergency, or when the justification mentions understaffing or a migration. Engineering gets a sandbag-floor: their quoted amounts run ~30% low in practice, so treat the figure on the request as a lower bound rather than the actual spend.

These suggestions need a human review pass before they ship to production. Some belong in the permanent process documents. Others, like the note that Customer Success is currently understaffed, capture this quarter’s reality and probably belong in a time-boxed config rather than the long-term policy.

Running the agent with these updates gives a better result:

Agent decision | Vera decision | Status | Baseline | Cycle 1 | Δ
approve | approve | unchanged | 44 | 55 | +11
approve | reject | changed | 10 | 2 | −8
approve | flag-for-review | changed | 3 | 1 | −2
reject | approve | changed | 5 | 8 | +3
reject | reject | unchanged | 26 | 42 | +16
reject | flag-for-review | changed | 0 | 0 | 0
flag-for-review | approve | changed | 27 | 13 | −14
flag-for-review | reject | changed | 15 | 7 | −8
flag-for-review | flag-for-review | unchanged | 0 | 2 | +2
Total agreements | | | 70 | 99 | +29
Total disagreements | | | 60 | 31 | −29

Our agent now agrees with Vera on 99 of 130 decisions, or 76.2%. Not perfect, but closer. Nothing about the agent’s source code changed. We just gave it back what Vera knew.

Cycles 2, 3, and 4

We ran the same cycle three more times. Each pass moved the agent closer to Vera’s baseline, but the moves shrank: cycle 2 added about 6 percentage points, and cycles 3 and 4 added less than one point between them. The agreement curve flattened around 83%, which is the diminishing-returns shape any loop of this kind settles into.

What changed across the four cycles was the shape of the work. Cycle 1 was teaching: handing the agent rules it didn’t have. The biggest single jump in the next three cycles, cycle 2’s gain of about 6 points, wasn’t a new rule at all. It was an explicit allowlist telling the agent when it was allowed to flag for review. That one change resolved 22 of the 31 over-flag disagreements from cycle 1. The agent already had enough information to make those calls without escalating. It just lacked permission.

The standard instinct when an agent is wrong is to add more context, more rules, more examples. The fix here was the opposite. From cycle 2 on, the work was tuning rather than teaching: sharpening the edges of rules that were already there. An over-cautious agent doesn’t need more rules. It needs to know when it’s allowed to decide.

Line chart of agreement rate across cycles: 53.8% at baseline, climbing steeply to 76.2% at cycle-1, then plateauing around 82 to 83% through cycle 4

The general pattern

Strip out the procurement specifics. The same seven steps ran every cycle:

  1. Run your agent, and get humans to review with their institutional knowledge
  2. Capture the full context graph: agent decision, its reasoning, the human override, the reviewer’s reasoning, all as traces in the same session
  3. Extract the traces for both the agent and the human review
  4. Look for patterns in the agreements and disagreements
  5. Update the agent in a sandbox and re-run it
  6. Validate against the previous human reviews, or run another review pass
  7. Iterate until the curve flattens
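
Step 7 needs a stopping rule. One simple option, sketched below, is to stop once the last few cycles each gained less than a threshold. The one-point threshold and two-cycle patience are assumptions. In the example history, the baseline, cycle 1, and cycle 4 rates are the reported figures; the cycle 2 and cycle 3 values are approximations inferred from the description above, not reported numbers.

```python
def should_stop(history: list[float], min_gain: float = 1.0, patience: int = 2) -> bool:
    """Stop when each of the last `patience` cycles gained less than `min_gain` points.

    `history` holds agreement rates in percent, starting with the baseline.
    """
    if len(history) <= patience:
        return False
    start = len(history) - patience
    recent_gains = [history[i] - history[i - 1] for i in range(start, len(history))]
    return all(gain < min_gain for gain in recent_gains)


# Baseline, cycle 1, cycle 2 (approx.), cycle 3 (approx.), cycle 4
agreement_history = [53.8, 76.2, 82.2, 82.7, 83.1]

for cycle in range(1, len(agreement_history)):
    seen_so_far = agreement_history[: cycle + 1]
    print(f"after cycle {cycle}: stop={should_stop(seen_so_far)}")
```

Requiring more than one flat cycle guards against stopping on a single noisy run.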

The agent is unlikely to ever hit 100% alignment with a competent human. Some of that ceiling is irreducible: humans aren’t perfectly consistent with themselves either. In our demo, the simulated reviewer Vera flipped direction on the CloudBase cluster between cycles, citing the same precedent both times. The judgment calls at the edges don’t compress cleanly into rules, no matter how many cycles you run.

Push too hard on one sample set and you overfit. The honest version of this loop is continuous, not one-and-done. Vendors get acquired, policies change, and last quarter’s institutional knowledge becomes this quarter’s open question. The loop has to run against fresh traces on a cadence, or you’re just measuring a snapshot.

The pattern holds anywhere an agent runs against a policy and humans run against reasons: sales lead routing, support triage, content moderation, fraud review.

The seven steps above are what it looks like to treat the gap between policy and reality as the signal it is. Most teams collect the same data and throw it away.

Don’t throw your data away

The full demo is at github.com/Arize-ai/context-graphs. Set up an Arize AX account, run it end-to-end, and watch your own version of the curve climb.

If you’re shipping agents into a real workflow where humans review them, you have the raw material already. The override sitting in a Slack thread, the comment correcting the AI’s suggestion, the manual fix after the recommendation, those are signals your system gives you for free. Structure them as traces, mine them for clusters, and the curve climbs without retraining or fine-tuning anything. The hard part isn’t the data. It’s deciding to treat it as data.

]]>
Coding agent tracing and evaluation: An open source tool to improve AI coding workflows https://arize.com/blog/open-source-coding-agent-tracing/ Mon, 18 May 2026 14:15:03 +0000 https://arize.com/?p=28473 Announcing coding harness tracing for observing, evaluating, and improving coding agent workflows across Claude Code, Cursor, Codex, GitHub Copilot, and Gemini CLI.

The post Coding agent tracing and evaluation: An open source tool to improve AI coding workflows appeared first on Arize AI.

]]>
Coding agents are useful, but they’re still hard to debug.

You give Claude Code a task. It reads files, edits code, runs commands, retries after errors, and eventually returns a patch. Sometimes it works. Sometimes it fails in a subtle way. Either way, it’s hard to see what actually happened.

Today we’re launching coding-harness-tracing, an open source project for tracing coding agent workflows across Claude Code, Cursor, Codex, GitHub Copilot, and Gemini CLI.

With harness tracing, you can inspect each run step by step: files read, tools called, commands run, retries, token usage, latency, and final outputs. From there, you can compare prompts, find wasteful workflows, build reusable skills, and measure which models or harnesses produce better results on your codebase.

Traces can be sent to Arize AX or Phoenix for inspection, replay, evaluation, experiments, and dashboards.

With coding harness tracing, you can now address those opportunities on the foremost agent development platform in the world.

TLDR
Coding agents are still hard to debug. We’re releasing an open-source tracing tool that lets developers inspect and evaluate coding-agent workflows across Claude Code, Cursor, Codex, Copilot, and Gemini CLI — including prompts, tool calls, retries, latency, token usage, and errors — so teams can systematically improve AI coding workflows.

What is coding harness tracing?

The open-source coding-harness-tracing project is free and instruments coding harnesses such as Claude Code, Codex, Cursor, GitHub Copilot, and Gemini CLI so you can capture agent steps, tool calls, prompts, responses, latency, token usage, and errors.

(We have documentation for getting started with your coding agent of choice.)

Screenshot of a tracing dashboard showing coding-agent projects being monitored and evaluated. The interface lists projects including Claude Code, Gemini, Codex, Copilot, and Cursor, along with trace volume, tags, and creation dates. A “New Tracing Project” button appears in the top right.

Those traces can be sent to either the Arize AX platform or a self-managed instance of Phoenix, where you can inspect runs, build datasets, run experiments, evaluate behavior, replay agent sessions, and create dashboards for tracking improvements over time.

What you can inspect in a trace

A trace turns a coding agent session into a sequence of observable steps:

  • the original user prompt
  • model responses
  • file reads and edits
  • shell commands
  • tool calls
  • MCP server interactions
  • errors, retries, and dead ends
  • latency, token usage, and estimated cost

That helps you move beyond the basic question of “did the agent work?” and ask:

  • Did it read the right files before editing?
  • Did it run tests after making changes?
  • Which tool calls were unnecessary?
  • Where did it repeat itself?
  • Did a different model or prompt reduce retries?
  • Did the workflow improve across multiple runs?
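
Some of these questions can be answered mechanically once a session is a list of steps. The sketch below assumes a simplified step schema (`type`, `path`, `command`) and a hypothetical list of test-command markers; it is not the span format that coding-harness-tracing emits.

```python
def edits_without_prior_read(steps: list[dict]) -> list[str]:
    """Return paths that were edited without being read earlier in the session."""
    seen = set()
    unread = []
    for step in steps:
        if step["type"] == "file_read":
            seen.add(step["path"])
        elif step["type"] == "file_edit" and step["path"] not in seen:
            unread.append(step["path"])
    return unread


def ran_tests_after_last_edit(
    steps: list[dict],
    test_markers: tuple = ("pytest", "npm test"),   # hypothetical markers
) -> bool:
    """True if a test command ran after the final file edit.

    If the session has no edits, any test command in the session counts.
    """
    edit_indexes = [i for i, step in enumerate(steps) if step["type"] == "file_edit"]
    last_edit = max(edit_indexes, default=-1)
    return any(
        step["type"] == "shell" and any(m in step["command"] for m in test_markers)
        for step in steps[last_edit + 1:]
    )


# Illustrative session: one file is edited blind, and tests run before the last edit.
session = [
    {"type": "file_read", "path": "app/models.py"},
    {"type": "file_edit", "path": "app/models.py"},
    {"type": "shell", "command": "pytest tests/"},
    {"type": "file_edit", "path": "app/views.py"},
]
```

Checks like these are cheap to run across many sessions, which makes them a reasonable first layer before an LLM judge.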

Improving your coding agent workflow

Individual developers can use traces to understand how their coding agents work across real tasks. This is especially useful when comparing prompts, models, skills, tools, and MCP servers. By capturing every step the agent takes, coding harness tracing makes it possible to answer questions like:

  • Which of the tool calls are actually necessary?
  • Where is ambiguity causing my agent to hallucinate or bleed tokens?
  • What repeated workflows can become reusable skills?
  • Which coding harness/model combination performs best on correctness, latency, and token usage?
  • Where should you add instructions, tests, or guardrails?

Diagram showing a workflow for tracing, evaluating, and optimizing coding-agent skills using Claude Code and Arize AX. The flow illustrates harness traces, evaluations, datasets, experiments, and skill optimization loops that compare skill versions, identify poor-performing workflows, and deploy improved skills back into the coding-agent environment via GitHub pull requests.

In Arize, you can collect datasets from specific operations or tool calls, such as shell commands, file edits, or command executions, then run side-by-side experiments to compare workflow changes. For example, you might collect traces from several bug-fix tasks and compare:

  • Claude Code vs. Cursor on the same task set
  • a short prompt vs. a structured prompt
  • with and without a test-running MCP server
  • with and without a project-specific skill

This is a relatively new and unexplored domain, so these are just examples. The insights you get from tracing data depend on how you use the tools. (For instance, you can track the hypothetical cost of usage against token pricing, since Arize hooks into the underlying APIs and pricing tables.)

From there, dashboards can help track practical metrics over time: latency, token usage, tool-call volume, estimated cost, retries, and correctness.

Screenshot of an analytics dashboard for coding-agent workflows showing request volume, token usage, response counts, and estimated cost over time. Charts visualize tool-call frequency and tool-call relevance across actions such as file reads, Bash commands, edits, searches, and agent interactions for Claude Code sessions.

From individual workflows to team patterns

Once multiple engineers trace coding-agent sessions into the same project, teams can start identifying shared workflow patterns:

  • which prompts or skills consistently work
  • which workflows waste tokens or time
  • where agents fail across similar tasks
  • which tools improve correctness
  • which practices should become team-wide defaults

This is most useful when treated as an engineering feedback loop within a DevEx practice, with goals like improving shared workflows, building reusable skills, and increasing evaluation coverage.

Diagram illustrating the prompt and context layer for coding-agent workflows. The chart maps inputs developers control, traces and spans emitted during agent execution, evaluation methods applied to traces, and feedback-loop actions used to improve prompts, skills, tools, and workflows over time using observability and evaluation data.

Coding agents are becoming part of the software development stack. That means they need the same engineering discipline as any other system developers rely on: observability, evaluation, experimentation, and iteration.

With coding-harness-tracing, developers can move beyond anecdotal “this prompt felt better” debugging and start improving coding-agent workflows with traces, datasets, experiments, and dashboards.

Get started

Start by tracing one real workflow: a bug fix, refactor, test-generation task, or documentation update. Then inspect the trace, identify one failure mode, and run the same task again with a changed prompt, model, skill, or tool configuration.

Setting things up takes minutes. Documentation is available for each supported harness:

Want to try it yourself? Explore coding-harness-tracing on GitHub

The post Coding agent tracing and evaluation: An open source tool to improve AI coding workflows appeared first on Arize AI.

]]>
How we use Alyx to build Alyx: How to build an AI agent feedback loop https://arize.com/blog/ai-agent-feedback-loop-arize-alyx/ Wed, 13 May 2026 14:58:08 +0000 https://arize.com/?p=28424 How Arize uses Alyx to debug Alyx: searching dense traces, aggregating failures, triaging dogfooding issues, and closing the AI engineering feedback loop.

The post How we use Alyx to build Alyx: How to build an AI agent feedback loop appeared first on Arize AI.

]]>
For the past two and a half years, we’ve been building Alyx with a singular goal: create the Cursor experience for AI engineers. Alyx is an agent that helps teams make their agents better. Being AI engineers building Alyx ourselves gives us a unique advantage—we’re building the tools we actually want and need, not just features that demo well.

Every day, we use Alyx to build Alyx. Alyx helps us analyze, debug, evaluate, and iterate on Alyx at a much faster rate.


Why manual trace debugging breaks down

When you’re developing an AI agent, your workflow typically looks something like this: you have your agent UI on one screen and your traces on another. When something goes wrong, you dive into the trace to understand what happened.

The problem with debugging traces manually comes from the same thing that makes traces so powerful: how information dense they are. While the traces we showed in our demo might look intricate, they’re actually on the simpler side. Some of our customers have traces with hundreds of spans, with hundreds of JSON entries per span. Expecting anyone to manually comb through that data to diagnose an issue is unrealistic.

That’s exactly why we built the Alyx Trace Debugger.

Workflow 1: Search across dense traces

Diagram showing how Alyx helps engineers navigate information-dense AI agent traces. On the left, a single trace contains 70 spans, nested JSON, and large LLM and retrieval spans with hundreds of thousands of characters. On the right, Alyx tools including find_in_trace, get_span_data, jq/grep-json, and get_trace_preview enable regex search, structured JSON queries, full span inspection, and compressed trace summaries. The diagram highlights common debugging questions Alyx helps answer, including prompt usage, tool arguments, expected outputs, error origins, and value flow across tool calls.

A single trace from one of our Alyx sessions contains 70 spans and weighs in at over 10MB of JSON. One LLM span alone — a single orchestrator iteration — is 200,000 characters: 62,000 characters of input messages, 50,000 characters of prompt template, and 75,000 characters of tool definitions. The largest span in that trace, a data retrieval call, is 961,000 characters by itself. Across all 70 spans, there are 40 distinct non-null attributes per span — things like attributes.llm.input_messages, attributes.llm.tools, attributes.metadata, eval.arnav-tool-calling-task-.explanation, token counts, cost breakdowns, graph node IDs, session IDs, and more.

This is what “information dense” actually means in practice. It is not an abstract complaint. A developer staring at this trace in a UI is looking at megabytes of nested JSON. Even with a good tree view, finding the thing you care about requires knowing where to look.

The Alyx traces agent has tools designed specifically for this: find_in_trace performs a regex search across every column of every span in a trace and returns matching cells with their span IDs and column names — like Ctrl+F across a spreadsheet. get_span_data pulls the full attribute set for a specific span. jq and grep_json run structured queries against any JSON blob stored in memory. And get_trace_preview gives a compressed overview of the entire trace with latency contributions calculated for each span.

The kinds of questions this handles:

  • “What system prompt was this agent using?” In a trace with 22 LLM iterations, the system prompt is embedded inside the attributes.llm.input_messages of each LLM span. That prompt is 50,000 characters long. Alyx can use find_in_trace with a search term — say, the name of a specific instruction or a known phrase — and pinpoint the exact span and offset where it appears. Without this, you are manually expanding each LLM span and scrolling through the messages array.
  • “Which tool was called with what arguments?” In the 70-span trace, there are 47 tool spans. Alyx ran jq queries like [.spans[] | select(.["attributes.openinference.span.kind"] == "TOOL") | .name] | unique to extract the distinct tool names (home_page_agent, documentation_search, link_to_page, finish) across multiple traces. Getting this by hand means opening each tool span individually.
  • “What did the agent output at iteration N?” When debugging multi-turn agent behavior, you often need to see what the agent said or did at a specific step. Alyx can query specific iteration spans to extract text content, tool calls, and their arguments — all without navigating a deep span tree manually.
  • “Where does this specific error message appear?” find_in_trace searches all columns, including event.attributes (which contains exception tracebacks), attributes.output.value, and attributes.error.message. If you know a fragment of an error string, Alyx will return every span and column where it shows up.
  • “How did a value propagate across tool calls?” In agent traces, the output of one tool often becomes the input to the next. Tracing how a specific value — a span ID, a filter string, a dataset name — moves through the execution graph normally requires cross-referencing multiple spans. Alyx can search for that value across all spans and show you the chain.
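To make the search idea concrete, here is a minimal sketch of a regex search across every column of every span, plus a Python equivalent of the jq query quoted above. This is not Alyx's implementation: the flattened `spans` list, the sample values, and the function names `search_spans` and `distinct_tool_names` are assumptions made for illustration.

```python
import re
from typing import Any

# Hypothetical flattened trace: one dict per span, keyed by column name.
spans = [
    {"span_id": "a1", "name": "orchestrator",
     "attributes.openinference.span.kind": "LLM",
     "attributes.llm.input_messages": "You are a helpful agent. Always cite sources."},
    {"span_id": "b2", "name": "documentation_search",
     "attributes.openinference.span.kind": "TOOL",
     "attributes.output.value": "TimeoutError: upstream search timed out"},
    {"span_id": "c3", "name": "finish",
     "attributes.openinference.span.kind": "TOOL",
     "attributes.output.value": "done"},
]


def search_spans(spans: list[dict[str, Any]], pattern: str) -> list[tuple[str, str, int]]:
    """Return (span_id, column, offset) for every cell matching the regex."""
    regex = re.compile(pattern)
    hits = []
    for span in spans:
        for column, value in span.items():
            match = regex.search(str(value))
            if match:
                hits.append((span["span_id"], column, match.start()))
    return hits


def distinct_tool_names(spans: list[dict[str, Any]]) -> list[str]:
    """Unique names of TOOL spans, sorted (jq's `unique` also sorts)."""
    kind = "attributes.openinference.span.kind"
    return sorted({s["name"] for s in spans if s.get(kind) == "TOOL"})


print(search_spans(spans, r"TimeoutError"))
print(distinct_tool_names(spans))
```

Returning the span ID and column name with each hit is what turns a plain text search into something you can follow up on, for example by pulling the full attributes for just that span.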

The point is not that these questions are impossible to answer manually. They are not. The point is that when a single span is 200K characters and there are 70 of them, the time cost of manually navigating that data is high enough that you often just don’t bother asking the question. Alyx removes that friction.

(Side note: we used Alyx itself to find the specific traces referenced in this blog post. We asked it to search for interesting traces with high span counts and diverse tool usage, and it returned the ones we used as examples here. That is the kind of retrieval task where the time savings compound.)

Workflow 2: Trace aggregation and multi-level analysis

Diagram illustrating Alyx’s cross-trace pattern discovery workflow for debugging AI agents at scale. The pipeline has four stages: Aggregate, where SQL-style grouping identifies patterns across traces; Categorize, where LLMs cluster free-text errors into categories; Build Dataset, where matching traces are converted into labeled datasets for analysis; and Experiment, where prompt or system changes are tested against the dataset. A parallel annotations workflow supports bulk labeling of traces by latency or quality. The diagram emphasizes an iterative workflow for finding patterns, understanding failures, isolating examples, and validating fixes.

Single-trace inspection is useful when you know which trace has the problem. But most of the time, you have a time range, a vague sense that something is off, and a need to figure out where to look.

Aggregation and grouping. Alyx’s compute_aggregations tool works like a SQL GROUP BY over trace data. In one session, we asked: “what is the average cost of my traces in the last month?” Result: $0.023 per trace. One tool call, under 15 seconds. The real value shows up with grouping — “average latency by model name,” “error count by span kind,” “token usage by agent type.” In the home page agent analysis, Alyx ran three consecutive aggregations: root spans by status_code (38 OK, 1 UNSET), child tool spans by name and status_code (all 988 child spans UNSET), then error spans by name. The pattern discovery that took 5 minutes of automated analysis would have taken much longer scrolling through a table.
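The GROUP BY analogy can be sketched in a few lines of plain Python. The rows, the model names, and the helpers `count_by` and `mean_by` are hypothetical; the sketch shows only the shape of the computation, not how compute_aggregations works internally.

```python
from collections import Counter, defaultdict

# Hypothetical span rows; the field names mirror the ones mentioned above.
spans = [
    {"name": "home_page_agent", "status_code": "OK", "latency_ms": 1200, "model": "model-a"},
    {"name": "home_page_agent", "status_code": "OK", "latency_ms": 800, "model": "model-b"},
    {"name": "home_page_agent", "status_code": "UNSET", "latency_ms": 3000, "model": "model-a"},
]


def count_by(rows, key):
    """Equivalent of COUNT(*) ... GROUP BY key."""
    return Counter(row[key] for row in rows)


def mean_by(rows, key, value):
    """Equivalent of AVG(value) ... GROUP BY key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


print(count_by(spans, "status_code"))
print(mean_by(spans, "model", "latency_ms"))
```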

Semantic categorization. Not all fields are low-cardinality strings you can GROUP BY. Error messages and user inputs are free text. extract_categories samples text from a column and uses an LLM to identify 3-10 mutually exclusive categories. assign_categories then classifies every span into those categories via batched LLM calls. This is what powers error triage: 68 error traces with verbose exception strings get clustered into actionable categories like “GraphQL Object Not Found” and “Content Policy Violation” with counts for each.
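The assign-and-count step can be sketched with a pluggable classifier. In the sketch below, a keyword function stands in for the batched LLM calls so the example runs offline. The error strings, the `CATEGORIES` list, and both function names are assumptions for illustration, not the behavior of extract_categories or assign_categories.

```python
from collections import Counter
from typing import Callable

# Hypothetical error strings standing in for verbose exception tracebacks.
errors = [
    "GraphQLError: Object with id 'abc' not found",
    "GraphQLError: Object with id 'xyz' not found",
    "ProviderError: request blocked by content policy",
]

# Closed set of mutually exclusive categories, as if produced by an earlier step.
CATEGORIES = ["GraphQL Object Not Found", "Content Policy Violation", "Other"]


def keyword_classifier(text: str) -> str:
    """Stand-in for an LLM call; a real system would prompt a model instead."""
    lowered = text.lower()
    if "not found" in lowered:
        return "GraphQL Object Not Found"
    if "content policy" in lowered:
        return "Content Policy Violation"
    return "Other"


def assign_and_count(texts: list[str], classify: Callable[[str], str]) -> Counter:
    """Assign each text to exactly one category and return counts per category."""
    counts = Counter()
    for text in texts:
        label = classify(text)
        if label not in CATEGORIES:  # fold unexpected labels into "Other"
            label = "Other"
        counts[label] += 1
    return counts


print(assign_and_count(errors, keyword_classifier).most_common())
```

Passing the classifier in as a callable keeps the counting logic testable without a model in the loop.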

Building datasets from traces. Once you’ve identified an interesting subset of traces, create_dataset_from_spans moves them into a dataset — the bridge between trace analysis and experimentation. You find 50 traces where the agent handled a query type poorly, create a dataset from them, then test prompt changes with run_experiment.

Annotations. We asked Alyx to “create an annotation config for speed/latency and annotate my last 10 traces.” It created a categorical config with Fast/Medium/Slow labels, examined the latency values, and applied annotations to each trace — all in a single conversation turn. Useful for building labeled datasets for evaluation or flagging traces that need human review.
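A categorical latency annotation comes down to bucketing a number into labels. The thresholds and trace IDs below are hypothetical; the post does not say which cutoffs Alyx chose.

```python
# Hypothetical thresholds in seconds; choose values that fit your own traces.
FAST_BELOW_S = 2.0
SLOW_ABOVE_S = 10.0


def latency_label(latency_s: float) -> str:
    """Map a latency to one of the categorical labels Fast/Medium/Slow."""
    if latency_s < FAST_BELOW_S:
        return "Fast"
    if latency_s > SLOW_ABOVE_S:
        return "Slow"
    return "Medium"


# Hypothetical trace IDs and latencies in seconds.
traces = {"trace-1": 0.8, "trace-2": 4.5, "trace-3": 22.0}
annotations = {trace_id: latency_label(s) for trace_id, s in traces.items()}
print(annotations)
```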

The pipeline. These tools compose: aggregate to find patterns, categorize to understand them, build a dataset to isolate them, experiment to fix them. Each step feeds naturally into the next, and Alyx holds the context of what it found at each step to carry forward.

Workflow 3: Triage dogfooding failures from trace IDs

Diagram showing how Alyx reduces the debugging burden during internal AI agent dogfooding. Engineers submit trace IDs and expected behavior without needing to provide root-cause analysis. Alyx automatically analyzes each trace by inspecting spans, reading exception tracebacks, and categorizing errors across traces. The system aggregates failures into prioritized fix clusters, such as “Object Not Found” and “Content Policy Violation,” with counts to guide triage. The bottom section contrasts the previous workflow, where every engineer debugged issues individually over hours, with a new batched workflow where a small team triages issues using Alyx in minutes.

We recently ran a dogfooding session with almost our entire engineering team using Alyx. The setup was simple: use the product, and when something breaks, log the trace ID and what you were trying to do. That was it — no debugging, no root cause analysis, no “here’s what I think went wrong.” Just the trace ID and a description of the expected behavior.

This was deliberate. Debugging is expensive engineer time. Logging a trace ID is not. We could have asked every engineer to dig into their own failures, but that would have meant pulling people away from their actual work to do ad hoc trace analysis. Instead, we collected the trace IDs and handed the debugging to Alyx.

What we were looking at afterward: 68 traces with errors from production Alyx. For each one, we could ask Alyx directly — “what went wrong in this trace?” — and it would walk through the spans, examine the tool calls and LLM outputs, read the exception tracebacks, and explain the failure. For a single trace, that is already faster than doing it by hand. Across 68 traces, the difference is hours versus minutes.

At the aggregate level, we asked Alyx to categorize the errors. It identified which column contained the error data (event.attributes), used semantic parsing to read the verbose exception tracebacks (these are not clean strings you can GROUP BY), extracted meaningful categories, and returned counts. “Object Not Found” errors (GraphQL-related) dominated and became the immediate fix target. Content policy violations — where our LLM provider was flagging some of our own prompts as potential jailbreak attempts — surfaced as a separate category.

The key insight is operational: you don’t need every engineer who hits a bug to also be the one who diagnoses it. You need them to capture the trace ID. The investigation can happen later, by fewer people, with Alyx doing the trace-level analysis. That changes the economics of dogfooding from “everyone debugs their own issues” to “everyone reports, a small team triages.”

What this changes about agent development

This workflow, from trace analysis to root-cause identification to fix, still involves some manual handoffs. You go to the UI, find the agent’s traces, debug their execution, iterate on and optimize your prompts, and run multiple evals. Meanwhile, coding agents like Cursor and Claude Code have their own unique data (your code) and tools for working with it. We’re working toward a future where both agents can talk to each other.

Key takeaways

  • Traces are too dense for manual inspection at scale. A single span can be 200K characters of JSON. When you have 70 of those in a trace, and hundreds of traces in a time range, you need tools that can search, query, and aggregate across that data programmatically.
  • Aggregate first, then drill down. The most efficient debugging workflow is not opening traces one by one. It is computing aggregations across your traces to find patterns, categorizing errors semantically, and only then drilling into specific spans. Alyx’s tool chain is built around this sequence.
  • Separate reporting from debugging. In our dogfooding sessions, we learned that the highest-leverage setup is having many people report failures (just a trace ID and what they expected) and a small team triage them with Alyx. You do not need every engineer who hits a bug to also diagnose it.
  • Dogfood with your own tools. We use Alyx to analyze Alyx’s traces, debug Alyx’s prompts, and categorize Alyx’s errors. This is not a philosophical stance — it is how we find the gaps. If our trace debugger cannot diagnose our own agent’s failures, it is not ready for anyone else’s.

Want to see how Arize helps teams build evals for AI agents?

Book a demo >

This is the fourth and final part of a four-part deep-dive series on how we built Alyx.

The previous posts are here if you missed them:

The post How we use Alyx to build Alyx: How to build an AI agent feedback loop appeared first on Arize AI.

]]>
Models got an order of magnitude better at following instructions in one year https://arize.com/blog/llm-instruction-following-benchmark-2026/ Tue, 12 May 2026 14:45:17 +0000 https://arize.com/?p=28407 A year ago, frontier models started losing track of instructions somewhere around 200–300 simultaneous constraints. With 2026 models, that ceiling is closer to 2,000 — an order-of-magnitude jump. We re-ran IFScale to see how, and how each model fails.

The post Models got an order of magnitude better at following instructions in one year appeared first on Arize AI.

]]>
At AI Engineer: Miami I was watching a talk by Dexter Horthy in which he mentioned as an aside that research showed that models have trouble following more than 150-200 instructions at once. That struck me as a really interesting fact, so I tracked down where he got it: the IFScale benchmark from Jaroslawicz et al. (2025). That paper is nearly a year old, so I wondered how much better models have become since then. The answer: a whole lot better. In fact, an order of magnitude better.

Let’s not beat around the bush though. Here’s the data:

IFScale-2026: how many constraints can a frontier model satisfy at once?

A quick TL;DR of this chart:

  • The Y-axis is accuracy: given a whole bunch of rules to follow, what percentage of them does the model actually follow?
  • The X-axis, which is log scale, is how many rules the model is trying to follow at once.
  • The faint dotted lines are three older models that were available when the original paper was written and are still available today. You can see that past 100 rules, they begin to ignore some of the instructions they’re given. By 500 rules, they’re beginning to drop as many as half of them.
  • The bold lines are some current frontier models. You can see they get much further before they start dropping instructions. GPT 5.5 does the best while DeepSeek V4 Pro does the worst.

So that’s our headline finding: a year ago, frontier models started losing track of instructions at somewhere around 200-300 simultaneous constraints. Depending on what model you pick, that boundary is now closer to 2,000 instructions.

Put simply, frontier models have gotten close to 10X better at following instructions in the last 12 months, and this has a lot of implications for real-world AI engineering, including:

  • Skills files no longer have a compression problem
  • Prompts can be extremely detailed
  • A hard boundary of capability has become a soft tradeoff of cost-versus-capability

We’ll explain in full, but there’s a lot of nuance here, so we invite you to read on.

The original IFScale benchmark

Our work is based on the benchmark paper Jaroslawicz et al. (2025). The test is pretty simple: ask a model to write a business report that contains N specific keywords (chosen from a vocabulary of 500 ordinary English words like “customer”, “revenue”, “logistics”), then count how many keywords showed up correctly in the output.

The prompt itself is short:


You are tasked with writing a professional business report that adheres strictly to a set of constraints. Each constraint requires that you include the exact, literal word specified… The report should be structured like a professional business document with clear sections and relevant business insights. Do not simply repeat the constraints; rather, use them to inform the text of the report.

 

CONSTRAINTS

  1. Include the exact word: ‘customer’.
  2. Include the exact word: ‘revenue’. …

We test this output the same way the original paper did: with a simple regex-based exact-match. Plurals don’t count. Hyphenations don’t count. “Customer” satisfies “customer”; “customers” does not. We call the number of keywords density or N, and the percentage of them that show up in the output is accuracy.
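A minimal sketch of this kind of exact-match scoring is below. The post does not show the regex the paper uses, so the pattern is an assumption. In particular, treating "hyphenations don't count" as "a keyword touching a hyphen is not a match" is one interpretation.

```python
import re


def keyword_present(keyword: str, text: str) -> bool:
    """True if the keyword appears as a standalone word, ignoring case.

    The lookarounds reject letters, digits, underscores, and hyphens on either
    side, so "customers" and "customer-facing" do not match "customer".
    """
    pattern = rf"(?<![\w-]){re.escape(keyword)}(?![\w-])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def accuracy(keywords: list[str], text: str) -> float:
    """Fraction of required keywords found in the text."""
    found = sum(keyword_present(k, text) for k in keywords)
    return found / len(keywords)


report = "Customer demand drove growth. Our customers praised the logistics team."
print(accuracy(["customer", "revenue", "logistics"], report))
```

Here N is 3 and two of the three keywords appear, so the accuracy is 2/3.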

We believe, as the original authors did, that this is a good proxy for the more general question of “how many instructions can a model follow at once?” The keywords are arbitrary, but they represent the kind of discrete, named constraints that show up in real-world skills files: “if the user says X, do Y”, “include a section on Z”, “don’t use the phrase W”. If a model can’t track 200 discrete items in a single prompt, that’s a problem for any skill spec with more than 200 items in it.

The original paper’s results are still true

Before extending the benchmark, we figured we should try replicating the original finding. So we re-ran it on three models from the original paper that are still available a year later: GPT-4.1, Claude Sonnet 4 (May 2025 release), and Gemini 2.5 Pro.

Fun fact: it turns out May 2026 is the last month these models will be available! They all get retired in June, so we got lucky that there was still something to compare.

We copied the original paper’s prompt and its 500-word vocabulary, then ran the test from N=10 to N=500, with five tries at each density, averaged. This worked: our accuracy curves matched the shapes the original paper reported, with deltas under 3 percentage points at low densities, growing to about 10 points at N=500, which is well within the noise of a five-seed test.
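The sweep described here (several densities, a few seeded tries each, averaged) can be sketched as follows. The sketch runs offline: `fake_model` is a stand-in for a real API call, the vocabulary is a placeholder, and the prompt builder and scoring regex are assumptions rather than the benchmark's actual code.

```python
import random
import re
from statistics import mean
from typing import Callable


def build_prompt(keywords: list[str]) -> str:
    """Number the constraints in the style of the prompt excerpt above."""
    lines = [f"{i}. Include the exact word: '{w}'." for i, w in enumerate(keywords, 1)]
    return "CONSTRAINTS\n" + "\n".join(lines)


def score(keywords: list[str], text: str) -> float:
    """Fraction of keywords present as standalone words (case-insensitive)."""
    hits = sum(
        bool(re.search(rf"(?<![\w-]){re.escape(w)}(?![\w-])", text, re.IGNORECASE))
        for w in keywords
    )
    return hits / len(keywords)


def sweep(vocab: list[str], densities: list[int], seeds: int,
          model: Callable[[str], str]) -> dict[int, float]:
    """Average accuracy per density N over several seeded keyword samples."""
    results = {}
    for n in densities:
        runs = []
        for seed in range(seeds):
            keywords = random.Random(seed).sample(vocab, n)
            runs.append(score(keywords, model(build_prompt(keywords))))
        results[n] = mean(runs)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for an API call: echoes only the first 20 requested words."""
    words = re.findall(r"'([^']+)'", prompt)
    return " ".join(words[:20])


vocab = [f"word{i}" for i in range(500)]  # placeholder vocabulary
print(sweep(vocab, densities=[10, 20, 40], seeds=5, model=fake_model))
```

Because the stand-in model caps out at 20 words, accuracy stays at 1.0 up to N=20 and falls to 0.5 at N=40, which mimics the shape of a model that drops instructions past a ceiling.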

The original models, tested with the original methodology, show the same accuracy curves as the original paper.

Moving the goalposts

Our initial plan was to simply try newer models on exactly the same data: same prompt, same vocabulary, same N range. But we quickly ran into a problem: the new models were doing so well that they were hitting 100% accuracy at N=500. The original paper’s ceiling was 500 constraints because that’s where the models of the day started dropping instructions; but current models are still perfect at 500 words!

So we had to make things harder: more words to include, out of a larger vocabulary. This took a number of attempts because we couldn’t find the ceiling. We doubled the number of instructions to follow, then doubled it again. Eventually we landed on a 10,000-word vocabulary before we started seeing meaningful degradation. Frontier models are a lot better!

Frontier models fail in different ways

The new frontier models fail in different ways.

You’ve already seen the new data: the frontier models do way better. But how they do better is fascinating. They mostly don’t show the same kind of failure mode of merely “ignoring instructions”. Instead they have different problems when the number of rules gets really high.

DeepSeek V4 Pro is the closest to the original pattern: it starts dropping instructions around N=750, and by N=2,000 it’s ignoring nearly half of them.

Claude Opus 4.7 thinks the test is dangerous: In the chart, you can see that Claude does quite a lot better than DeepSeek, but the chart hides the curious behavior we observed: instead of merely forgetting instructions, Claude would outright refuse to answer, returning an API-level “refusal” error.

As far as we can tell, this is the unexpected effect of an Anthropic safety feature. Opus has a very sensitive “refusal classifier”: if you include certain combinations of words (like “anthrax” and “cyanide”) in the prompt, it will refuse to answer at all. The more words we included, even deliberately innocuous ones, the more likely we were to hit a combination of words that added up to “danger” as far as Claude is concerned.

We ended up having to run the entire vocabulary through OpenAI’s moderation API to remove “danger” words before we could get Opus to stop refusing, and even then we had to retry a lot. But the curve is still real: when it didn’t refuse outright, Opus was still beginning to forget some of the instructions it was given. By N=5,000, it was only following about half of them. But remember, the old ceiling was 200! This was still an order of magnitude improvement at least.

Gemini 3.1 Pro is rock-solid until it starts overthinking: Gemini’s data has a lot of noise in it because Gemini’s behavior is very unpredictable at high N. Up to N=5,000 it does incredibly well. Past that, it begins to fail in a very strange way: instead of forgetting instructions, it just spends its entire output budget on internal reasoning tokens and emits little or no visible report at all. It’s as if the model is trying so hard to follow the instructions that it runs out of “thinking space” and produces no answer at all.

GPT 5.5 thinks this test is stupid: GPT 5.5 does the best of all the models we tried, holding 99% accuracy through N=5,000 (with one stochastic dip at N=4,000 that you can see in the chart). Past that, it falls off. And at very high N, instead of merely dropping instructions, we began to get refusals. Instead of API-level refusals like Opus, GPT 5.5 would occasionally respond with a polite output message like this one:

I’m sorry, but the requested report cannot be produced in full within the practical response limits of this interface because it requires incorporating 4,000 exact terms while also maintaining a coherent professional business-report structure.

It would start generating the report, get frustrated, and then stop with a message like the above. In fairness to GPT 5.5, it’s sort of true! Asking for a coherent business report without specifying what the report’s about except that it should contain 5,000 specific words is a pretty unreasonable request, and GPT called us out on it. But it still counts as failure; the half-finished reports it produced would have a fraction of the required keywords.

What this means for real-world AI engineering

The capacity to track 2,000 (and up to 5,000!) simultaneous named constraints in a single prompt is now real on GPT 5.5 and Gemini 3.1 Pro. A year ago, all the frontier models would have failed at a fraction of that. This has some pretty big implications for how we build skills and agents:

  • Skills files have less of a compression problem. If you were relying on data from a year ago, you’d be writing skills files that were pretty short (200 instructions or fewer), and then pointing to sub-skills or subagents. This is much less necessary now. You can include really long, detailed instructions in a single file and have confidence that the model will follow them.
  • Prompts can be extremely detailed. If you have a use case that requires a lot of discrete constraints, you can now include them all in the prompt without worrying about the model losing track of them. This opens up new possibilities for complex tasks that require a lot of specific instructions. Anecdotally, a lot of people have been discovering this already.
  • The trade-off has shifted from “can the model do it?” to “is the cost worth it?” A prompt or a skill with 2,000 instructions in it is probably going to work. It’s also going to be really long! That will make it more expensive from a token perspective and slower to run. So you still have to think about length, but it’s a much softer ceiling than it was.

Some caveats

We intend to turn our findings into a real, formal paper with a full methodology breakdown and a more rigorous analysis of the failure modes. In the meantime, here are some important caveats to keep in mind about this data:

  • IFScale measures named-item inclusion. The capacity result is evidence that long skills files are viable, not proof that every kind of instruction in them is followed. We are using a proxy task that we think generalizes to all instructions, but proof is a higher bar than evidence.
  • Different models hit the wall at very different N. The chart shows inflection points spanning roughly N=750 to N=9,000+. Pick your model carefully, and if it’s Claude remember to look out for danger words.
  • Not every failure is at the API level. Claude’s API-level refusals are annoying, but GPT’s half-finished reports with polite refusals are more frustrating: you have to read the entire output to know if it was really paying attention or if it gave up.
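Since these failure modes look different on the wire, one option is to label each run before scoring it. The sketch below is a heuristic illustration only: the `api_refused` flag, the refusal phrases, and the label names are assumptions, and the post says an LLM-as-a-judge was used to flag refusals rather than a regex.

```python
import re

# Hypothetical phrases; a real check would need tuning against actual outputs.
REFUSAL_HINTS = re.compile(
    r"\b(i'm sorry|i am sorry|cannot be produced|unable to)\b", re.IGNORECASE
)


def classify_run(api_refused: bool, visible_text: str) -> str:
    """Label a run by failure mode before counting keywords."""
    if api_refused:
        return "api_refusal"  # the provider returned a refusal error
    if not visible_text.strip():
        return "empty_output"  # e.g. output budget spent on reasoning
    normalized = visible_text.replace("\u2019", "'")  # curly to straight apostrophe
    if REFUSAL_HINTS.search(normalized):
        return "in_text_refusal"  # partial report followed by a polite refusal
    return "completed"


print(classify_run(False, "I\u2019m sorry, but the requested report cannot be produced in full."))
```

Searching the whole output rather than only its opening matters here, because the in-text refusals described above arrive after a partial report.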

What it cost

We instrumented every run with Arize AX from day one. It gave us a fast way to filter traces for various failure modes, and at the end we asked AX a question we couldn’t have answered any other way: what did this whole experiment cost?

Model                       Calls    Cost (USD)
claude-opus-4-7             1,326    $121.82
gpt-5.5-2026-04-23          92       $37.51
gemini-3.1-pro-preview      120      $23.64
claude-sonnet-4-20250514    250      $15.45
gemini-2.5-pro              251      $6.10
gpt-4.1-2025-04-14          250      $3.46
deepseek-v4-pro             56       $1.22
Total                       2,345    $209.19

The answer is $209. Note that we spent far more on Claude than anything else because Claude was also running the LLM-as-a-judge that checked the output for coherence and flagged refusals. It also includes a lot of trial-and-error runs as we were tuning the vocabulary and trying to find the ceiling for each model.
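Dividing each cost by its call count gives a per-call view of the same table. The snippet below does that arithmetic with the figures copied from the table. The per-call numbers are only a rough guide, since the models were run at different densities and the Claude total includes judge calls.

```python
# (calls, cost in USD) copied from the table above.
runs = {
    "claude-opus-4-7": (1326, 121.82),
    "gpt-5.5-2026-04-23": (92, 37.51),
    "gemini-3.1-pro-preview": (120, 23.64),
    "claude-sonnet-4-20250514": (250, 15.45),
    "gemini-2.5-pro": (251, 6.10),
    "gpt-4.1-2025-04-14": (250, 3.46),
    "deepseek-v4-pro": (56, 1.22),
}

# Average cost per call for each model, printed from most to least expensive.
per_call = {model: cost / calls for model, (calls, cost) in runs.items()}
for model, usd in sorted(per_call.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:28s} ${usd:.3f} per call")

total_calls = sum(calls for calls, _ in runs.values())
print("total calls:", total_calls)
```

On these figures, GPT 5.5 is the most expensive per call at roughly $0.41, while Opus has the largest total but works out to about $0.09 per call.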

What’s next

As mentioned, we’re hoping to submit this research as a formal paper, which will require extra work. But in the meantime, the conclusions are yours to act on, and all the code and data is open source and available for you to explore and build on.

A year ago, skills files were a compression problem. They aren’t anymore. Now they’re a verification problem. Plan accordingly.

If you’re building agents or skills against the 2026 frontier and want help thinking about how to evaluate them at scale, Arize AX is what we use.

The post Models got an order of magnitude better at following instructions in one year appeared first on Arize AI.

]]>
From observability to context: What’s next for Arize Phoenix https://arize.com/blog/from-observability-to-context-whats-next-for-arize-phoenix/ Mon, 11 May 2026 13:46:54 +0000 https://arize.com/?p=28375 As agents start changing software, they need a way to verify their work that includes traces, evals, feedback, and APIs. This is where Phoenix goes next — not the next release, but what this product becomes.

The post From observability to context: What’s next for Arize Phoenix appeared first on Arize AI.

]]>

As agents start changing software, they need a way to verify their work that includes traces, evals, feedback, and APIs. This is where Phoenix goes next — not the next release, but what this product becomes.

Observability was built for a world where humans did the reasoning. A human deployed code, reviewed traces, triaged alerts, and diagnosed failures. The tooling reflected that workflow: dashboards, filters, alerts, and evaluation pipelines. All of it assumed a human in the loop.

That assumption is rapidly breaking. Agents are becoming operators, too. They write code, change prompts, call tools, and modify systems. But they cannot work from tools designed only for human inspection. They need usable context: traces, evals, feedback, experiments, and annotations they can query, interpret, and act on.

That is what we are building with Phoenix: not just observability for humans, but a context platform for humans and agents to build great AI-native software together.

TL;DR

  • Coding agents need feedback loops to verify whether their changes actually improved system behavior.
  • Traces, evals, experiments, and feedback are becoming the verification layer for agentic systems.
  • Context cannot live only in dashboards; it needs to be accessible through APIs, CLIs, and agent-facing interfaces.
  • The goal is to close the loop so agents can self-improve: trace, eval, diagnose, fix, and rerun.
  • We believe AI observability is evolving into a context platform where humans and agents will debug and improve systems together.

The evolution of software engineering

Diagram titled “Who consumes telemetry data?” showing three side-by-side phases of software evolution and how telemetry flows through observability systems. Phase 1: Software 1.0 Flow: Application → Observability Platform → Human Dev → IDE. The application sends telemetry to the observability platform, which is read by a human developer who writes code in an IDE and deploys updates back to the application. Caption: “Human is the sole consumer of telemetry.” Phase 2: Software 2.0 Flow: Application → Observability Platform → Human Dev → Agentic IDE containing a Coding Agent (Claude Code / Cursor). The human developer prompts the coding agent inside the IDE to write code, then deploys updates back to the application. Caption: “Human prompts, agent writes code inside IDE.” Phase 3: Autonomous Flow: Application → Observability Platform → Coding Agent (Claude Code / Cursor) → Human. The coding agent directly reads telemetry from the observability platform, iterates autonomously, and only notifies the human, who “monitors only.” Caption: “Agent autonomously consumes telemetry.” Bottom callout text: “Observability platforms must evolve from human dashboards to programmatic interfaces that agents can consume.”

Two things changed at once.

First, agents started writing and modifying a lot of code. Six months ago, most of us coded by hand with occasional AI assistance. Now, many of us run multiple agents in parallel and step in only when careful scrutiny is needed. But the infrastructure around these workflows hasn’t fully caught up. A coding agent can generate changes, but it often has no effective way to determine whether those changes actually improved the AI system’s behavior.

At the same time, AI-native software is increasingly becoming the norm. Today’s software involves more than just code. You have code, prompts, and model weights. In these systems, you can’t review behavior before it runs. You can only observe it after.

For agents to close the loop, they need verification. For humans to trust agent-generated changes, those agents need to produce evidence and share context: traces, eval results, experiment comparisons, and failure examples. That evidence can’t just live in dashboards. It needs to be available through programmatic interfaces like APIs, CLIs, and other agent-accessible endpoints so agents can inspect what happened, reason over failures, and prove whether their changes worked.

Why traces matter more when behavior is non-deterministic

Comparison diagram contrasting “Traditional Software” with “AI Agent” workflows. On the left, a panel titled “Traditional Software” with subtitle “You can read the logic.” A flowchart shows a deterministic code path: handleSubmit() leads to a decision diamond labeled “valid?” If “no,” flow goes to an “Error” box. If “yes,” flow goes to callAPI() Another decision diamond labeled “status?” branches: “500” leads to retry() “200” leads to save(), then Done Caption at bottom: “Every path is visible. Deterministic. You see it all before it runs.” In the center is “vs.” On the right, a panel titled “AI Agent” with subtitle “You can’t read the logic.” A diagram shows: agent.run(query) feeding into a central “LLM” node with the text “What should I do?” The LLM may choose actions like search(), write(), or analyze(), each labeled “maybe?” Additional loops connect to smaller LLMs and “Sub-Agent 1” and “Sub-Agent 2,” with annotations like “loop back” and “more loops.” All paths eventually lead to an “Output” box labeled “different every time.” Caption at bottom: “You can’t see any of this in the code. You only see it after, in the traces.” At the bottom of the overall graphic, an arrow labeled “source of truth shifts” points from a blue “Code” box to a green “Traces” box.

In traditional software, you can read the code. Every path a program might take is visible and deterministic; you see it all before it runs. In an agent, you can’t. Decisions happen inside the model at runtime and can differ from run to run, even when the context and prompts are exactly the same. You only see what was decided afterward, in the traces.

That means the source of truth has shifted from code to traces. Traces offer a record of each LLM call, tool invocation, routing decision, retrieved document, latency spike, and failure. They’ve become the substrate for evals, experiments, debugging, and regression analysis.

Without programmatic access to traces, an agent can change an agentic application but cannot reliably inspect whether the application behaved better after the change.

This is not abstract. When a coding agent produces a change to an agentic application, the only record of what that application did — its decisions, its tool calls, its outputs — lives in the traces. Source code tells you what could happen. Traces tell you what did. In a non-deterministic system, the latter matters more.
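
To make the contrast concrete, here is a minimal sketch of two runs of the same agent code that took different paths. The `Span` shape and the helper names are assumptions made for illustration; they are not the Phoenix or OpenInference schema.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One recorded step in a run. Hypothetical shape, not Phoenix's schema."""
    name: str
    kind: str          # e.g. "llm" or "tool"
    status: str        # "ok" or "error"
    latency_ms: float
    attributes: dict = field(default_factory=dict)


def tool_path(spans):
    """Return the ordered tool calls the run actually made."""
    return [s.name for s in spans if s.kind == "tool"]


def failed_spans(spans):
    """Return every span that recorded an error."""
    return [s for s in spans if s.status == "error"]


# Same code, same prompt, two different trajectories (made-up data).
run_a = [
    Span("plan", "llm", "ok", 820.0),
    Span("search", "tool", "ok", 140.0),
    Span("write", "tool", "ok", 95.0),
]
run_b = [
    Span("plan", "llm", "ok", 790.0),
    Span("search", "tool", "error", 5000.0, {"error": "timeout"}),
    Span("search", "tool", "ok", 160.0),
    Span("write", "tool", "ok", 101.0),
]

print(tool_path(run_a))
print(tool_path(run_b))
print([s.attributes for s in failed_spans(run_b)])
```

Nothing in the source code distinguishes the two runs; only the recorded spans show that the second one hit a timeout and retried.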

Phoenix is already vendor-agnostic with OpenTelemetry-based conventions that work across every major framework and provider. The traces are there. The question is what we do with them next.

From observability platform to a context platform

Every previous platform category was defined by what it observed: ML metrics, LLM inputs and outputs, agent trajectories, and so on.

The next useful layer is defined by what it enables: shared context that humans and agents can both query and act on. It should hold that context (traces, evals, annotations, feedback, and experiments) and expose it to both.

Not an ML platform, an LLM platform, or an agent platform. A context platform.

The distinction matters because it changes what we build. An observability platform optimizes for human consumption with dashboards, visualizations, and alerting rules. A context platform optimizes for both human and programmatic consumption. It should expose GraphQL APIs, CLIs, MCP endpoints, and other surfaces where agents can query traces, correlate failures, and reason over eval results without a human intermediary.

We started building toward this with the Phoenix CLI earlier this year. That’s because AI coding assistants operate through the terminal and the filesystem. A browser-based UI is useful for humans, but inaccessible to an agent working in your IDE. Programmatic interfaces need to meet agents where they already are.

Context should lead to action

But a context platform that only serves dashboards is still waiting for a human to act. The next step is transforming context into action. We need to turn observability into something that doesn’t just describe what happened, but participates in fixing it.

This is the product thesis that drives everything that follows.

Agent evals

If traces are the source of truth and the platform’s job is to turn context into action, evaluation is where it starts.

The evaluation surface area is expanding

Framework diagram showing increasing complexity of AI evaluation systems across two axes. The vertical axis on the left is labeled “eval inputs” with an upward arrow. Input types increase from bottom to top: input / output reference full trajectory sandboxes repetitions ops metrics change history The horizontal axis along the bottom is labeled “what’s under test” with a rightward arrow. Categories increase from left to right: prompt tools security orchestration context mgmt environment infra self-reflection Four overlapping dashed rectangles represent progressively larger evaluation scopes: LLM (small purple box in lower-left) Covers mostly prompt and tools. Caption: “Is the output correct?” Agent (larger blue box surrounding LLM) Expands into orchestration and context management. Caption: “Did it make the right decisions?” Harness (larger teal box surrounding Agent) Extends into environment and infrastructure testing. Caption: “Does the system work reliably?” Fully Autonomous Agents (largest green box surrounding all others) Extends furthest across infrastructure and self-reflection, and highest on the eval-input scale. Caption: “Does it get better over time?” The overall graphic illustrates how evaluating AI systems evolves from simple LLM output checking to broader system-level and autonomous-agent evaluation requiring richer inputs and wider operational coverage.

The evaluation problem is expanding along two dimensions simultaneously. On one axis, what’s being evaluated is growing from prompts and model outputs into tools, security, orchestration, context management, environment interaction, infrastructure, and self-reflection. On the other axis, the eval inputs themselves are growing richer, evolving from simple input/output pairs to full trajectories, sandboxed execution, repetition analysis, ops metrics, and change history.

At the LLM level, the question is straightforward: is the output correct? But at the agent level, it’s more nuanced: did it make the right decisions? At the harness level, the question is more operational: does the system work reliably? And for fully autonomous agents, the question we’re ultimately building toward is self-improvement: does it get better over time?

The surface area of what needs evaluation is outpacing what a single LLM call can assess.

Moving from LLM-as-a-Judge to Agent-as-a-Judge

Comparison diagram contrasting “LLM as a Judge” with “Agent as a Judge.” On the left, under the heading “LLM as a Judge” with subtitle “Single pass”: Three input boxes labeled “Input,” “Output,” and “Criteria” point into a central purple circle labeled “LLM.” The LLM produces a final box labeled “Label, score, explanation.” The flow represents a one-step evaluation process where a model directly grades an output based on provided criteria. On the right, under the heading “Agent as a Judge” with subtitle “Iterative evaluation”: A central purple circle labeled “Agent” is connected to multiple surrounding components: “Traces” “Sandbox” “Subagents” “Feedback” “Multi-criteria evals” “Tools” “Context” Arrows show the agent interacting iteratively with these systems and evaluation inputs. The layout emphasizes a more dynamic, multi-step evaluation process involving tools, environments, feedback loops, and multiple evaluation dimensions. A dotted vertical line separates the two approaches, highlighting the shift from simple single-pass LLM judging to more complex agentic evaluation systems.

Evaluation strategies naturally evolve to match the complexity of what they measure. LLM-as-a-Judge emerged when generative outputs broke the assumption of a single right answer. It worked because LLMs are flexible enough to capture subjectivity and can be customized along the axes of quality that teams actually care about.

But this approach is shaped for outputs LLMs produce in isolation: input/output pairs that fit cleanly into a prompt. It is a single pass: input, output, and criteria go in; a label, score, and explanation come out.
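
As a rough sketch of that single-pass shape, the snippet below builds one prompt from input, output, and criteria and parses one structured verdict back. The prompt text, the `Verdict` fields, and `fake_model` are all assumptions made for illustration; `fake_model` stands in for a real model call and always returns the same answer.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    label: str
    score: float
    explanation: str


# Hypothetical prompt template; a real rubric would be more detailed.
PROMPT = """You are grading a response.
Criteria: {criteria}
Input: {input}
Output: {output}
Reply with JSON containing "label", "score", and "explanation"."""


def judge(input_text: str, output_text: str, criteria: str,
          call_model: Callable[[str], str]) -> Verdict:
    """Single pass: one prompt in, one structured verdict out."""
    prompt = PROMPT.format(criteria=criteria, input=input_text, output=output_text)
    data = json.loads(call_model(prompt))
    return Verdict(str(data["label"]), float(data["score"]), str(data["explanation"]))


def fake_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM call, used only for this example."""
    return json.dumps(
        {"label": "correct", "score": 1.0, "explanation": "Matches the criteria."}
    )


verdict = judge("What is 2 + 2?", "4",
                "The answer must be numerically correct.", fake_model)
print(verdict)
```

The judge never sees tool calls, intermediate steps, or the environment. Everything it knows has to fit into that one prompt.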

Agentic systems are the next step. Agents don’t produce outputs in isolation; they interact with tools, environments and humans, operating in a long-running loop. Evaluation agents fit this shape because they share it: iterative reasoning, tool use, context management, sub-agents, the ability to run code and verify outputs. The same capabilities that make agents useful in production make them well-suited to evaluating other agents.

This is where Phoenix’s role as a context platform becomes concrete. The evaluator agent consumes the same traces, experiment data, annotations, and feedback the rest of the system produces. In that way, evaluation becomes a consumer of the platform’s context like everything else instead of a separate workflow.

Closing the loop: trace, eval, fix, rerun

Everything we’ve discussed, from traces as the source of truth to the context platform, agent evals, and sandboxes, converges on a single architectural goal: pulling verification into the agent loop itself.

Building verification into an agent loop

Workflow diagram showing how an agentic application integrates with Arize Phoenix and a coding agent to create an automated feedback loop for debugging and improvement. On the left: A box labeled “Agentic App” sends data into an “Arize Phoenix” section. The Phoenix section contains three stacked components: “Traces” “Evaluation” “Feedback” Text beside the arrows reads: “Traces, Evaluation, Feedback.” In the center: Outputs from Arize Phoenix flow into a box labeled “Phoenix CLI.” On the right: The Phoenix CLI connects to a green box labeled “Coding Agent.” Text above the connection reads: “Query, Correlate, Reason.” The Coding Agent connects downward to a box labeled “CODEBASE.” Text on that connection reads: “Implement change (PR) + Restart App.” A circular feedback loop appears on the far right: The CODEBASE re-runs the workload. A “test” step feeds back into the “user journey.” The user journey returns to the Coding Agent. Green text beside the loop reads: “Feedback Loop.” The overall diagram illustrates an automated development cycle where telemetry, evaluations, and feedback from Phoenix help a coding agent reason about issues, modify the codebase, rerun workloads, and iteratively improve the application.

Here’s how it works concretely:

  • An agentic application emits traces, evaluation results, and feedback into Phoenix.
  • A coding agent like Claude Code or Cursor queries Phoenix through the CLI or API to understand what went wrong.
  • The coding agent correlates traces with eval failures, reasons over the context, implements a change, commits it, and re-runs the workload.
  • The new traces flow back into Phoenix and the loop starts again.
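
The steps above can be sketched as a small control loop. This is an illustration under stated assumptions, not Phoenix code: `run_workload`, `evaluate`, and `propose_fix` are hypothetical stand-ins for the agentic app, the evals, and the coding agent, and the toy workload simply fails requests that need more retries than the config allows.

```python
def improvement_loop(config, run_workload, evaluate, propose_fix, max_iterations=5):
    """Trace, eval, fix, rerun until the evals pass or the budget runs out."""
    history = []
    for iteration in range(1, max_iterations + 1):
        traces = run_workload(config)            # app emits traces
        failures = evaluate(traces)              # evals flag failing traces
        history.append({"iteration": iteration, "failures": len(failures)})
        if not failures:
            break
        config = propose_fix(config, failures)   # coding agent changes the system
    return config, history


# Toy stand-ins, all hypothetical.
def run_workload(config):
    retries_needed = [0, 1, 2]  # retries each request needs to succeed
    return [
        {"request": i, "ok": needed <= config["max_retries"]}
        for i, needed in enumerate(retries_needed)
    ]


def evaluate(traces):
    return [t for t in traces if not t["ok"]]


def propose_fix(config, failures):
    return {**config, "max_retries": config["max_retries"] + 1}


final_config, history = improvement_loop(
    {"max_retries": 0}, run_workload, evaluate, propose_fix
)
print(final_config)
print(history)
```

The `max_iterations` bound is a deliberate design choice in this sketch: a loop that edits its own system should have a stopping condition other than success.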

The early pieces of this workflow already exist. The Phoenix CLI supports px traces for fetching trace data, px api graphql for arbitrary queries against the Phoenix backend, and agent skills that teach coding agents how to interpret trace data and diagnose failures. The next step is connecting these pieces into a continuous loop.

Moving towards self-improving agents and self-improving systems

Agents are rapidly gaining the ability to evolve themselves. The next step is self-improving agents and self-improving systems. To get there, we need tools and guardrails, triggers that initiate improvement cycles, context to reason over, and CI gates that enforce quality before changes ship.

The core principle: make coding agents prove their work. Don’t trust that an agent’s change is correct. Instead, require that agents demonstrate correctness through traces, evals, and experiments. Phoenix becomes the verification layer in this loop: the infrastructure through which agents produce evidence and humans audit it.
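
One way to picture such a gate, as a minimal sketch: compare per-example eval results from a baseline run and a candidate run, and approve only if nothing that used to pass now fails and the pass rate does not drop. The `gate` function, its threshold, and the example data are assumptions made for illustration, not a Phoenix feature.

```python
def pass_rate(results):
    """Fraction of examples that passed; results maps example id -> bool."""
    if not results:
        raise ValueError("no eval results to compare")
    return sum(results.values()) / len(results)


def gate(baseline, candidate, min_improvement=0.0):
    """Approve a change only if it is supported by the eval evidence."""
    # An example regresses if it passed before and fails (or is missing) now.
    regressions = sorted(
        example for example, passed in baseline.items()
        if passed and not candidate.get(example, False)
    )
    delta = pass_rate(candidate) - pass_rate(baseline)
    approved = not regressions and delta >= min_improvement
    return {"approved": approved, "delta": delta, "regressions": regressions}


# Made-up results: the change fixes q2 and q4 but breaks q3.
baseline = {"q1": True, "q2": False, "q3": True, "q4": False}
candidate = {"q1": True, "q2": True, "q3": False, "q4": True}

result = gate(baseline, candidate)
print(result)
```

In this example the aggregate pass rate improves, yet the gate still rejects the change because `q3` regressed. Checking per-example regressions as well as the aggregate is what gives a human reviewer something specific to audit.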

Agents debugging agents

This brings us to the deeper implication. If agents can query traces, run evals, and reason over feedback, then agents can debug other agents. The human doesn’t disappear. They supervise. But the diagnostic work shifts.

Active and passive modes

Phoenix sits at the center of two interaction patterns:

  • In the active mode, a human directs an agent through Phoenix by querying traces, running evals, and iterating on changes with the agent as a collaborator.
  • In the passive mode, an autonomous agent or assistant works proactively to surface issues and propose changes while the human reviews.

Both modes require the same infrastructure: shared context, programmatic access, and mechanisms for agents to present evidence of their reasoning. The difference is who initiates and iterates.

Over time, more workflows may shift toward the passive model. But the platform needs to support both modes because trust is contextual: teams will delegate more when failure modes are well understood, and retain tighter control when the stakes are high.

Phoenix Intelligence

This is where the product work converges: Phoenix Intelligence, an agent layer built directly into the platform.

We envision two tiers:

  • An assistant helps you along your AI engineering journey. This includes human-in-the-loop collaboration for debugging, dataset generation, eval design, and experiment setup. It meets you where you are, whether you’re a PM exploring production data or an ML engineer optimizing retrieval.
  • The experts are specialized agents that work autonomously from the observability context. They research your use case, alert you about regressions, run evaluations, and surface insights continuously inside your observability platform.

Agents building agents, improving agents, observed by agents. The platform that holds the context is the platform that acts on it. That’s the thesis, and what we’re building.

Final thought: the dial of delegation

None of this means engineering is dead. In fact, it’s quite the opposite. As agents take on more of the verification work, what they can’t take on becomes more valuable: the judgment about when to trust them, what to delegate, and what to keep close.

The dial of delegation is a human decision. The teams that learn to turn it carefully, pulling back when the stakes are high and leaning in when the patterns are well understood, are the ones that will move fastest. Trust isn’t given to agents; it’s earned. The engineers who calibrate it well will define what their organizations can do.

Phoenix is an open source platform for agent development and evaluation. The CLI, tracing infrastructure, eval framework, and OpenInference conventions all ship in the open. What comes next builds on that foundation. If you’re thinking about these problems, we’d love to hear from you.

– The Arize Open Source Team

The post From observability to context: What’s next for Arize Phoenix appeared first on Arize AI.

]]>