Agent quality degrades silently. A prompt change, a model upgrade, or a new tool can alter behaviour across hundreds of evaluation dimensions without surfacing as an exception. A structured eval harness catches regressions before they reach users.
CRAFT uses Langfuse for observability and evaluation. Before setting up your eval harness, complete the Langfuse Setup Guide to connect your agent to your Langfuse project.
Pydantic AI emits OTEL spans automatically when a TracerProvider is configured. Every agent run, tool call, and MCP request appears as a structured span.
The Anthropic SDK emits OTEL spans automatically when a TracerProvider is configured — the same setup as Pydantic AI and ADK works without additional configuration:
A golden trace is a recorded agent execution that represents the correct behaviour for a given input. Store golden traces in your test fixtures and replay them to catch regressions.
import jsonfrom langfuse import Langfuselangfuse = Langfuse()# Fetch a specific trace by ID from Langfusetrace = langfuse.fetch_trace("trace-id-from-langfuse-ui")# Serialize to a fixture filegolden = { "input": trace.input, "expected_output": trace.output, "tool_calls": [ {"name": span.name, "input": span.input, "output": span.output} for span in trace.observations if span.type == "SPAN" and span.name.startswith("mcp.tool.") ],}with open("tests/fixtures/golden_revenue_query.json", "w") as f: json.dump(golden, f, indent=2)
For structured outputs where correctness can be computed without an LLM:
def score_tool_call_order(trace_id: str, tool_calls: list[dict]): """Score whether the agent called tools in the correct order.""" names = [tc["name"] for tc in tool_calls] # get_schema must precede execute_sql if "execute_sql" in names and "get_schema" in names: schema_idx = names.index("get_schema") sql_idx = names.index("execute_sql") score = 1.0 if schema_idx < sql_idx else 0.0 elif "execute_sql" in names: score = 0.0 # called execute_sql without schema check else: score = 1.0 # no SQL — correct for non-SQL queries langfuse.score( trace_id=trace_id, name="tool_call_order", value=score, )
Track token usage per eval run to catch cost regressions alongside quality regressions:
# After each eval run, log usage to Langfuselangfuse.score( trace_id=trace_id, name="input_tokens", value=usage.input_tokens, data_type="NUMERIC",)langfuse.score( trace_id=trace_id, name="output_tokens", value=usage.output_tokens, data_type="NUMERIC",)langfuse.score( trace_id=trace_id, name="tool_call_count", value=toolset.tool_call_count, data_type="NUMERIC",)
Set budget alerts in Langfuse if token usage per task exceeds a threshold — this catches prompt regressions that inflate input context or cause tool call loops.