Debugging Agents

Agent failures are rarely obvious from error messages alone. This page covers how to inspect traces, analyse tool calls, iterate on prompts, and diagnose the most common failure modes.

Toolbox

Before debugging, confirm you have access to these tools:
Tool | Purpose
Langfuse | Trace inspection, span hierarchy, token usage, LLM input/output
ADK Web (adk web) | Interactive ADK agent debug UI: replay conversations, inspect tool calls
curl / jq | Direct A2A JSON-RPC invocation for isolated testing
Agent Card (/.well-known/agent-card.json) | Verify capabilities and skills are declared correctly
Application logs | Structured JSON logs with task_id, context_id, user_id for correlation

Reading a Trace in Langfuse

Every agent request produces a trace in Langfuse. The span hierarchy for a Pydantic AI agent looks like:
agent.execute                          # outer span (task_id, context_id, user_id)
  agent.llm_stream [attempt=0]         # LLM call span
    mcp.tool.get_schema                # MCP tool span
    mcp.tool.execute_sql               # MCP tool span
    mcp.tool.upload_artifact           # MCP tool span
Key attributes to check:
  • agent.response_length — if 0, the agent produced no output (likely an error)
  • mcp.tool.status: success or error classification
  • mcp.tool.result_size_bytes — large values (>50KB) indicate context bloat risk
  • gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — token budget
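As a rough illustration, the checks above can be applied to a span's attributes once they are exported as a plain dict. The attribute names come from the list above; the function name flag_span_issues, the dict shape, and the threshold constant are assumptions made for this sketch, not part of Langfuse or CRAFT.

```python
from typing import Any

# Assumed threshold, mirroring the ">50KB" guidance above.
RESULT_SIZE_WARN_BYTES = 50_000


def flag_span_issues(attrs: dict[str, Any]) -> list[str]:
    """Return human-readable warnings for one span's attribute dict."""
    issues: list[str] = []
    if attrs.get("agent.response_length") == 0:
        issues.append("agent produced no output")
    if attrs.get("mcp.tool.status") == "error":
        issues.append("tool call returned an error")
    size = attrs.get("mcp.tool.result_size_bytes", 0)
    if size > RESULT_SIZE_WARN_BYTES:
        issues.append(f"tool result is {size} bytes (context bloat risk)")
    return issues


if __name__ == "__main__":
    # Hypothetical span attributes, for demonstration only.
    span = {"mcp.tool.status": "error", "mcp.tool.result_size_bytes": 120_000}
    for issue in flag_span_issues(span):
        print(issue)
```

Running a helper like this over every span in an exported trace gives a quick triage list before you open the trace in the UI.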

Symptom Index

Most likely causes:
  1. Task validation failure — the request is missing task_id, context_id, or authenticated user context. Check application logs for ValueError: Task ID and Context ID must be provided.
  2. Authentication error — the JWT token is missing or invalid. If your executor validates the Authorization header before processing the task, a missing or invalid token raises before emitting any events.
  3. MCP connection failure — the StreamableHttpTransport cannot connect to the MCP server. Check that MCP_SERVER_URL is set and the MCP pod is healthy.
Diagnostics:
# Check agent card — if this fails, the agent isn't running
curl http://localhost:8003/.well-known/agent-card.json

# Send a minimal test request
curl -X POST http://localhost:8003/ \
  -H "Content-Type: application/json" \
  -H "x-user-id: test-user" \
  -d '{
    "jsonrpc":"2.0","method":"message/send",
    "params":{"message":{"role":"user","message_id":"m1","context_id":"c1",
      "parts":[{"kind":"text","text":"hello"}]}},
    "id":"1"
  }'

# Check agent logs
kubectl logs -l app=my-agent --tail=100 | grep ERROR
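If you would rather script the test request, the JSON-RPC body from the curl command above can be built in Python. This sketch only constructs and prints the payload; the helper name build_message_send is hypothetical, and the field names are copied from the curl example.

```python
import json
from typing import Any


def build_message_send(
    text: str, message_id: str, context_id: str, request_id: str = "1"
) -> dict[str, Any]:
    """Build a minimal A2A message/send request body with one text part."""
    return {
        "jsonrpc": "2.0",
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "message_id": message_id,
                "context_id": context_id,
                "parts": [{"kind": "text", "text": text}],
            }
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # Redirect the output to a file and send it with: curl -d @request.json
    print(json.dumps(build_message_send("hello", "m1", "c1"), indent=2))
```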
Most likely causes:
  1. MCP server returning errors — the tool call reaches the MCP server but returns a structured error. Check mcp.tool.status in the trace span.
  2. Iteration limit hit — the agent has exceeded max_code_failures. Look for log line: Iteration limit reached for task {task_id}.
  3. Forbidden operation — the tool call uses a restricted pattern (e.g., filesystem access, blocked import). Look for LINT_ERROR in the tool result.
  4. Tool schema mismatch — the LLM is passing incorrect argument types. Check the mcp.tool.param_fingerprint across calls — if it’s consistent and always failing, the tool schema is wrong.
Diagnostics:
# Test the tool directly, bypassing the agent
import asyncio

from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

async def test_tool():
    # Replace <token> and <your-project-id> with real values before running
    client = Client(transport=StreamableHttpTransport("https://craft.emergence.ai/mcp",
        headers={"Authorization": "Bearer <token>", "X-Project-ID": "<your-project-id>"}))
    async with client:
        result = await client.call_tool(
            name="get_schema",
            arguments={"table_name": "orders", "schema_fqn": "db.schema.public"},
        )
        print(result)

asyncio.run(test_tool())
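The schema-mismatch check in cause 4 can be sketched as a small grouping step over exported tool-call records. The record keys (tool, param_fingerprint, status) are an assumed export shape, not a documented Langfuse format; adapt them to however you pull spans out of your traces.

```python
from collections import defaultdict
from typing import Iterable


def consistently_failing(calls: Iterable[dict]) -> list[tuple[str, str]]:
    """Return (tool, fingerprint) pairs called at least twice that never succeeded."""
    statuses: dict[tuple[str, str], list[str]] = defaultdict(list)
    for call in calls:
        key = (call["tool"], call["param_fingerprint"])
        statuses[key].append(call["status"])
    return [
        key
        for key, seen in statuses.items()
        if len(seen) >= 2 and all(s == "error" for s in seen)
    ]


if __name__ == "__main__":
    # Hypothetical records for demonstration only.
    sample = [
        {"tool": "execute_sql", "param_fingerprint": "abc", "status": "error"},
        {"tool": "execute_sql", "param_fingerprint": "abc", "status": "error"},
        {"tool": "get_schema", "param_fingerprint": "def", "status": "success"},
    ]
    print(consistently_failing(sample))
```

A pair that shows up here is a candidate for a schema problem rather than a transient server error, since the same argument shape fails every time.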
Symptoms: Agent responses become shorter, less accurate, or the LLM refuses to call tools. Token usage approaches the model’s context limit.
Most likely causes:
  1. Tool results too large — a tool is returning large payloads (DataFrames, Plotly figures) directly to the LLM. Your toolset should strip large fields and store them in the Assets API, passing only the resource URI.
  2. Conversation history too long — the task store is loading the full conversation history. Check agent.context_metrics.history_messages in the trace.
  3. System prompt too large — the instruction builder is including too much context. Check agent.context_metrics.instruction_length.
Diagnostics:
# Check token usage in Langfuse — look for input_tokens near model limit
# Approximate limits; they vary by model version, so confirm against your provider's documentation
# Gemini Flash: 1M tokens; Gemini Pro: 2M tokens; Claude Sonnet: 200K tokens

# Check tool result sizes in the trace
# mcp.tool.result_size_bytes > 50000 is a warning sign
Fix: Implement side payload interception in your toolset. Strip large data blobs (DataFrames, images, Plotly figures) before they reach the LLM context window; store them in the Assets API and pass only the resource URI.
Symptoms: A single agent request consumes 10x the expected tokens. The LLM is looping on tool calls or generating excessively long responses.
Most likely causes:
  1. Missing iteration limit — no max_code_failures guard on code execution tool. The LLM keeps trying different code variations.
  2. Tool always returning errors: the LLM keeps retrying a broken tool. Check mcp.tool.status across the trace; repeated error statuses for the same tool name are a strong signal.
  3. Infinite delegation loop — two agents are delegating to each other. Check the orchestrator’s sub_agents list for circular references.
  4. Large system prompt being rebuilt per turn — the instruction builder is fetching context on every LLM round-trip. Check agent.context_metrics.instruction_build_duration_s.
Diagnostics:
# Count tool calls per task from Langfuse
# agent.tool_calls_per_request > 20 is suspicious for most agents
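If your traces do not expose a per-request tool-call count directly, one way to approximate it is to count mcp.tool.* spans per task from an export. The span dict shape (name, task_id) is an assumption for this sketch; the threshold of 20 is the figure quoted above.

```python
from collections import Counter
from typing import Iterable

SUSPICIOUS_TOOL_CALLS = 20  # threshold quoted above


def tool_calls_per_task(spans: Iterable[dict]) -> Counter:
    """Count spans whose name starts with 'mcp.tool.' for each task_id."""
    counts: Counter = Counter()
    for span in spans:
        if span["name"].startswith("mcp.tool."):
            counts[span["task_id"]] += 1
    return counts


def suspicious_tasks(spans: Iterable[dict]) -> list[str]:
    """Return task_ids whose tool-call count exceeds the threshold."""
    counts = tool_calls_per_task(spans)
    return [task for task, n in counts.items() if n > SUSPICIOUS_TOOL_CALLS]


if __name__ == "__main__":
    # Hypothetical export: one task looping on execute_sql.
    spans = [{"name": "mcp.tool.execute_sql", "task_id": "t1"}] * 25
    print(suspicious_tasks(spans))
```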
Fix:
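# Assumes a Pydantic AI agent; adjust the import for other frameworks
from pydantic_ai.settings import ModelSettings

# Add a hard iteration limit. This guard belongs at the top of the
# code execution tool function, where ctx and settings are in scope.
if ctx.deps._code_failure_count >= settings.max_code_failures:
    return {
        "success": False,
        "error": "Execution limit reached. Ask the user for clarification.",
    }

# Set max_tokens to bound output cost
model_settings = ModelSettings(
    temperature=0,
    max_tokens=4096,  # bound output
)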
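The circular-reference check from cause 3 can be automated with a depth-first search over the delegation graph. This sketch assumes you can express each agent's sub_agents list as a plain mapping of agent name to child names; the function name and the example agent names are hypothetical.

```python
from typing import Optional


def find_delegation_cycle(sub_agents: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one delegation cycle as a list of agent names, or None if acyclic."""
    visiting: list[str] = []  # agents on the current DFS path
    done: set[str] = set()    # agents whose subtree is fully explored

    def visit(agent: str) -> Optional[list[str]]:
        if agent in visiting:
            # Close the loop: path from the first occurrence back to itself.
            return visiting[visiting.index(agent):] + [agent]
        if agent in done:
            return None
        visiting.append(agent)
        for child in sub_agents.get(agent, []):
            cycle = visit(child)
            if cycle:
                return cycle
        visiting.pop()
        done.add(agent)
        return None

    for name in sub_agents:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    graph = {"orchestrator": ["sql", "chart"], "sql": ["chart"], "chart": ["sql"]}
    print(find_delegation_cycle(graph))
```

Running a check like this at startup or in CI catches delegation loops before they show up as runaway token usage.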
Symptoms: state=failed with message “No datasource DataPart found” or “resource_uri resolution failed”.
Most likely causes:
  1. Missing DataPart in the message — the caller did not include a DataPart with type: "datasource". The A2A message must include both a TextPart and a DataPart.
  2. Wrong resource_uri format — the resource_uri must be in the full four-segment format: data:{org_id}:{project_id}:{name}. The simplified format (data:my-db) is not accepted.
  3. Missing selected_schemas: the selected_schemas field is empty or absent. Text2SQL requires exactly one schema entry.
Diagnostics:
# Inspect the incoming A2A message for DataParts
kubectl logs -l app=<agent-name> | grep "datasource"

# Test with a complete message
curl -X POST https://<agent-host>/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc":"2.0","method":"message/send",
    "params":{"message":{"role":"user","message_id":"m1","context_id":"c1","parts":[
      {"kind":"text","text":"count rows in orders"},
      {"kind":"data","data":{
        "type":"datasource",
        "resource_uri":"data:acme:proj:analytics-db",
        "datasource_type":"database",
        "datasource_name":"Analytics DB",
        "selected_schemas":[{"schema_name":"public","schema_fqn":"db.db.public"}]
      }}
    ]}},"id":"1"
  }'
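The three causes above can be checked before a message is sent. The following is a sketch of a client-side pre-flight validator, not the agent's actual validation logic: the function names are hypothetical, and it assumes the name segment of resource_uri contains no colons.

```python
from typing import Any


def validate_resource_uri(uri: str) -> bool:
    """True if uri has the four-segment form data:{org_id}:{project_id}:{name}."""
    parts = uri.split(":")
    return len(parts) == 4 and parts[0] == "data" and all(parts[1:])


def check_datasource_message(message: dict[str, Any]) -> list[str]:
    """Return a list of problems found in an A2A message dict (empty if none)."""
    problems: list[str] = []
    parts = message.get("parts", [])
    if not any(p.get("kind") == "text" for p in parts):
        problems.append("missing TextPart")
    datasources = [
        p["data"]
        for p in parts
        if p.get("kind") == "data" and p.get("data", {}).get("type") == "datasource"
    ]
    if not datasources:
        problems.append("missing datasource DataPart")
        return problems
    for ds in datasources:
        if not validate_resource_uri(ds.get("resource_uri", "")):
            problems.append("resource_uri is not data:{org_id}:{project_id}:{name}")
        if len(ds.get("selected_schemas") or []) != 1:
            problems.append("selected_schemas must contain exactly one entry")
    return problems


if __name__ == "__main__":
    bad = {"parts": [{"kind": "text", "text": "count rows in orders"},
                     {"kind": "data", "data": {"type": "datasource",
                                               "resource_uri": "data:my-db"}}]}
    print(check_datasource_message(bad))
```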

Prompt Iteration

The fastest way to improve agent quality is iterating on the system prompt. Use ADK Web or direct A2A calls to test prompt changes without redeploying.

ADK Web — Interactive Replay (Google ADK)

# Start your agent in development mode
uvicorn my_agent.agent:a2a_app --host 0.0.0.0 --port 8003 --reload

# Open the ADK Web UI in a second terminal
adk web packages/my_agent

# Navigate to http://localhost:8000 and replay test conversations
ADK Web shows each tool call, its arguments, and the LLM’s reasoning before and after. Use it to observe exactly how the prompt influences routing and tool selection.

Claude Agent SDK — Prompt Replay

For Claude-based agents, replay prompt variations using the Anthropic SDK directly without running the full A2A server:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="[paste your agent's system prompt here]",
    messages=[{"role": "user", "content": "[failing input]"}],
    tools=[...],  # paste your tool definitions
)
print(response.content)
Trace inspection: look for anthropic.messages.create spans in Langfuse. Tool use blocks appear as child spans with input and output fields.

LangGraph — Debug with debug=True

LangGraph’s astream supports verbose debug output and LangSmith/Langfuse tracing:
# Enable verbose node-level output during development.
# Run this inside an async function; graph is your compiled LangGraph graph.
async for step in graph.astream(
    {"messages": [{"role": "user", "content": "..."}]},
    stream_mode="updates",
    debug=True,
):
    for node_name, output in step.items():
        print(f"[{node_name}]", output)
For production tracing, pass the langfuse_handler callback (see Eval Harness) — spans appear as langgraph:node:<name> entries in Langfuse.

Minimal Repro with Direct curl

For non-ADK agents, replay failing conversations directly:
# Capture a failing trace from Langfuse (input field)
# Replay it directly against the agent
curl -X POST http://localhost:8003/ \
  -H "Content-Type: application/json" \
  -H "x-user-id: debug-user" \
  -d @failing_trace_input.json | python3 -m json.tool

Prompt Change Checklist

When changing the system prompt:
  1. Identify the specific behaviour to change (use a Langfuse trace as evidence)
  2. Write a test case that captures the failure
  3. Make the minimum prompt change needed to fix the test case
  4. Run the full regression suite to confirm the change breaks nothing else
  5. Re-check token usage — prompt changes can inflate or deflate input token cost
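Steps 2 to 4 of the checklist can be sketched as a tiny regression harness. Everything here is illustrative: PromptCase, run_cases, and the stub agent are hypothetical names, and substring matching is a deliberately crude pass criterion that a real eval suite would likely replace.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptCase:
    name: str
    user_input: str
    must_contain: str  # substring the reply must include to pass


def run_cases(agent: Callable[[str], str], cases: list[PromptCase]) -> list[str]:
    """Return the names of the cases whose reply lacks the expected substring."""
    return [
        case.name
        for case in cases
        if case.must_contain.lower() not in agent(case.user_input).lower()
    ]


if __name__ == "__main__":
    # Stub standing in for a call to your agent with the candidate prompt.
    def stub_agent(text: str) -> str:
        return "SELECT COUNT(*) FROM orders" if "count" in text else "I cannot help."

    cases = [
        PromptCase("count-rows", "count rows in orders", "count(*)"),
        PromptCase("list-tables", "list all tables", "information_schema"),
    ]
    print(run_cases(stub_agent, cases))
```

Keeping the failing case in the suite after the fix is what turns a one-off prompt tweak into a regression guard.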

Structured Logging for Correlation

All CRAFT agents log structured JSON with task_id and context_id. Use these to correlate logs with Langfuse traces:
import logging

logger = logging.getLogger(__name__)

logger.info(
    "Tool call completed",
    extra={
        "tool_name": name,
        "task_id": task_id,
        "context_id": context_id,
        "duration_s": duration,
        "status": "success",
    },
)
# Find the Langfuse trace for the same task
# Search by task_id in Langfuse UI filter: trace.metadata.task_id = "task-uuid-here"
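The standard library logger does not emit fields passed through extra unless a formatter includes them. The following is a minimal sketch of a JSON formatter that does, assuming the field names from the example above; the logger name and helper names are hypothetical, and CRAFT's own logging setup may differ.

```python
import io
import json
import logging

# Fields copied from the logging example above.
CORRELATION_FIELDS = ("tool_name", "task_id", "context_id", "duration_s", "status")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in CORRELATION_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("craft.debug_example")  # hypothetical name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = make_logger(buffer)
    log.info("Tool call completed", extra={"task_id": "t1", "context_id": "c1"})
    print(buffer.getvalue().strip())
```

With one JSON object per line, a filter on task_id is enough to line up a log entry with its Langfuse trace.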

Next Steps

Eval Harness

Build regression suites to catch issues before they reach production.

Langfuse Setup

Configure Langfuse tracing for your agent.