LLM Observability

CRAFT uses two complementary observability systems that operate at different layers:
  • OpenTelemetry, infrastructure-level: HTTP requests, database queries, Redis, service health
  • Langfuse, LLM-level: model calls, token usage, cost, prompt quality, evaluation results
Both are valuable; neither replaces the other.

Why LLM Observability Is Different

Traditional observability (metrics, traces, logs) was designed for deterministic systems. LLM-powered applications introduce unique observability challenges:
| Challenge | Why it matters |
|---|---|
| Non-determinism | The same prompt can produce different outputs. Standard error rates don’t capture quality degradation. |
| Token economics | Cost is proportional to token usage, not request count. A single call can cost $0.001 or $1.00 depending on prompt length. |
| Prompt engineering | Changing a prompt is a configuration change, not a code change, but it can dramatically affect output quality. |
| Evaluation | “Is this response correct?” requires domain-specific evaluation, not just latency or error rate checks. |
| Multi-turn context | A user session spans multiple LLM calls. Standard tracing doesn’t capture the logical conversation flow. |
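
To make the token-economics point concrete, the following is a minimal sketch of how per-call cost scales with token counts rather than request count. The `ModelPrice` type, the `call_cost` helper, and the prices themselves are hypothetical and chosen only for illustration; they are not real provider rates and not part of CRAFT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical per-million-token prices in USD (not real provider rates)."""
    input_per_million: float
    output_per_million: float


def call_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one LLM call from its token counts."""
    total = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    )
    return total / 1_000_000


# Illustrative prices only: $1 per million input tokens, $4 per million output tokens.
EXAMPLE_PRICE = ModelPrice(input_per_million=1.00, output_per_million=4.00)

# Two single requests with very different prompt sizes.
short_call = call_cost(EXAMPLE_PRICE, input_tokens=500, output_tokens=100)
long_call = call_cost(EXAMPLE_PRICE, input_tokens=200_000, output_tokens=4_000)
```

Both calls count as one request in a request-rate metric, yet under these assumed prices the long call costs a few hundred times more than the short one, which is why cost has to be tracked per token.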

What Each System Captures

OpenTelemetry (Infrastructure)

Service → [HTTP/DB/Redis spans] → OTel Collector → Grafana
| Signal | Examples |
|---|---|
| Metrics | Request rate, P99 latency, error rate, pod CPU/memory |
| Traces | End-to-end request flow through services |
| Logs | Service logs, error messages, audit events |
Answers: “Is the service up? Is it slow? Are there errors?”

Langfuse (LLM)

LLM call → [LiteLLM callback] → Langfuse
| Signal | Examples |
|---|---|
| Traces | Complete LLM session with all turns, tools, and context |
| Metrics | Tokens used, estimated cost, latency per model call |
| Evaluations | LLM-as-a-judge quality scores, human labels, rubric results |
| Prompts | Version history, usage statistics, A/B comparison |
Answers: “Is the AI producing quality output? What’s it costing? Which prompt version is better?”

Architecture

Integration Pattern

The platform uses LiteLLM as a provider-agnostic LLM proxy. LiteLLM natively supports Langfuse as a callback handler, requiring no changes to application code. When LANGFUSE_HOST is set, LiteLLM automatically:
  1. Records each LLM API call to Langfuse (prompt, completion, model, tokens, cost)
  2. Groups calls into sessions by conversation ID
  3. Reports evaluation scores if evaluators are configured
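
For a standalone script outside the platform, the same callback wiring might be sketched as below. This is an illustration under assumptions, not the platform's actual configuration: the helper names `langfuse_callbacks` and `configure_litellm` are hypothetical, and the check for `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` alongside `LANGFUSE_HOST` is an assumption based on the credentials the Langfuse client normally expects. The `litellm.success_callback` and `litellm.failure_callback` attributes are LiteLLM's own callback settings.

```python
import os
from typing import Mapping

# Assumption: the Langfuse client needs API keys in addition to the host.
REQUIRED_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def langfuse_callbacks(env: Mapping[str, str]) -> list[str]:
    """Return the LiteLLM callback names to enable for this environment."""
    if all(env.get(name) for name in REQUIRED_VARS):
        return ["langfuse"]
    return []


def configure_litellm(env: Mapping[str, str] = os.environ) -> list[str]:
    """Enable the Langfuse callback in LiteLLM when the environment allows it."""
    callbacks = langfuse_callbacks(env)
    try:
        import litellm  # optional dependency; skipped if not installed
    except ImportError:
        return callbacks
    # Log both successful and failed LLM calls to Langfuse.
    litellm.success_callback = callbacks
    litellm.failure_callback = callbacks
    return callbacks
```

Calling `configure_litellm()` once at startup keeps the application code free of Langfuse-specific logic; when the variables are absent the callback list is empty and nothing is sent.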

Langfuse Trace Anatomy

A Langfuse trace for a Data Insights session might look like:
Session: "user-query-schema-explain"
  └─ Trace: /api/chat (3.2s)
       ├─ Span: schema_lookup (0.3s) [DB query]
       ├─ Span: LLM call (2.5s)
       │    Model: gemini-2.0-flash
       │    Input tokens: 1,240
       │    Output tokens: 380
       │    Cost: $0.0018
       │    Quality score: 0.87 (LLM-as-judge)
       └─ Span: format_response (0.4s)
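
The trace above can be modeled as a small data structure to show how the per-span numbers roll up. This is a sketch only: the `Span` and `Trace` classes are hypothetical and are not the Langfuse SDK's types, and summing span durations assumes the spans run sequentially, as they appear to in this example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside a trace (hypothetical model, not the Langfuse SDK type)."""
    name: str
    seconds: float
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0


@dataclass
class Trace:
    """A single request made up of ordered spans."""
    name: str
    spans: list[Span] = field(default_factory=list)

    def total_seconds(self) -> float:
        # Assumes sequential spans; overlapping spans would need start/end times.
        return sum(span.seconds for span in self.spans)

    def total_tokens(self) -> int:
        return sum(span.input_tokens + span.output_tokens for span in self.spans)

    def total_cost(self) -> float:
        return sum(span.cost_usd for span in self.spans)


# Values copied from the example trace above.
chat_trace = Trace(
    name="/api/chat",
    spans=[
        Span("schema_lookup", seconds=0.3),
        Span("LLM call", seconds=2.5, input_tokens=1240,
             output_tokens=380, cost_usd=0.0018),
        Span("format_response", seconds=0.4),
    ],
)
```

With these values the span durations add up to the 3.2 s shown for the trace, and the LLM call accounts for all of the tokens and cost, which is the part that infrastructure tracing alone would not record.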

Cost Tracking

Langfuse aggregates LLM costs across all calls, enabling:
  • Cost per conversation / session / user
  • Cost trends over time
  • Model comparison (cost vs. quality tradeoff)
  • Budget alerting (configurable thresholds)
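
The arithmetic behind these views is straightforward. The sketch below shows one way to group call costs by session or user and flag totals above a threshold. It illustrates the idea only: the `CallRecord` type, the `cost_by` and `over_budget` helpers, and the sample values are hypothetical, and Langfuse performs this aggregation itself rather than through code like this.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CallRecord(NamedTuple):
    """One recorded LLM call (hypothetical shape for illustration)."""
    session_id: str
    user_id: str
    model: str
    cost_usd: float


def cost_by(records: Iterable[CallRecord], key: str) -> dict[str, float]:
    """Sum call costs grouped by a CallRecord field such as 'user_id'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


def over_budget(totals: dict[str, float], threshold_usd: float) -> list[str]:
    """Return the group names whose total cost exceeds the threshold."""
    return sorted(name for name, cost in totals.items() if cost > threshold_usd)


# Sample data, invented for this example.
records = [
    CallRecord("s1", "alice", "gemini-2.0-flash", 0.0018),
    CallRecord("s1", "alice", "gemini-2.0-flash", 0.0022),
    CallRecord("s2", "bob", "gemini-2.0-flash", 0.0100),
]

per_user = cost_by(records, "user_id")
per_session = cost_by(records, "session_id")
alerts = over_budget(per_user, threshold_usd=0.005)
```

Grouping by a different field (`"model"`, for example) gives the model comparison view from the same records.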

Related pages:
  • Langfuse: Deploy and configure Langfuse.
  • OpenTelemetry: Infrastructure observability with OTel.