LLM Observability

CRAFT uses two complementary observability systems that operate at different layers:

OpenTelemetry, infrastructure-level: HTTP requests, database queries, Redis, service health
Langfuse, LLM-level: model calls, token usage, cost, prompt quality, evaluation results

Both are valuable; neither replaces the other.

Why LLM Observability Is Different

Traditional observability (metrics, traces, logs) was designed for deterministic systems. LLM-powered applications introduce unique observability challenges:

Challenge	Why it matters
Non-determinism	The same prompt can produce different outputs. Standard error rates don’t capture quality degradation.
Token economics	Cost is proportional to token usage, not request count. A single call can cost $0.001 or $1.00 depending on prompt length.
Prompt engineering	Changing a prompt is a configuration change, not a code change, but it can dramatically affect output quality.
Evaluation	“Is this response correct?” requires domain-specific evaluation, not just latency or error rate checks.
Multi-turn context	A user session spans multiple LLM calls. Standard tracing doesn’t capture the logical conversation flow.
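
The token economics row can be made concrete with a little arithmetic. The following is a minimal sketch, assuming hypothetical per-million-token prices (`EXAMPLE_PRICE` is illustrative and not any provider's actual rate); `ModelPrice` and `call_cost` are names introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def call_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single LLM call."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


# Assumed example rates, for illustration only.
EXAMPLE_PRICE = ModelPrice(input_per_million=1.00, output_per_million=4.00)

# Two calls that each count as "one request" in a request-rate metric.
short_call = call_cost(EXAMPLE_PRICE, input_tokens=500, output_tokens=125)
long_call = call_cost(EXAMPLE_PRICE, input_tokens=200_000, output_tokens=2_000)
```

At these assumed rates the short call comes to $0.001 and the long one to about $0.21, a roughly 200x spread that a request counter would not show.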

What Each System Captures

OpenTelemetry (Infrastructure)

Service → [HTTP/DB/Redis spans] → OTel Collector → Grafana

Signal	Examples
Metrics	Request rate, P99 latency, error rate, pod CPU/memory
Traces	End-to-end request flow through services
Logs	Service logs, error messages, audit events

Answers: “Is the service up? Is it slow? Are there errors?”

Langfuse (LLM)

LLM call → [LiteLLM callback] → Langfuse

Signal	Examples
Traces	Complete LLM session with all turns, tools, and context
Metrics	Tokens used, estimated cost, latency per model call
Evaluations	LLM-as-a-judge quality scores, human labels, rubric results
Prompts	Version history, usage statistics, A/B comparison

Answers: “Is the AI producing quality output? What’s it costing? Which prompt version is better?”

Architecture

Integration Pattern

The platform uses LiteLLM as a provider-agnostic LLM proxy. LiteLLM natively supports Langfuse as a callback handler, requiring no changes to application code. When LANGFUSE_HOST is set, LiteLLM automatically:

Records each LLM API call to Langfuse (prompt, completion, model, tokens, cost)
Groups calls into sessions by conversation ID
Reports evaluation scores if evaluators are configured
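
The conditional behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's actual wiring: `langfuse_callbacks` and `call_metadata` are hypothetical helpers, and the commented lines show how the LiteLLM Python SDK is typically configured (`litellm.success_callback`, plus a `session_id` entry in the call metadata), which may differ from how the proxy is set up here.

```python
import os
from typing import Dict, List, Mapping


def langfuse_callbacks(env: Mapping[str, str]) -> List[str]:
    """Return the callback names to enable: Langfuse only if LANGFUSE_HOST is set."""
    return ["langfuse"] if env.get("LANGFUSE_HOST") else []


def call_metadata(conversation_id: str) -> Dict[str, str]:
    """Build per-call metadata so calls can be grouped by conversation."""
    return {"session_id": conversation_id}


callbacks = langfuse_callbacks(os.environ)

# With the LiteLLM Python SDK, this would be applied roughly as follows
# (assumed usage, left commented so the sketch runs without litellm installed):
#   import litellm
#   litellm.success_callback = callbacks
#   litellm.completion(model=..., messages=..., metadata=call_metadata("conv-123"))
```

Keeping the decision in one place means application code never needs to know whether Langfuse is enabled.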

Langfuse Trace Anatomy

A Langfuse trace for a Data Insights session might look like:

Session: "user-query-schema-explain"
  └─ Trace: /api/chat (3.2s)
       ├─ Span: schema_lookup (0.3s) [DB query]
       ├─ Span: LLM call (2.5s)
       │    Model: gemini-2.0-flash
       │    Input tokens: 1,240
       │    Output tokens: 380
       │    Cost: $0.0018
       │    Quality score: 0.87 (LLM-as-judge)
       └─ Span: format_response (0.4s)
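
To make the hierarchy explicit, the trace above can be modelled as plain data. This is a sketch only: `Span` and `Trace` are hypothetical classes for illustration, not Langfuse SDK types, and the values are copied from the example trace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    seconds: float
    model: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    quality_score: Optional[float] = None


@dataclass
class Trace:
    name: str
    seconds: float
    spans: List[Span] = field(default_factory=list)

    def span_seconds(self) -> float:
        """Sum of child span durations."""
        return sum(s.seconds for s in self.spans)

    def total_tokens(self) -> int:
        """Input plus output tokens across all spans."""
        return sum(s.input_tokens + s.output_tokens for s in self.spans)

    def total_cost(self) -> float:
        """Estimated USD cost across all spans."""
        return sum(s.cost_usd for s in self.spans)


trace = Trace(
    name="/api/chat",
    seconds=3.2,
    spans=[
        Span("schema_lookup", 0.3),
        Span("LLM call", 2.5, model="gemini-2.0-flash", input_tokens=1240,
             output_tokens=380, cost_usd=0.0018, quality_score=0.87),
        Span("format_response", 0.4),
    ],
)
```

In this example the three spans account for the full 3.2s of the trace, and only the LLM call span carries tokens, cost, and a quality score.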

Cost Tracking

Langfuse aggregates LLM costs across all calls, enabling:

Cost per conversation / session / user
Cost trends over time
Model comparison (cost vs. quality tradeoff)
Budget alerting (configurable thresholds)
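
The aggregation and alerting ideas above amount to a group-by and a threshold check. The sketch below assumes call records have already been exported as `(session_id, user_id, cost_usd)` tuples; the record shape, the sample values, and the helper names `cost_by` and `over_budget` are all hypothetical and do not reflect the Langfuse API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record shape: (session_id, user_id, cost_usd)
CallRecord = Tuple[str, str, float]


def cost_by(records: Iterable[CallRecord], key_index: int) -> Dict[str, float]:
    """Sum cost_usd grouped by the field at key_index (0 = session, 1 = user)."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)


def over_budget(totals: Dict[str, float], threshold_usd: float) -> List[str]:
    """Return the keys whose total cost exceeds the threshold, sorted by name."""
    return sorted(key for key, total in totals.items() if total > threshold_usd)


# Made-up sample data for illustration.
records: List[CallRecord] = [
    ("s1", "alice", 0.0018),
    ("s1", "alice", 0.0042),
    ("s2", "bob", 0.0100),
]

per_session = cost_by(records, key_index=0)
per_user = cost_by(records, key_index=1)
alerts = over_budget(per_user, threshold_usd=0.008)
```

With this sample data, only `bob` exceeds the assumed $0.008 threshold; the same two functions cover per-session and per-user views by changing `key_index`.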

Related

Langfuse: Deploy and configure Langfuse.
OpenTelemetry: Infrastructure observability with OTel.
