Documentation Index
Fetch the complete documentation index at: https://docs.emergence.ai/llms.txt
Use this file to discover all available pages before exploring further.
LLM Gateway
The CRAFT platform routes all agent LLM traffic through a LiteLLM sidecar gateway deployed per agent workload. This page describes how the gateway works, how to configure it, and how to keep it observable and secure.This page is written for platform operators and solution-team leads who need to configure or troubleshoot the gateway layer. If you are a solution developer looking for how to call an LLM from your solution code, see Access LLMs instead.
Architecture
Sidecar pattern (ADR Option 3)
The LiteLLM gateway is deployed as a sidecar container inside the agent pod, not as a shared cluster-level proxy. This is the ADR Option 3 “safety controls inside em-runtime” design. Each agent workload carries its own gateway instance. Consequences of this design:- Provider API keys are mounted only into the LiteLLM container, never into the agent container. The agent sees only the gateway’s local
localhostURL and a gateway-scoped API key. - Rate limits and model allowlists are enforced per-agent-identity, in process with the agent’s request.
- A gateway crash affects only that agent’s pod, not the cluster.
How requests flow
em-runtime-mcp as agent tool gateway
em-runtime-mcp is the tool-call gateway — the single endpoint through which every agent invokes a platform tool. LiteLLM handles LLM traffic; em-runtime-mcp handles tool traffic. Every tool call passes through a per-agent allowlist check and is recorded in the audit-event tables.
The two gateways are complementary:
| Gateway | Handles | Enforces |
|---|---|---|
| LiteLLM sidecar | LLM completions | Model allowlist, rate limits, cost attribution |
em-runtime-mcp | Platform tool calls | Per-agent tool allowlist, audit events |
Model registry
Allowlist configuration
The recommended LiteLLM standard config defines the models an agent is permitted to call. The allowlist is enforced at the sidecar level — requests for unlisted models return400 Model not allowed.
litellm.acompletion(model="gpt-4o-mini", ...) using the model_name alias, not the provider path. This makes switching providers a config change, not a code change.
Provider routing
Provider selection is driven by themodel_name prefix in litellm_params.model:
| Prefix | Provider |
|---|---|
openai/ | OpenAI |
anthropic/ | Anthropic |
gemini/ | Google Gemini |
ollama/ | Local Ollama instance |
vllm/ | vLLM endpoint |
azure/ | Azure OpenAI |
model_list whose model_name matches the requested alias. To add fallbacks, list multiple entries with the same model_name:
Authentication
Gateway API key
The LiteLLM sidecar enforces amaster_key. The agent container reads this key from a Kubernetes Secret via envVars.valueFrom.secretKeyRef — the key is never hardcoded.
http://localhost:4000/v1) and the gateway master key. It never sees the upstream provider key.
Per-project key isolation
Each project’s agents receive a distinctmaster_key. The platform provisions these keys via its configured secrets backend; key rotation triggers an automatic pod restart that propagates the new key without downtime.
Budget enforcement
Rate limits
Rate limits are configured in the LiteLLM config underrouter_settings:
429 Too Many Requests. Clients should back off exponentially.
Cost ceilings
Monthly spend ceilings are set per model alias. When the ceiling is reached, the gateway blocks further requests for that alias until the budget resets:Overage behavior
When a rate limit or budget ceiling is hit:- The gateway returns
429with aRetry-Afterheader. - The agent should catch
429and apply exponential back-off with jitter. - If the model alias has a fallback entry in
model_list, the router tries the fallback automatically. - If no fallback is available and the budget is exhausted, the
429propagates to the caller.
Observability
Langfuse traces
LiteLLM auto-emits traces to Langfuse when the following env vars are set in the sidecar container:metadata the agent passes: project_id, solution, trace_id. These tags are the basis for per-project cost attribution dashboards.
Prometheus and OpenTelemetry metrics
LiteLLM exposes a/metrics endpoint (Prometheus format) on port 4000. The platform OTEL Collector scrapes it and forwards to your observability backend.
Key metrics:
| Metric | Description |
|---|---|
litellm_requests_total | Total completion requests by model and status |
litellm_tokens_total | Total tokens consumed by model |
litellm_request_duration_seconds | Latency histogram |
litellm_spend_usd | Cumulative spend by model alias |
Provider configuration
OpenAI
OpenAI
OPENAI_API_KEY from a Kubernetes Secret into the LiteLLM sidecar container only.Anthropic
Anthropic
ANTHROPIC_API_KEY from a Kubernetes Secret into the LiteLLM sidecar container only.Google Gemini
Google Gemini
GEMINI_API_KEY from a Kubernetes Secret. Alternatively, use Workload Identity if your cluster supports it.Self-hosted — Ollama
Self-hosted — Ollama
Self-hosted — vLLM
Self-hosted — vLLM
openai/ prefix.Adding the sidecar to an agent pod
Theem-service chart (version 0.0.15+) supports extraContainers. Add the LiteLLM sidecar as an extra container in the agent’s Helm values:
Disaster recovery — gateway unavailable
If the LiteLLM sidecar crashes or becomes unresponsive:- The agent pod continues running — the sidecar crash does not kill the main container.
- LLM calls from the agent will receive
Connection refusedonlocalhost:4000. - Kubernetes restarts the sidecar container automatically (default restart policy
Always). - If the sidecar does not recover, restart the pod:
kubectl rollout restart deployment/<agent-deployment> -n <namespace>.
{"status": "healthy"}.
If the sidecar is healthy but requests fail, check the provider keys are correctly mounted:
Next steps
Access LLMs (solution dev)
How to call the gateway from solution code using litellm.
LLM Observability
Platform-side observability: Langfuse traces, cost dashboards, model comparison.
Manage Secrets
How provider keys and gateway keys flow through the secrets pipeline.
Platform Overview
How the gateway fits into the overall platform architecture.
MCP Server
Connect Claude Code, Cursor, Goose, or an external agent to CRAFT’s tool gateway over MCP.

