LLM Gateway

The CRAFT platform routes all agent LLM traffic through a LiteLLM sidecar gateway deployed per agent workload. This page describes how the gateway works, how to configure it, and how to keep it observable and secure.

This page is written for platform operators and solution-team leads who need to configure or troubleshoot the gateway layer. If you are a solution developer looking for how to call an LLM from your solution code, see Access LLMs instead.

Architecture

Sidecar pattern (ADR Option 3)

The LiteLLM gateway is deployed as a sidecar container inside the agent pod, not as a shared cluster-level proxy. This is the ADR Option 3 “safety controls inside em-runtime” design. Each agent workload carries its own gateway instance. Consequences of this design:

Provider API keys are mounted only into the LiteLLM container, never into the agent container. The agent sees only the gateway’s local localhost URL and a gateway-scoped API key.
Rate limits and model allowlists are enforced per-agent-identity, in process with the agent’s request.
A gateway crash affects only that agent’s pod, not the cluster.

How requests flow

em-runtime-mcp as agent tool gateway

em-runtime-mcp is the tool-call gateway — the single endpoint through which every agent invokes a platform tool. LiteLLM handles LLM traffic; em-runtime-mcp handles tool traffic. Every tool call passes through a per-agent allowlist check and is recorded in the audit-event tables. The two gateways are complementary:

Gateway	Handles	Enforces
LiteLLM sidecar	LLM completions	Model allowlist, rate limits, cost attribution
`em-runtime-mcp`	Platform tool calls	Per-agent tool allowlist, audit events

Model registry

Allowlist configuration

The recommended LiteLLM standard config defines the models an agent is permitted to call. The allowlist is enforced at the sidecar level — requests for unlisted models return 400 Model not allowed.

# LiteLLM sidecar config (mounted as a ConfigMap)
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-3-5-sonnet-20241022
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gemini-1.5-pro
    litellm_params:
      model: gemini/gemini-1.5-pro
      api_key: os.environ/GEMINI_API_KEY

  # Self-hosted (no key required)
  - model_name: ollama/llama3.1
    litellm_params:
      model: ollama/llama3.1
      api_base: http://ollama.ollama.svc.cluster.local:11434

litellm_settings:
  drop_params: true       # tolerate extra params from different SDKs
  request_timeout: 120    # seconds

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

The agent calls litellm.acompletion(model="gpt-4o-mini", ...) using the model_name alias, not the provider path. This makes switching providers a config change, not a code change.

Provider routing

Provider selection is driven by the model_name prefix in litellm_params.model:

Prefix	Provider
`openai/`	OpenAI
`anthropic/`	Anthropic
`gemini/`	Google Gemini
`ollama/`	Local Ollama instance
`vllm/`	vLLM endpoint
`azure/`	Azure OpenAI

The router picks the first entry in model_list whose model_name matches the requested alias. To add fallbacks, list multiple entries with the same model_name:

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o-mini          # fallback
    litellm_params:
      model: azure/gpt-4o-mini
      api_key: os.environ/AZURE_API_KEY
      api_base: https://your-resource.openai.azure.com

Authentication

Gateway API key

The LiteLLM sidecar enforces a master_key. The agent container reads this key from a Kubernetes Secret via envVars.valueFrom.secretKeyRef — the key is never hardcoded.

# em-service extraContainers entry (snippet)
extraContainers:
  - name: litellm
    image: ghcr.io/berriai/litellm:main-latest
    env:
      - name: LITELLM_MASTER_KEY
        valueFrom:
          secretKeyRef:
            name: litellm-secrets
            key: master-key
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: provider-secrets
            key: openai-api-key

The agent container receives only the gateway URL (http://localhost:4000/v1) and the gateway master key. It never sees the upstream provider key.

Per-project key isolation

Each project’s agents receive a distinct master_key. The platform provisions these keys via its configured secrets backend; key rotation triggers an automatic pod restart that propagates the new key without downtime.

Budget enforcement

Rate limits

Rate limits are configured in the LiteLLM config under router_settings:

router_settings:
  num_retries: 3
  timeout: 120
  retry_after: 5             # seconds between retries

# Per-model rate limits (tokens per minute, requests per minute)
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      tpm: 100000             # tokens per minute
      rpm: 500                # requests per minute

When a limit is exceeded, the gateway returns 429 Too Many Requests. Clients should back off exponentially.

Cost ceilings

Monthly spend ceilings are set per model alias. When the ceiling is reached, the gateway blocks further requests for that alias until the budget resets:

litellm_settings:
  max_budget: 50.00          # USD, per alias
  budget_duration: monthly

Overage behavior

When a rate limit or budget ceiling is hit:

The gateway returns 429 with a Retry-After header.
The agent should catch 429 and apply exponential back-off with jitter.
If the model alias has a fallback entry in model_list, the router tries the fallback automatically.
If no fallback is available and the budget is exhausted, the 429 propagates to the caller.

Do not remove the fallbacks list from the router config without first confirming that the upstream has headroom. A saturated primary with no fallback causes complete LLM outage for that agent.

Observability

Langfuse traces

LiteLLM auto-emits traces to Langfuse when the following env vars are set in the sidecar container:

LANGFUSE_HOST=https://langfuse.your-cluster.example.com
LANGFUSE_PUBLIC_KEY=<project-public-key>
LANGFUSE_SECRET_KEY=<project-secret-key>

Enable in the LiteLLM config:

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

Every completion call generates a Langfuse trace tagged with the metadata the agent passes: project_id, solution, trace_id. These tags are the basis for per-project cost attribution dashboards.

Prometheus and OpenTelemetry metrics

LiteLLM exposes a /metrics endpoint (Prometheus format) on port 4000. The platform OTEL Collector scrapes it and forwards to your observability backend. Key metrics:

Metric	Description
`litellm_requests_total`	Total completion requests by model and status
`litellm_tokens_total`	Total tokens consumed by model
`litellm_request_duration_seconds`	Latency histogram
`litellm_spend_usd`	Cumulative spend by model alias

To enable OTLP export from LiteLLM:

litellm_settings:
  service_callback: ["otel"]

environment_variables:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
  OTEL_SERVICE_NAME: "litellm-sidecar"

Provider configuration

OpenAI

- model_name: gpt-4o
  litellm_params:
    model: openai/gpt-4o
    api_key: os.environ/OPENAI_API_KEY

Mount OPENAI_API_KEY from a Kubernetes Secret into the LiteLLM sidecar container only.

Anthropic

- model_name: claude-3-5-sonnet-20241022
  litellm_params:
    model: anthropic/claude-3-5-sonnet-20241022
    api_key: os.environ/ANTHROPIC_API_KEY

Mount ANTHROPIC_API_KEY from a Kubernetes Secret into the LiteLLM sidecar container only.

Google Gemini

- model_name: gemini-1.5-pro
  litellm_params:
    model: gemini/gemini-1.5-pro
    api_key: os.environ/GEMINI_API_KEY

Mount GEMINI_API_KEY from a Kubernetes Secret. Alternatively, use Workload Identity if your cluster supports it.

Self-hosted — Ollama

- model_name: llama3.1
  litellm_params:
    model: ollama/llama3.1
    api_base: http://ollama.ollama.svc.cluster.local:11434

No API key required. The Ollama service must be reachable from the agent namespace.

Self-hosted — vLLM

- model_name: llama3.1-70b
  litellm_params:
    model: openai/llama3.1-70b
    api_base: http://vllm.vllm.svc.cluster.local:8000/v1
    api_key: os.environ/VLLM_API_KEY   # if vLLM auth is enabled

vLLM exposes an OpenAI-compatible API, so use the openai/ prefix.

Adding the sidecar to an agent pod

The em-service chart (version 0.0.15+) supports extraContainers. Add the LiteLLM sidecar as an extra container in the agent’s Helm values:

# charts/your-agent/values.yaml
extraContainers:
  - name: litellm
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - containerPort: 4000
    env:
      - name: LITELLM_MASTER_KEY
        valueFrom:
          secretKeyRef:
            name: litellm-secrets
            key: master-key
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: provider-secrets
            key: openai-api-key
    volumeMounts:
      - name: litellm-config
        mountPath: /app/config.yaml
        subPath: config.yaml

extraVolumes:
  - name: litellm-config
    configMap:
      name: litellm-config

The agent container references the gateway via the loopback address:

env:
  - name: LLM_GATEWAY_URL
    value: "http://localhost:4000/v1"
  - name: LLM_GATEWAY_API_KEY
    valueFrom:
      secretKeyRef:
        name: litellm-secrets
        key: master-key

Disaster recovery — gateway unavailable

If the LiteLLM sidecar crashes or becomes unresponsive:

The agent pod continues running — the sidecar crash does not kill the main container.
LLM calls from the agent will receive Connection refused on localhost:4000.
Kubernetes restarts the sidecar container automatically (default restart policy Always).
If the sidecar does not recover, restart the pod: kubectl rollout restart deployment/<agent-deployment> -n <namespace>.

To verify the sidecar is healthy inside a running pod:

kubectl exec -n <namespace> deployment/<agent-deployment> -c litellm -- \
  curl -s http://localhost:4000/health | jq .

A healthy response returns {"status": "healthy"}. If the sidecar is healthy but requests fail, check the provider keys are correctly mounted:

kubectl exec -n <namespace> deployment/<agent-deployment> -c litellm -- \
  env | grep -E '(OPENAI|ANTHROPIC|GEMINI)_API_KEY' | sed 's/=.*/=<redacted>/'

Next steps

Access LLMs (solution dev)

How to call the gateway from solution code using litellm.

LLM Observability

Platform-side observability: Langfuse traces, cost dashboards, model comparison.

Manage Secrets

How provider keys and gateway keys flow through the secrets pipeline.

Platform Overview

How the gateway fits into the overall platform architecture.

MCP Server

Connect Claude Code, Cursor, Goose, or an external agent to CRAFT’s tool gateway over MCP.

Platform

Documentation Index

​LLM Gateway

​Architecture

​Sidecar pattern (ADR Option 3)

​How requests flow

​em-runtime-mcp as agent tool gateway

​Model registry

​Allowlist configuration

​Provider routing

​Authentication

​Gateway API key

​Per-project key isolation

​Budget enforcement

​Rate limits

​Cost ceilings

​Overage behavior

​Observability

​Langfuse traces

​Prometheus and OpenTelemetry metrics

​Provider configuration

​Adding the sidecar to an agent pod

​Disaster recovery — gateway unavailable

​Next steps

Access LLMs (solution dev)

LLM Observability

Manage Secrets

Platform Overview

MCP Server

LLM Gateway

Architecture

Sidecar pattern (ADR Option 3)

How requests flow

em-runtime-mcp as agent tool gateway

Model registry

Allowlist configuration

Provider routing

Authentication

Gateway API key

Per-project key isolation

Budget enforcement

Rate limits

Cost ceilings

Overage behavior

Observability

Langfuse traces

Prometheus and OpenTelemetry metrics

Provider configuration

Adding the sidecar to an agent pod

Disaster recovery — gateway unavailable

Next steps