LiteLLM Proxy

OpenAI-compatible API gateway with unified model access, virtual keys, and Langfuse observability

Status: Live
Category: AI - LLM Gateway
Maintainer: Haiven Infrastructure Team
Last Updated: 2025-12-22


Overview

LiteLLM Proxy is an OpenAI-compatible API gateway that sits in front of llama-swap and other AI services, providing:

- A single OpenAI-compatible API for all LLM, TTS, and STT backends
- Virtual API keys with per-user budgets and expiry
- Usage and spend tracking in PostgreSQL
- Request tracing and observability via Langfuse
- Pass-through routes to Piper TTS, StyleTTS2, and faster-whisper

Access Points

| Purpose | URL | Notes |
| --- | --- | --- |
| API Endpoint | https://llm.haiven.local/v1 | OpenAI-compatible API |
| Admin UI | https://litellm.haiven.local/ui | Key management, usage dashboard |
| Health Check | https://llm.haiven.local/health | Service health status |
| Metrics | https://llm.haiven.local/metrics | Prometheus metrics |
| Internal API | http://litellm:4000/v1 | Docker network access |
| TTS Pass-through | https://llm.haiven.local/tts/v1/audio/speech | Direct Piper TTS access |
| StyleTTS2 Pass-through | https://llm.haiven.local/styletts2/v1/audio/speech | Direct StyleTTS2 access |
| STT Pass-through | https://llm.haiven.local/stt/v1/audio/transcriptions | Direct Whisper access |

Quick Start

1. Test API Access

# Health check
curl -s https://llm.haiven.local/health

# List available models
curl -s https://llm.haiven.local/v1/models | jq '.data[].id'

# Chat completion (using master key)
curl -s https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }' | jq '.choices[0].message.content'

2. Access Admin UI

Navigate to https://litellm.haiven.local/ui and log in with the master key.

3. Create Virtual Keys

# Create a new API key for a user
curl -s https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4", "gpt-3.5-turbo"],
    "user_id": "user@example.com",
    "max_budget": 100.00,
    "duration": "30d"
  }' | jq

Architecture

                                    +-------------------+
                                    |    Langfuse       |
                                    | (Observability)   |
                                    +--------^----------+
                                             |
                                             | callbacks
                                             |
Client Request                               |
      |                                      |
      v                                      |
+------------+        +------------+    +----+-------+    +------------+
|  Traefik   | -----> |  LiteLLM   | -> | llama-swap | -> | GGUF Model |
| (TLS/Rate) |        |  (Proxy)   |    | (Backend)  |    | (GPU)      |
+------------+        +------------+    +------------+    +------------+
      |                     |
      |                     +---------> openedai-speech (TTS: tts-1, tts-1-hd)
      |                     |
      |                     +---------> styletts2-openai (TTS: styletts2)
      |                     |
      |                     +---------> faster-whisper (STT: whisper-1, whisper-large-v3)
      |                     |
      |                     +---------> SearXNG (search_tools)
      |                     |
      |                     v
      |              +-------------+
      |              | PostgreSQL  |
      |              | (Keys/Usage)|
      +              +-------------+
 llm.haiven.local
 litellm.haiven.local

Request Flow

  1. Client sends request to llm.haiven.local
  2. Traefik terminates TLS, applies rate limiting (100 avg / 200 burst)
  3. LiteLLM validates API key and applies model routing
  4. Request routed based on model type:
    - LLM models -> llama-swap backend -> GPU model inference
    - TTS models (tts-1, tts-1-hd) -> openedai-speech -> Piper/XTTS synthesis
    - TTS models (styletts2) -> styletts2-openai -> StyleTTS2 synthesis
    - STT models (whisper-1, whisper-large-v3) -> faster-whisper -> Audio transcription
  5. Response logged to Langfuse for observability
  6. Usage tracked in PostgreSQL database

Model Configuration

Model Mappings

LiteLLM routes requests based on model name:

| Client Model Name | Actual Model | Backend |
| --- | --- | --- |
| * (wildcard) | Same as requested | llama-swap |
| gpt-4 | qwen3-30b-a3b | llama-swap (PRO 6000) |
| gpt-4-turbo | qwen3-30b-a3b | llama-swap (PRO 6000) |
| gpt-3.5-turbo | qwen2.5-14b-instruct | llama-swap (RTX 4090) |
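
The authoritative routing lives in config.yaml (mounted at /app/config.yaml). As a rough sketch only, these mappings would typically be expressed in LiteLLM's model_list like the following; the llama-swap hostname and port (http://llama-swap:8080/v1) are assumptions, not taken from the live config:

model_list:
  # Wildcard entry: pass any requested model name straight through to llama-swap
  - model_name: "*"
    litellm_params:
      model: openai/*
      api_base: http://llama-swap:8080/v1
      api_key: none

  # OpenAI-style aliases mapped to local models
  - model_name: gpt-4
    litellm_params:
      model: openai/qwen3-30b-a3b
      api_base: http://llama-swap:8080/v1
      api_key: none

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/qwen2.5-14b-instruct
      api_base: http://llama-swap:8080/v1
      api_key: none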

Available Models

All models available in llama-swap are accessible through LiteLLM:

# List all available models
curl -s https://llm.haiven.local/v1/models | jq '.data[].id'

Common models include:
- qwen3-30b-a3b - Qwen3 30B (Q8_0, 31B params)
- qwen2.5-14b-instruct - Qwen2.5 14B Instruct
- gemma3-27b - Gemma 3 27B
- gpt-oss-120b - OpenAI OSS 120B

TTS Models (Text-to-Speech)

| Model Name | Backend | Description |
| --- | --- | --- |
| tts-1 | openedai-speech | Piper TTS - fast CPU-based synthesis |
| tts-1-hd | openedai-speech | XTTS - high quality voice cloning |
| styletts2 | styletts2-openai | StyleTTS2 - neural TTS with style transfer |

Available Voices: alloy, echo, fable, onyx, nova, shimmer

STT Models (Speech-to-Text)

| Model Name | Backend | Description |
| --- | --- | --- |
| whisper-large-v3 | faster-whisper | High accuracy GPU-accelerated transcription |
| whisper-1 | faster-whisper | OpenAI compatibility alias |

Supported Languages: 99+ languages with automatic detection
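
The TTS and STT models in the two tables above are registered the same way as the LLM aliases, pointing at the audio backends instead of llama-swap. A sketch under assumed container hostnames and ports (openedai-speech:8000, styletts2-openai:8000, faster-whisper:8000 are illustrative, not confirmed from the live config):

model_list:
  - model_name: tts-1
    litellm_params:
      model: openai/tts-1
      api_base: http://openedai-speech:8000/v1   # assumed hostname/port
      api_key: none

  - model_name: styletts2
    litellm_params:
      model: openai/styletts2
      api_base: http://styletts2-openai:8000/v1  # assumed hostname/port
      api_key: none

  - model_name: whisper-large-v3
    litellm_params:
      model: openai/whisper-large-v3
      api_base: http://faster-whisper:8000/v1    # assumed hostname/port
      api_key: none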

Search Tools

| Tool Name | Backend | Description |
| --- | --- | --- |
| searxng-search | SearXNG | Meta-search engine for LLM function calling |

Models with supports_function_calling: true can use the search tool when tools are enabled in requests.


API Endpoints

OpenAI-Compatible Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completions (streaming supported) |
| POST | /v1/completions | Text completions |
| POST | /v1/embeddings | Generate embeddings |
| POST | /v1/audio/speech | Text-to-speech synthesis |
| POST | /v1/audio/transcriptions | Speech-to-text transcription |

Pass-through Endpoints

These endpoints bypass LiteLLM routing and forward directly to backends:

| Method | Endpoint | Target |
| --- | --- | --- |
| POST | /tts/v1/audio/speech | openedai-speech (Piper/XTTS) |
| POST | /styletts2/v1/audio/speech | styletts2-openai wrapper |
| POST | /stt/v1/audio/transcriptions | faster-whisper |
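
LiteLLM lets such routes be declared under general_settings.pass_through_endpoints in config.yaml. A sketch of what that could look like for the routes above; the backend hostnames and ports are assumptions:

general_settings:
  pass_through_endpoints:
    - path: /tts/v1/audio/speech
      target: http://openedai-speech:8000/v1/audio/speech      # assumed target
    - path: /styletts2/v1/audio/speech
      target: http://styletts2-openai:8000/v1/audio/speech     # assumed target
    - path: /stt/v1/audio/transcriptions
      target: http://faster-whisper:8000/v1/audio/transcriptions  # assumed target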

LiteLLM-Specific Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Basic health check |
| GET | /health/liveliness | Kubernetes-style liveness probe |
| GET | /health/readiness | Kubernetes-style readiness probe |
| GET | /metrics | Prometheus metrics |
| GET | /ui | Admin dashboard |
| POST | /key/generate | Create virtual API key |
| POST | /key/delete | Delete API key |
| GET | /key/info | Get key information |
| GET | /spend/logs | Get spend logs |
| GET | /user/info | Get user information |

Example: Chat Completion

curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'

Example: Streaming Response

curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'

Example: Text-to-Speech (TTS)

# Using Piper TTS (fast)
curl -X POST https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of the text-to-speech system.",
    "voice": "alloy"
  }' --output speech.mp3

# Using StyleTTS2 (high quality)
curl -X POST https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "styletts2",
    "input": "High quality neural speech synthesis.",
    "voice": "nova"
  }' --output speech_hq.wav

# Via pass-through endpoint (bypasses router)
curl -X POST https://llm.haiven.local/tts/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Direct access to Piper TTS.",
    "voice": "echo"
  }' --output speech_direct.mp3

Example: Speech-to-Text (STT)

# Transcribe audio file
curl -X POST https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3"

# Using OpenAI-compatible model name
curl -X POST https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-1"

# Via pass-through endpoint
curl -X POST https://llm.haiven.local/stt/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-1"

Example: Web Search with Function Calling

curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b-q8-abl",
    "messages": [{"role": "user", "content": "What are the latest AI news today?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "searxng-search",
        "description": "Search the web for information"
      }
    }],
    "tool_choice": "auto"
  }'

Network Configuration

Networks

| Network | Type | Purpose |
| --- | --- | --- |
| web | External | Traefik routing, public access |
| backend | External | Internal service communication |
| litellm-internal | Bridge | Stack internal (LiteLLM <-> PostgreSQL) |
| langfuse-internal | External | Connection to Langfuse for tracing |
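
In docker-compose.yml terms, that table corresponds roughly to the following networks block (a sketch; the external network names are assumed to match those shown above):

networks:
  web:
    external: true        # Traefik routing, public access
  backend:
    external: true        # shared internal service communication
  litellm-internal:
    driver: bridge        # LiteLLM <-> PostgreSQL only
  langfuse-internal:
    external: true        # tracing traffic to Langfuse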

Traefik Labels

# Primary API endpoint (llm.haiven.local)
- "traefik.http.routers.litellm.rule=Host(`llm.haiven.local`)"
- "traefik.http.routers.litellm.entrypoints=websecure"
- "traefik.http.routers.litellm.tls=true"
- "traefik.http.routers.litellm.priority=100"
- "traefik.http.services.litellm.loadbalancer.server.port=4000"

# Rate limiting
- "traefik.http.middlewares.litellm-ratelimit.ratelimit.average=100"
- "traefik.http.middlewares.litellm-ratelimit.ratelimit.burst=200"
- "traefik.http.routers.litellm.middlewares=litellm-ratelimit"

# Admin UI (litellm.haiven.local)
- "traefik.http.routers.litellm-admin.rule=Host(`litellm.haiven.local`)"
- "traefik.http.routers.litellm-admin.entrypoints=websecure"
- "traefik.http.routers.litellm-admin.tls=true"

Health Checks

LiteLLM Service

healthcheck:
  test: ["CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen('http://localhost:4000/health/liveliness')\" || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 60s

PostgreSQL Database

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-litellm}"]
  interval: 10s
  timeout: 5s
  retries: 5

Testing Health

# LiteLLM health
curl -s https://llm.haiven.local/health
# Expected: {"status": "healthy", ...}

# Liveness probe
curl -s https://llm.haiven.local/health/liveliness
# Expected: {"status": "healthy"}

# PostgreSQL health (from container)
docker exec litellm-postgres pg_isready -U litellm

Resource Limits

LiteLLM Service

| Resource | Limit | Reservation |
| --- | --- | --- |
| Memory | 2GB | 1GB |
| CPU | 2 cores | 1 core |

PostgreSQL Database

| Resource | Limit | Reservation |
| --- | --- | --- |
| Memory | 1GB | 512MB |
| CPU | 1 core | 0.5 core |
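
Expressed as Compose deploy.resources, the LiteLLM limits above would look roughly like this sketch (the actual docker-compose.yml may phrase them differently):

services:
  litellm:
    deploy:
      resources:
        limits:
          cpus: "2"       # 2 cores
          memory: 2G
        reservations:
          cpus: "1"       # 1 core
          memory: 1G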

Storage

Data Directories

| Path | Purpose | Persistence |
| --- | --- | --- |
| /mnt/storage/litellm-data/postgres | PostgreSQL data | Persistent |
| /app/config.yaml | LiteLLM configuration | Bind mount (read-only) |

Backup Recommendations

# Backup PostgreSQL database
docker exec litellm-postgres pg_dump -U litellm litellm > /backup/litellm-$(date +%Y%m%d).sql

# Restore database
docker exec -i litellm-postgres psql -U litellm litellm < /backup/litellm-20251219.sql

Environment Variables

Required (in .env file)

# Database
POSTGRES_USER=litellm
POSTGRES_PASSWORD=<secure_password>
POSTGRES_DB=litellm
DATABASE_URL=postgresql://litellm:<password>@litellm-postgres:5432/litellm

# LiteLLM
LITELLM_MASTER_KEY=<master_key>

# Langfuse (observability)
LANGFUSE_PUBLIC_KEY=<public_key>
LANGFUSE_SECRET_KEY=<secret_key>
LANGFUSE_HOST=http://langfuse:3000
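
These keys are consumed by LiteLLM's built-in Langfuse callback. Wiring it up in config.yaml typically looks like the following sketch; the callback name is LiteLLM's standard langfuse integration, and the credentials are read from the environment variables above:

litellm_settings:
  # Send successful and failed requests to Langfuse for tracing
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]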

Set in docker-compose.yml

# Search integration
SEARXNG_API_BASE=http://searxng:8080

# Database model storage
STORE_MODEL_IN_DB=True

Docker Commands

Service Management

# Start the stack
cd /mnt/apps/docker/ai/litellm-observability/litellm
docker compose up -d

# Stop the stack
docker compose down

# Restart LiteLLM only
docker compose restart litellm

# View logs
docker logs -f litellm
docker logs -f litellm-postgres

# Check status
docker compose ps

Troubleshooting

# Check container health
docker inspect litellm --format='{{.State.Health.Status}}'

# View recent logs
docker logs --tail 100 litellm

# Enter container shell
docker exec -it litellm /bin/bash

# Check database connection
docker exec litellm-postgres psql -U litellm -c "SELECT 1"

Integration with Other Services

Echo (LibreChat)

Configure Echo to use LiteLLM as the LLM backend:

# In librechat.yaml
endpoints:
  openAI:
    baseURL: http://litellm:4000/v1
    apiKey: ${LITELLM_API_KEY}
    models:
      default: ["gpt-4", "gpt-3.5-turbo"]

Flowise

Add LiteLLM as an OpenAI-compatible endpoint:

  1. Add new "ChatOpenAI" node
  2. Base URL: http://litellm:4000/v1
  3. API Key: Your LiteLLM virtual key

MCP Server

The MCP server can use LiteLLM for LLM operations:

import os
from openai import OpenAI

# Talk to LiteLLM over the Docker network using the OpenAI SDK
client = OpenAI(
    base_url="http://litellm:4000/v1",
    api_key=os.environ["LITELLM_API_KEY"],
)

Prometheus Metrics

LiteLLM exposes metrics at /metrics:

curl -s https://llm.haiven.local/metrics

Key metrics:
- litellm_requests_total - Total requests by model
- litellm_request_duration_seconds - Request latency
- litellm_tokens_total - Token usage by model
- litellm_errors_total - Error counts

Prometheus Labels

labels:
  - "prometheus.scrape=true"
  - "prometheus.port=4000"
  - "prometheus.path=/metrics"

Troubleshooting

Common Issues

1. Connection Refused

# Check if service is running
docker ps | grep litellm

# Check if port is exposed
curl -v http://localhost:4000/health

2. Database Connection Failed

# Check PostgreSQL is healthy
docker exec litellm-postgres pg_isready

# Check database exists
docker exec litellm-postgres psql -U litellm -c "\l"

3. Model Not Found

# Verify llama-swap is running
curl -s http://localhost:8081/v1/models

# Check LiteLLM config
docker exec litellm cat /app/config.yaml

4. Rate Limit Exceeded

Rate limits are set at 100 requests/second average, 200 burst.

# Check current rate limit headers
curl -v https://llm.haiven.local/v1/models 2>&1 | grep -i ratelimit

Directory Structure

/mnt/apps/docker/ai/litellm-observability/litellm/
├── docker-compose.yml      # Service definitions
├── config.yaml             # LiteLLM configuration
├── .env                    # Environment variables (secrets)
├── README.md               # This file
└── USER_GUIDE.md           # End-user documentation

/mnt/storage/litellm-data/
└── postgres/               # PostgreSQL data directory


Changelog

2025-12-22

2025-12-19