LiteLLM Proxy

OpenAI-compatible API gateway with unified model access, virtual keys, and Langfuse observability

Status: Live
Category: AI - LLM Gateway
Maintainer: Haiven Infrastructure Team
Last Updated: 2025-12-22


Overview

LiteLLM Proxy is an OpenAI-compatible API gateway that sits in front of llama-swap and other AI services, providing:

- A single OpenAI-compatible API for all LLM, TTS, and STT backends
- Virtual API keys with per-user budgets and expiry
- Usage and spend tracking in PostgreSQL
- Request tracing and observability via Langfuse
- Pass-through routes to Piper TTS, StyleTTS2, and faster-whisper

Access Points

| Purpose | URL | Notes |
| --- | --- | --- |
| API Endpoint | https://llm.haiven.local/v1 | OpenAI-compatible API |
| Admin UI | https://litellm.haiven.local/ui | Key management, usage dashboard |
| Health Check | https://llm.haiven.local/health | Service health status |
| Metrics | https://llm.haiven.local/metrics | Prometheus metrics |
| Internal API | http://litellm:4000/v1 | Docker network access |
| TTS Pass-through | https://llm.haiven.local/tts/v1/audio/speech | Direct Piper TTS access |
| StyleTTS2 Pass-through | https://llm.haiven.local/styletts2/v1/audio/speech | Direct StyleTTS2 access |
| STT Pass-through | https://llm.haiven.local/stt/v1/audio/transcriptions | Direct Whisper access |

Quick Start

1. Test API Access

# Health check
curl -s https://llm.haiven.local/health

# List available models
curl -s https://llm.haiven.local/v1/models | jq '.data[].id'

# Chat completion (using master key)
curl -s https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }' | jq '.choices[0].message.content'

2. Access Admin UI

Navigate to https://litellm.haiven.local/ui and log in with the master key.

3. Create Virtual Keys

# Create a new API key for a user
curl -s https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4", "gpt-3.5-turbo"],
    "user_id": "user@example.com",
    "max_budget": 100.00,
    "duration": "30d"
  }' | jq

Architecture

                                    +-------------------+
                                    |    Langfuse       |
                                    | (Observability)   |
                                    +--------^----------+
                                             |
                                             | callbacks
                                             |
Client Request                               |
      |                                      |
      v                                      |
+------------+        +------------+    +----+-------+    +------------+
|  Traefik   | -----> |  LiteLLM   | -> | llama-swap | -> | GGUF Model |
| (TLS/Rate) |        |  (Proxy)   |    | (Backend)  |    | (GPU)      |
+------------+        +------------+    +------------+    +------------+
      |                     |
      |                     +---------> openedai-speech (TTS: tts-1, tts-1-hd)
      |                     |
      |                     +---------> styletts2-openai (TTS: styletts2)
      |                     |
      |                     +---------> faster-whisper (STT: whisper-1, whisper-large-v3)
      |                     |
      |                     +---------> SearXNG (search_tools)
      |                     |
      |                     v
      |              +-------------+
      |              | PostgreSQL  |
      |              | (Keys/Usage)|
      +              +-------------+
 llm.haiven.local
 litellm.haiven.local

Request Flow

  1. Client sends request to llm.haiven.local
  2. Traefik terminates TLS, applies rate limiting (100 avg / 200 burst)
  3. LiteLLM validates API key and applies model routing
  4. Request routed based on model type:
    - LLM models -> llama-swap backend -> GPU model inference
    - TTS models (tts-1, tts-1-hd) -> openedai-speech -> Piper/XTTS synthesis
    - TTS models (styletts2) -> styletts2-openai -> StyleTTS2 synthesis
    - STT models (whisper-1, whisper-large-v3) -> faster-whisper -> Audio transcription
  5. Response logged to Langfuse for observability
  6. Usage tracked in PostgreSQL database

Model Configuration

Model Mappings

LiteLLM routes requests based on model name:

| Client Model Name | Actual Model | Backend |
| --- | --- | --- |
| * (wildcard) | Same as requested | llama-swap |
| gpt-4 | qwen3-30b-a3b | llama-swap (PRO 6000) |
| gpt-4-turbo | qwen3-30b-a3b | llama-swap (PRO 6000) |
| gpt-3.5-turbo | qwen2.5-14b-instruct | llama-swap (RTX 4090) |
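
The authoritative routing lives in config.yaml (mounted at /app/config.yaml). As a rough sketch only, these mappings would typically be expressed in LiteLLM's model_list like the following; the llama-swap hostname and port (http://llama-swap:8080/v1) are assumptions, not taken from the live config:

model_list:
  # Wildcard entry: pass any requested model name straight through to llama-swap
  - model_name: "*"
    litellm_params:
      model: openai/*
      api_base: http://llama-swap:8080/v1
      api_key: none

  # OpenAI-style aliases mapped to local models
  - model_name: gpt-4
    litellm_params:
      model: openai/qwen3-30b-a3b
      api_base: http://llama-swap:8080/v1
      api_key: none

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/qwen2.5-14b-instruct
      api_base: http://llama-swap:8080/v1
      api_key: none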

Available Models

All models available in llama-swap are accessible through LiteLLM:

# List all available models
curl -s https://llm.haiven.local/v1/models | jq '.data[].id'

Common models include:
- qwen3-30b-a3b - Qwen3 30B (Q8_0, 31B params)
- qwen2.5-14b-instruct - Qwen2.5 14B Instruct
- gemma3-27b - Gemma 3 27B
- gpt-oss-120b - OpenAI OSS 120B

TTS Models (Text-to-Speech)

| Model Name | Backend | Description |
| --- | --- | --- |
| tts-1 | openedai-speech | Piper TTS - fast CPU-based synthesis |
| tts-1-hd | openedai-speech | XTTS - high quality voice cloning |
| styletts2 | styletts2-openai | StyleTTS2 - neural TTS with style transfer |

Available Voices: alloy, echo, fable, onyx, nova, shimmer

STT Models (Speech-to-Text)

| Model Name | Backend | Description |
| --- | --- | --- |
| whisper-large-v3 | faster-whisper | High accuracy GPU-accelerated transcription |
| whisper-1 | faster-whisper | OpenAI compatibility alias |

Supported Languages: 99+ languages with automatic detection
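
The TTS and STT models in the two tables above are registered the same way as the LLM aliases, pointing at the audio backends instead of llama-swap. A sketch under assumed container hostnames and ports (openedai-speech:8000, styletts2-openai:8000, faster-whisper:8000 are illustrative, not confirmed from the live config):

model_list:
  - model_name: tts-1
    litellm_params:
      model: openai/tts-1
      api_base: http://openedai-speech:8000/v1   # assumed hostname/port
      api_key: none

  - model_name: styletts2
    litellm_params:
      model: openai/styletts2
      api_base: http://styletts2-openai:8000/v1  # assumed hostname/port
      api_key: none

  - model_name: whisper-large-v3
    litellm_params:
      model: openai/whisper-large-v3
      api_base: http://faster-whisper:8000/v1    # assumed hostname/port
      api_key: none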

Search Tools

| Tool Name | Backend | Description |
| --- | --- | --- |
| searxng-search | SearXNG | Meta-search engine for LLM function calling |

Models with supports_function_calling: true can use the search tool when tools are enabled in requests.


API Endpoints

OpenAI-Compatible Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completions (streaming supported) |
| POST | /v1/completions | Text completions |
| POST | /v1/embeddings | Generate embeddings |
| POST | /v1/audio/speech | Text-to-speech synthesis |
| POST | /v1/audio/transcriptions | Speech-to-text transcription |

Pass-through Endpoints

These endpoints bypass LiteLLM routing and forward directly to backends:

| Method | Endpoint | Target |
| --- | --- | --- |
| POST | /tts/v1/audio/speech | openedai-speech (Piper/XTTS) |
| POST | /styletts2/v1/audio/speech | styletts2-openai wrapper |
| POST | /stt/v1/audio/transcriptions | faster-whisper |
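
LiteLLM lets such routes be declared under general_settings.pass_through_endpoints in config.yaml. A sketch of what that could look like for the routes above; the backend hostnames and ports are assumptions:

general_settings:
  pass_through_endpoints:
    - path: /tts/v1/audio/speech
      target: http://openedai-speech:8000/v1/audio/speech      # assumed target
    - path: /styletts2/v1/audio/speech
      target: http://styletts2-openai:8000/v1/audio/speech     # assumed target
    - path: /stt/v1/audio/transcriptions
      target: http://faster-whisper:8000/v1/audio/transcriptions  # assumed target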

LiteLLM-Specific Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Basic health check |
| GET | /health/liveliness | Kubernetes-style liveness probe |
| GET | /health/readiness | Kubernetes-style readiness probe |
| GET | /metrics | Prometheus metrics |
| GET | /ui | Admin dashboard |
| POST | /key/generate | Create virtual API key |
| POST | /key/delete | Delete API key |
| GET | /key/info | Get key information |
| GET | /spend/logs | Get spend logs |
| GET | /user/info | Get user information |

Example: Chat Completion

curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'

Example: Streaming Response

curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'

Example: Text-to-Speech (TTS)

# Using Piper TTS (fast)
curl -X POST https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of the text-to-speech system.",
    "voice": "alloy"
  }' --output speech.mp3

# Using StyleTTS2 (high quality)
curl -X POST https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "styletts2",
    "input": "High quality neural speech synthesis.",
    "voice": "nova"
  }' --output speech_hq.wav

# Via pass-through endpoint (bypasses router)
curl -X POST https://llm.haiven.local/tts/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Direct access to Piper TTS.",
    "voice": "echo"
  }' --output speech_direct.mp3

Example: Speech-to-Text (STT)

# Transcribe audio file
curl -X POST https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3"

# Using OpenAI-compatible model name
curl -X POST https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-1"

# Via pass-through endpoint
curl -X POST https://llm.haiven.local/stt/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-1"

Example: Web Search with Function Calling

curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b-q8-abl",
    "messages": [{"role": "user", "content": "What are the latest AI news today?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "searxng-search",
        "description": "Search the web for information"
      }
    }],
    "tool_choice": "auto"
  }'

Network Configuration

Networks

| Network | Type | Purpose |
| --- | --- | --- |
| web | External | Traefik routing, public access |
| backend | External | Internal service communication |
| litellm-internal | Bridge | Stack internal (LiteLLM <-> PostgreSQL) |
| langfuse-internal | External | Connection to Langfuse for tracing |
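
In docker-compose.yml terms, that table corresponds roughly to the following networks block (a sketch; the external network names are assumed to match those shown above):

networks:
  web:
    external: true        # Traefik routing, public access
  backend:
    external: true        # shared internal service communication
  litellm-internal:
    driver: bridge        # LiteLLM <-> PostgreSQL only
  langfuse-internal:
    external: true        # tracing traffic to Langfuse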

Traefik Labels

# Primary API endpoint (llm.haiven.local)
- "traefik.http.routers.litellm.rule=Host(`llm.haiven.local`)"
- "traefik.http.routers.litellm.entrypoints=websecure"
- "traefik.http.routers.litellm.tls=true"
- "traefik.http.routers.litellm.priority=100"
- "traefik.http.services.litellm.loadbalancer.server.port=4000"

# Rate limiting
- "traefik.http.middlewares.litellm-ratelimit.ratelimit.average=100"
- "traefik.http.middlewares.litellm-ratelimit.ratelimit.burst=200"
- "traefik.http.routers.litellm.middlewares=litellm-ratelimit"

# Admin UI (litellm.haiven.local)
- "traefik.http.routers.litellm-admin.rule=Host(`litellm.haiven.local`)"
- "traefik.http.routers.litellm-admin.entrypoints=websecure"
- "traefik.http.routers.litellm-admin.tls=true"

Health Checks

LiteLLM Service

healthcheck:
  test: ["CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen('http://localhost:4000/health/liveliness')\" || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 60s

PostgreSQL Database

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-litellm}"]
  interval: 10s
  timeout: 5s
  retries: 5

Testing Health

# LiteLLM health
curl -s https://llm.haiven.local/health
# Expected: {"status": "healthy", ...}

# Liveness probe
curl -s https://llm.haiven.local/health/liveliness
# Expected: {"status": "healthy"}

# PostgreSQL health (from container)
docker exec litellm-postgres pg_isready -U litellm

Resource Limits

LiteLLM Service

| Resource | Limit | Reservation |
| --- | --- | --- |
| Memory | 2GB | 1GB |
| CPU | 2 cores | 1 core |

PostgreSQL Database

| Resource | Limit | Reservation |
| --- | --- | --- |
| Memory | 1GB | 512MB |
| CPU | 1 core | 0.5 core |
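
Expressed as Compose deploy.resources, the LiteLLM limits above would look roughly like this sketch (the actual docker-compose.yml may phrase them differently):

services:
  litellm:
    deploy:
      resources:
        limits:
          cpus: "2"       # 2 cores
          memory: 2G
        reservations:
          cpus: "1"       # 1 core
          memory: 1G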

Storage

Data Directories

| Path | Purpose | Persistence |
| --- | --- | --- |
| /mnt/storage/litellm-data/postgres | PostgreSQL data | Persistent |
| /app/config.yaml | LiteLLM configuration | Bind mount (read-only) |

Backup Recommendations

# Backup PostgreSQL database
docker exec litellm-postgres pg_dump -U litellm litellm > /backup/litellm-$(date +%Y%m%d).sql

# Restore database
docker exec -i litellm-postgres psql -U litellm litellm < /backup/litellm-20251219.sql

Environment Variables

Required (in .env file)

# Database
POSTGRES_USER=litellm
POSTGRES_PASSWORD=<secure_password>
POSTGRES_DB=litellm
DATABASE_URL=postgresql://litellm:<password>@litellm-postgres:5432/litellm

# LiteLLM
LITELLM_MASTER_KEY=<master_key>

# Langfuse (observability)
LANGFUSE_PUBLIC_KEY=<public_key>
LANGFUSE_SECRET_KEY=<secret_key>
LANGFUSE_HOST=http://langfuse:3000
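
These keys are consumed by LiteLLM's built-in Langfuse callback. Wiring it up in config.yaml typically looks like the following sketch; the callback name is LiteLLM's standard langfuse integration, and the credentials are read from the environment variables above:

litellm_settings:
  # Send successful and failed requests to Langfuse for tracing
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]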

Set in docker-compose.yml

# Search integration
SEARXNG_API_BASE=http://searxng:8080

# Database model storage
STORE_MODEL_IN_DB=True

Docker Commands

Service Management

# Start the stack
cd /mnt/apps/docker/ai/litellm-observability/litellm
docker compose up -d

# Stop the stack
docker compose down

# Restart LiteLLM only
docker compose restart litellm

# View logs
docker logs -f litellm
docker logs -f litellm-postgres

# Check status
docker compose ps

Troubleshooting

# Check container health
docker inspect litellm --format='{{.State.Health.Status}}'

# View recent logs
docker logs --tail 100 litellm

# Enter container shell
docker exec -it litellm /bin/bash

# Check database connection
docker exec litellm-postgres psql -U litellm -c "SELECT 1"

Integration with Other Services

Echo (LibreChat)

Configure Echo to use LiteLLM as the LLM backend:

# In librechat.yaml
endpoints:
  openAI:
    baseURL: http://litellm:4000/v1
    apiKey: ${LITELLM_API_KEY}
    models:
      default: ["gpt-4", "gpt-3.5-turbo"]

Flowise

Add LiteLLM as an OpenAI-compatible endpoint:

  1. Add new "ChatOpenAI" node
  2. Base URL: http://litellm:4000/v1
  3. API Key: Your LiteLLM virtual key

MCP Server

The MCP server can use LiteLLM for LLM operations:

import os
from openai import OpenAI

# Talk to LiteLLM over the Docker network using the OpenAI SDK
client = OpenAI(
    base_url="http://litellm:4000/v1",
    api_key=os.environ["LITELLM_API_KEY"],
)

Prometheus Metrics

LiteLLM exposes metrics at /metrics:

curl -s https://llm.haiven.local/metrics

Key metrics:
- litellm_requests_total - Total requests by model
- litellm_request_duration_seconds - Request latency
- litellm_tokens_total - Token usage by model
- litellm_errors_total - Error counts

Prometheus Labels

labels:
  - "prometheus.scrape=true"
  - "prometheus.port=4000"
  - "prometheus.path=/metrics"

Troubleshooting

Common Issues

1. Connection Refused

# Check if service is running
docker ps | grep litellm

# Check if port is exposed
curl -v http://localhost:4000/health

2. Database Connection Failed

# Check PostgreSQL is healthy
docker exec litellm-postgres pg_isready

# Check database exists
docker exec litellm-postgres psql -U litellm -c "\l"

3. Model Not Found

# Verify llama-swap is running
curl -s http://localhost:8081/v1/models

# Check LiteLLM config
docker exec litellm cat /app/config.yaml

4. Rate Limit Exceeded

Rate limits are set at 100 requests/second average, 200 burst.

# Check current rate limit headers
curl -v https://llm.haiven.local/v1/models 2>&1 | grep -i ratelimit

Directory Structure

/mnt/apps/docker/ai/litellm-observability/litellm/
├── docker-compose.yml      # Service definitions
├── config.yaml             # LiteLLM configuration
├── .env                    # Environment variables (secrets)
├── README.md               # This file
└── USER_GUIDE.md           # End-user documentation

/mnt/storage/litellm-data/
└── postgres/               # PostgreSQL data directory


Changelog

2025-12-22

2025-12-19