OpenAI-compatible API gateway with unified model access, virtual keys, and Langfuse observability
Status: Live
Category: AI - LLM Gateway
Maintainer: Haiven Infrastructure Team
Last Updated: 2025-12-22
LiteLLM Proxy is an OpenAI-compatible API gateway that sits in front of llama-swap and other AI services, providing unified model access, virtual API keys, and Langfuse observability. Quick reference:
| Purpose | URL | Notes |
|---|---|---|
| API Endpoint | https://llm.haiven.local/v1 | OpenAI-compatible API |
| Admin UI | https://litellm.haiven.local/ui | Key management, usage dashboard |
| Health Check | https://llm.haiven.local/health | Service health status |
| Metrics | https://llm.haiven.local/metrics | Prometheus metrics |
| Internal API | http://litellm:4000/v1 | Docker network access |
| TTS Pass-through | https://llm.haiven.local/tts/v1/audio/speech | Direct Piper TTS access |
| StyleTTS2 Pass-through | https://llm.haiven.local/styletts2/v1/audio/speech | Direct StyleTTS2 access |
| STT Pass-through | https://llm.haiven.local/stt/v1/audio/transcriptions | Direct Whisper access |
```bash
# Health check
curl -s https://llm.haiven.local/health

# List available models
curl -s https://llm.haiven.local/v1/models | jq '.data[].id'

# Chat completion (using master key)
curl -s https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }' | jq '.choices[0].message.content'
```
Navigate to https://litellm.haiven.local/ui and log in with the master key.
```bash
# Create a new API key for a user
curl -s https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4", "gpt-3.5-turbo"],
    "user_id": "user@example.com",
    "max_budget": 100.00,
    "duration": "30d"
  }' | jq
```
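The response includes the generated key as an `sk-...` value. As a quick follow-up, the key can be inspected through the `/key/info` endpoint documented under Management below; a sketch, with a placeholder key value:

```bash
# Inspect a virtual key (replace the key value with the one returned by /key/generate)
curl -s "https://llm.haiven.local/key/info?key=sk-REPLACE_ME" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
```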
```
                                  +-------------------+
                                  |      Langfuse     |
                                  |  (Observability)  |
                                  +--------^----------+
                                           |
                                           | callbacks
                                           |
Client Request                             |
      |                                    |
      v                                    |
+------------+     +------------+     +----+-------+     +------------+
|  Traefik   | --> |  LiteLLM   | --> | llama-swap | --> | GGUF Model |
| (TLS/Rate) |     |  (Proxy)   |     | (Backend)  |     |   (GPU)    |
+------------+     +------------+     +------------+     +------------+
      |                   |
      |                   +---------> openedai-speech (TTS: tts-1, tts-1-hd)
      |                   |
      |                   +---------> styletts2-openai (TTS: styletts2)
      |                   |
      |                   +---------> faster-whisper (STT: whisper-1, whisper-large-v3)
      |                   |
      |                   +---------> SearXNG (search_tools)
      |                   |
      |                   v
      |            +--------------+
      |            |  PostgreSQL  |
      |            | (Keys/Usage) |
      v            +--------------+
llm.haiven.local
litellm.haiven.local
```

LiteLLM routes requests based on model name:
| Client Model Name | Actual Model | Backend |
|---|---|---|
| `*` (wildcard) | Same as requested | llama-swap |
| `gpt-4` | `qwen3-30b-a3b` | llama-swap (PRO 6000) |
| `gpt-4-turbo` | `qwen3-30b-a3b` | llama-swap (PRO 6000) |
| `gpt-3.5-turbo` | `qwen2.5-14b-instruct` | llama-swap (RTX 4090) |
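Since `gpt-4` is only an alias, a request for it should be served by `qwen3-30b-a3b`; a quick sanity check (the `model` field in the response may echo either name, depending on configuration):

```bash
# Confirm the gpt-4 alias resolves to the mapped backend model
curl -s https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 5}' \
  | jq '.model'
```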
All models available in llama-swap are accessible through LiteLLM:
```bash
# List all available models
curl -s https://llm.haiven.local/v1/models | jq '.data[].id'
```
Common models include:
- `qwen3-30b-a3b` - Qwen3 30B (Q8_0, 31B params)
- `qwen2.5-14b-instruct` - Qwen2.5 14B Instruct
- `gemma3-27b` - Gemma 3 27B
- `gpt-oss-120b` - OpenAI OSS 120B
| Model Name | Backend | Description |
|---|---|---|
| `tts-1` | openedai-speech | Piper TTS - fast CPU-based synthesis |
| `tts-1-hd` | openedai-speech | XTTS - high-quality voice cloning |
| `styletts2` | styletts2-openai | StyleTTS2 - neural TTS with style transfer |
Available Voices: alloy, echo, fable, onyx, nova, shimmer
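To audition all six voices in one pass, a small loop over the list above works (assumes `tts-1` and writes one file per voice):

```bash
# Synthesize a short sample with each available voice
for v in alloy echo fable onyx nova shimmer; do
  curl -s -X POST https://llm.haiven.local/v1/audio/speech \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d "{\"model\": \"tts-1\", \"input\": \"Voice check: $v\", \"voice\": \"$v\"}" \
    --output "voice-$v.mp3"
done
```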
| Model Name | Backend | Description |
|---|---|---|
| `whisper-large-v3` | faster-whisper | High-accuracy GPU-accelerated transcription |
| `whisper-1` | faster-whisper | OpenAI compatibility alias |
Supported Languages: 99+ languages with automatic detection
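Automatic detection can be overridden with the optional `language` form field from the OpenAI transcription API (assuming the faster-whisper backend honors it, which is typical):

```bash
# Force German transcription instead of auto-detect (ISO-639-1 code)
curl -X POST https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@interview.wav" \
  -F "model=whisper-large-v3" \
  -F "language=de"
```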
| Tool Name | Backend | Description |
|---|---|---|
| `searxng-search` | SearXNG | Meta-search engine for LLM function calling |
Models with `supports_function_calling: true` can use the search tool when tools are enabled in requests.
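To see which models advertise this capability, LiteLLM's `/model/info` endpoint can be filtered with `jq`; a sketch (the response shape may vary between LiteLLM versions):

```bash
# List model names that report function-calling support
curl -s https://llm.haiven.local/model/info \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[] | select(.model_info.supports_function_calling == true) | .model_name'
```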
| Method | Endpoint | Description |
|---|---|---|
| GET | `/v1/models` | List available models |
| POST | `/v1/chat/completions` | Chat completions (streaming supported) |
| POST | `/v1/completions` | Text completions |
| POST | `/v1/embeddings` | Generate embeddings |
| POST | `/v1/audio/speech` | Text-to-speech synthesis |
| POST | `/v1/audio/transcriptions` | Speech-to-text transcription |
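For instance, `/v1/embeddings` follows the same calling convention as the other endpoints; a minimal sketch (the model name is a placeholder; substitute an embedding model actually configured in LiteLLM):

```bash
# Generate an embedding and print its dimensionality
curl -s https://llm.haiven.local/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{"model": "<embedding-model>", "input": "The quick brown fox"}' \
  | jq '.data[0].embedding | length'
```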
These endpoints bypass LiteLLM routing and forward directly to backends:
| Method | Endpoint | Target |
|---|---|---|
| POST | `/tts/v1/audio/speech` | openedai-speech (Piper/XTTS) |
| POST | `/styletts2/v1/audio/speech` | styletts2-openai wrapper |
| POST | `/stt/v1/audio/transcriptions` | faster-whisper |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Basic health check |
| GET | `/health/liveliness` | Kubernetes liveness probe |
| GET | `/health/readiness` | Kubernetes readiness probe |
| GET | `/metrics` | Prometheus metrics |
| GET | `/ui` | Admin dashboard |
| POST | `/key/generate` | Create virtual API key |
| POST | `/key/delete` | Delete API key |
| GET | `/key/info` | Get key information |
| GET | `/spend/logs` | Get spend logs |
| GET | `/user/info` | Get user information |
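Spend tracking follows the same pattern; a sketch pulling logs for a single user (the `user_id` query parameter follows the LiteLLM spend API, so verify against the deployed version):

```bash
# Fetch spend logs for one user
curl -s "https://llm.haiven.local/spend/logs?user_id=user@example.com" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq
```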
```bash
curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'
```
```bash
curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'
```
```bash
# Using Piper TTS (fast)
curl -X POST https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of the text-to-speech system.",
    "voice": "alloy"
  }' --output speech.mp3

# Using StyleTTS2 (high quality)
curl -X POST https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "styletts2",
    "input": "High quality neural speech synthesis.",
    "voice": "nova"
  }' --output speech_hq.wav

# Via pass-through endpoint (bypasses router)
curl -X POST https://llm.haiven.local/tts/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Direct access to Piper TTS.",
    "voice": "echo"
  }' --output speech_direct.mp3
```
```bash
# Transcribe audio file
curl -X POST https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3"

# Using OpenAI-compatible model name
curl -X POST https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-1"

# Via pass-through endpoint
curl -X POST https://llm.haiven.local/stt/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-1"
```
```bash
curl -X POST https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b-q8-abl",
    "messages": [{"role": "user", "content": "What are the latest AI news today?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "searxng-search",
        "description": "Search the web for information"
      }
    }],
    "tool_choice": "auto"
  }'
```
| Network | Type | Purpose |
|---|---|---|
| `web` | External | Traefik routing, public access |
| `backend` | External | Internal service communication |
| `litellm-internal` | Bridge | Stack internal (LiteLLM <-> PostgreSQL) |
| `langfuse-internal` | External | Connection to Langfuse for tracing |
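Network membership can be verified from the host with the standard Docker CLI:

```bash
# List containers attached to the stack-internal network
docker network inspect litellm-internal --format '{{range .Containers}}{{.Name}} {{end}}'
```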
```yaml
# Primary API endpoint (llm.haiven.local)
- "traefik.http.routers.litellm.rule=Host(`llm.haiven.local`)"
- "traefik.http.routers.litellm.entrypoints=websecure"
- "traefik.http.routers.litellm.tls=true"
- "traefik.http.routers.litellm.priority=100"
- "traefik.http.services.litellm.loadbalancer.server.port=4000"

# Rate limiting
- "traefik.http.middlewares.litellm-ratelimit.ratelimit.average=100"
- "traefik.http.middlewares.litellm-ratelimit.ratelimit.burst=200"
- "traefik.http.routers.litellm.middlewares=litellm-ratelimit"

# Admin UI (litellm.haiven.local)
- "traefik.http.routers.litellm-admin.rule=Host(`litellm.haiven.local`)"
- "traefik.http.routers.litellm-admin.entrypoints=websecure"
- "traefik.http.routers.litellm-admin.tls=true"
```
```yaml
# LiteLLM
healthcheck:
  test: ["CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen('http://localhost:4000/health/liveliness')\" || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 60s

# PostgreSQL
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-litellm}"]
  interval: 10s
  timeout: 5s
  retries: 5
```
```bash
# LiteLLM health
curl -s https://llm.haiven.local/health
# Expected: {"status": "healthy", ...}

# Liveness probe
curl -s https://llm.haiven.local/health/liveliness
# Expected: {"status": "healthy"}

# PostgreSQL health (from container)
docker exec litellm-postgres pg_isready -U litellm
```
LiteLLM container:

| Resource | Limit | Reservation |
|---|---|---|
| Memory | 2GB | 1GB |
| CPU | 2 cores | 1 core |

PostgreSQL container:

| Resource | Limit | Reservation |
|---|---|---|
| Memory | 1GB | 512MB |
| CPU | 1 core | 0.5 core |
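Actual consumption can be compared against these limits at any time:

```bash
# One-shot resource snapshot for both containers
docker stats --no-stream litellm litellm-postgres
```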
| Path | Purpose | Persistence |
|---|---|---|
| `/mnt/storage/litellm-data/postgres` | PostgreSQL data | Persistent |
| `/app/config.yaml` | LiteLLM configuration | Bind mount (read-only) |
```bash
# Backup PostgreSQL database
docker exec litellm-postgres pg_dump -U litellm litellm > /backup/litellm-$(date +%Y%m%d).sql

# Restore database
docker exec -i litellm-postgres psql -U litellm litellm < /backup/litellm-20251219.sql
```
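For unattended backups, a nightly cron entry along these lines would work (the path and schedule are illustrative; note that `%` must be escaped in crontab):

```bash
# /etc/cron.d/litellm-backup (hypothetical) - dump the database at 03:00 daily
0 3 * * * root docker exec litellm-postgres pg_dump -U litellm litellm > /backup/litellm-$(date +\%Y\%m\%d).sql
```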
```bash
# Database
POSTGRES_USER=litellm
POSTGRES_PASSWORD=<secure_password>
POSTGRES_DB=litellm
DATABASE_URL=postgresql://litellm:<password>@litellm-postgres:5432/litellm

# LiteLLM
LITELLM_MASTER_KEY=<master_key>

# Langfuse (observability)
LANGFUSE_PUBLIC_KEY=<public_key>
LANGFUSE_SECRET_KEY=<secret_key>
LANGFUSE_HOST=http://langfuse:3000

# Search integration
SEARXNG_API_BASE=http://searxng:8080

# Database model storage
STORE_MODEL_IN_DB=True
```
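A quick way to mint a strong master key (the `sk-` prefix follows the LiteLLM convention):

```bash
# Generate a random master key for the .env file
echo "LITELLM_MASTER_KEY=sk-$(openssl rand -hex 24)"
```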
```bash
# Start the stack
cd /mnt/apps/docker/ai/litellm-observability/litellm
docker compose up -d

# Stop the stack
docker compose down

# Restart LiteLLM only
docker compose restart litellm

# View logs
docker logs -f litellm
docker logs -f litellm-postgres

# Check status
docker compose ps
```
```bash
# Check container health
docker inspect litellm --format='{{.State.Health.Status}}'

# View recent logs
docker logs --tail 100 litellm

# Enter container shell
docker exec -it litellm /bin/bash

# Check database connection
docker exec litellm-postgres psql -U litellm -c "SELECT 1"
```
Configure Echo to use LiteLLM as the LLM backend:
```yaml
# In librechat.yaml
endpoints:
  openAI:
    baseURL: http://litellm:4000/v1
    apiKey: ${LITELLM_API_KEY}
    models:
      default: ["gpt-4", "gpt-3.5-turbo"]
```
Add LiteLLM as an OpenAI-compatible endpoint: `http://litellm:4000/v1`

The MCP server can use LiteLLM for LLM operations:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000/v1",
    api_key=os.environ["LITELLM_API_KEY"],
)
```
LiteLLM exposes metrics at `/metrics`:

```bash
curl -s https://llm.haiven.local/metrics
```
Key metrics:
- `litellm_requests_total` - Total requests by model
- `litellm_request_duration_seconds` - Request latency
- `litellm_tokens_total` - Token usage by model
- `litellm_errors_total` - Error counts
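A quick scrape confirms the counters are being exported (metric names as listed above; exact names can differ between LiteLLM versions):

```bash
# Show the first exported litellm_* metrics
curl -s https://llm.haiven.local/metrics | grep -E '^litellm_' | head
```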
```yaml
labels:
  - "prometheus.scrape=true"
  - "prometheus.port=4000"
  - "prometheus.path=/metrics"
```
```bash
# Check if service is running
docker ps | grep litellm

# Check if port is exposed
curl -v http://localhost:4000/health

# Check PostgreSQL is healthy
docker exec litellm-postgres pg_isready

# Check database exists
docker exec litellm-postgres psql -U litellm -c "\l"

# Verify llama-swap is running
curl -s http://localhost:8081/v1/models

# Check LiteLLM config
docker exec litellm cat /app/config.yaml
```
Rate limits are set at 100 requests/second average, 200 burst.
```bash
# Check current rate limit headers
curl -v https://llm.haiven.local/v1/models 2>&1 | grep -i ratelimit
```
```
/mnt/apps/docker/ai/litellm-observability/litellm/
├── docker-compose.yml    # Service definitions
├── config.yaml           # LiteLLM configuration
├── .env                  # Environment variables (secrets)
├── README.md             # This file
└── USER_GUIDE.md         # End-user documentation

/mnt/storage/litellm-data/
└── postgres/             # PostgreSQL data directory
```
- `tts-1` via openedai-speech (Piper TTS)
- `tts-1-hd` via openedai-speech (XTTS high quality)
- `styletts2` via styletts2-openai wrapper
- `whisper-large-v3` via faster-whisper
- `whisper-1` (OpenAI compatibility alias)
- `SEARXNG_API_BASE` environment variable
- `searxng-search` tool for models with function calling
- `/tts/v1/audio/speech` -> openedai-speech
- `/styletts2/v1/audio/speech` -> styletts2-openai
- `/stt/v1/audio/transcriptions` -> faster-whisper