LiteLLM Proxy User Guide

Your gateway to AI-powered conversations, speech synthesis, transcription, and web search on Haiven


Table of Contents

  1. Getting Started
  2. Accessing the API
  3. Using the Admin UI
  4. Common Use Cases
  5. Working with Models
  6. Text-to-Speech (TTS) Deep Dive
  7. Speech-to-Text (STT) Deep Dive
  8. Web Search Integration
  9. Pass-through Endpoints
  10. API Key Management
  11. Advanced Workflows
  12. Tips and Best Practices
  13. Troubleshooting
  14. FAQ
  15. Quick Reference

Getting Started

LiteLLM is your central hub for accessing all AI capabilities on Haiven. It provides a standard OpenAI-compatible API, so any tool or library that works with OpenAI will work with LiteLLM.

What You Can Do

- Chat with locally hosted LLMs through an OpenAI-compatible API
- Convert text to speech with three TTS engines (Piper, XTTS, StyleTTS2)
- Transcribe and translate audio with GPU-accelerated Whisper
- Let models search the web through the SearXNG tool integration
- Manage API keys, budgets, and usage from the Admin UI

Access URLs

| What | Where |
|------|-------|
| API Endpoint | https://llm.haiven.local/v1 |
| Admin Dashboard | https://litellm.haiven.local/ui |
| Health Status | https://llm.haiven.local/health |
| TTS Pass-through (Piper) | https://llm.haiven.local/tts/v1/audio/speech |
| TTS Pass-through (StyleTTS2) | https://llm.haiven.local/styletts2/v1/audio/speech |
| STT Pass-through | https://llm.haiven.local/stt/v1/audio/transcriptions |

Accessing the API

From Command Line (curl)

# Simple chat request
curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

From Python

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather like?"}
    ]
)

print(response.choices[0].message.content)

From JavaScript/Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://llm.haiven.local/v1',
  apiKey: 'YOUR_API_KEY',
});

const response = await openai.chat.completions.create({
  model: 'qwen3-30b-a3b',
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(response.choices[0].message.content);

Internal Docker Access

If you're running code inside the Haiven Docker network:

from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000/v1",
    api_key="YOUR_API_KEY"
)

Using the Admin UI

Accessing the Dashboard

  1. Open your browser to https://litellm.haiven.local/ui
  2. Log in with your API key (or master key for admin access)
  3. You'll see the main dashboard with usage statistics

Dashboard Features

| Tab | What It Shows |
|-----|---------------|
| Dashboard | Usage statistics, request counts, token usage |
| Keys | Create and manage API keys |
| Models | Available models and their configurations |
| Usage | Detailed spend and usage logs |
| Settings | Configuration options |

Creating a New API Key

  1. Click on Keys in the sidebar
  2. Click Create New Key
  3. Fill in the details:
    - Key Name: A descriptive name (e.g., "My Project")
    - Models: Which models this key can access
    - Budget: Optional spending limit
    - Duration: How long the key is valid
  4. Click Create
  5. Copy the generated key (you won't see it again!)

Common Use Cases

1. Simple Chat Conversation

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "How do I read a file in Python?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

2. Streaming Response

Get responses as they're generated:

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Write a short story"}],
    "stream": true
  }'

3. Code Generation

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [
      {"role": "system", "content": "You are an expert Python programmer. Write clean, efficient code."},
      {"role": "user", "content": "Write a function to calculate Fibonacci numbers with memoization"}
    ],
    "temperature": 0.2
  }'

4. Multi-turn Conversation

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

messages = [
    {"role": "system", "content": "You are a helpful tutor."}
]

# First turn
messages.append({"role": "user", "content": "Explain what a variable is in programming"})
response = client.chat.completions.create(model="gpt-4", messages=messages)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Second turn
messages.append({"role": "user", "content": "Can you give me an example?"})
response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)

5. JSON Mode Response

Get structured JSON output:

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant that responds in JSON format."},
      {"role": "user", "content": "List 3 programming languages with their key features"}
    ],
    "response_format": {"type": "json_object"}
  }'
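
The same request from Python. A minimal sketch; it assumes the deployment honors response_format for this model. The message content arrives as a JSON string, so parse it with json.loads:

from openai import OpenAI
import json

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in JSON format."},
        {"role": "user", "content": "List 3 programming languages with their key features"}
    ],
    response_format={"type": "json_object"}
)

# The content is a JSON string, not a parsed object
data = json.loads(response.choices[0].message.content)
print(data)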

Working with Models

Available LLM Models

| Model Name | Best For | Size |
|------------|----------|------|
| qwen3-30b-a3b | General purpose, coding | 30B params |
| qwen2.5-14b-instruct | Fast responses | 14B params |
| gemma3-27b | General purpose | 27B params |
| gpt-oss-120b | Complex reasoning | 120B params |

Model Aliases

For compatibility with OpenAI tools, you can use these aliases:

| Alias | Maps To |
|-------|---------|
| gpt-4 | qwen3-30b-a3b |
| gpt-4-turbo | qwen3-30b-a3b |
| gpt-3.5-turbo | qwen2.5-14b-instruct |
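
For example, a request for gpt-4 is transparently served by qwen3-30b-a3b (assuming the alias table above; client is an OpenAI client configured as in the earlier examples):

# The alias keeps OpenAI-based tooling working unchanged
response = client.chat.completions.create(
    model="gpt-4",  # routed to qwen3-30b-a3b by the proxy
    messages=[{"role": "user", "content": "Hello!"}]
)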

List All Models

curl https://llm.haiven.local/v1/models \
  -H "Authorization: Bearer $API_KEY" | jq '.data[].id'

Choosing the Right Model

- qwen3-30b-a3b (alias gpt-4): the best default for general tasks and coding
- qwen2.5-14b-instruct (alias gpt-3.5-turbo): the fastest option for simple tasks
- gemma3-27b: an alternative general-purpose model
- gpt-oss-120b: complex reasoning where quality matters more than speed

Text-to-Speech (TTS) Deep Dive

LiteLLM provides access to three TTS engines, each with different characteristics.

TTS Engines Comparison

| Model | Engine | Speed | Quality | GPU Required | Best For |
|-------|--------|-------|---------|--------------|----------|
| tts-1 | Piper (ONNX) | Very Fast | Good | No (CPU) | Quick responses, notifications |
| tts-1-hd | XTTS | Medium | High | No (CPU) | Voice cloning, professional audio |
| styletts2 | StyleTTS2 | Slow | Highest | Yes (RTX 4090) | Maximum quality, style transfer |

Available Voices

All three engines support these OpenAI-compatible voice names:

| Voice | Description | Character |
|-------|-------------|-----------|
| alloy | Neutral, balanced | Professional, clear |
| echo | Male, warm | Approachable, friendly |
| fable | British accent | Storyteller, narrator |
| onyx | Male, deep | Authoritative, commanding |
| nova | Female, friendly | Conversational, warm |
| shimmer | Female, expressive | Energetic, enthusiastic |

Basic TTS Usage

# Using Piper TTS (fastest)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "Hello! Welcome to Haiven.",
    "voice": "alloy"
  }' --output hello.mp3

High-Quality TTS with StyleTTS2

# Using StyleTTS2 (highest quality)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "styletts2",
    "input": "This is professional-quality neural speech synthesis.",
    "voice": "nova"
  }' --output professional.wav

Python TTS Examples

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

# Quick TTS with Piper
def quick_speak(text: str, output_file: str = "output.mp3"):
    """Fast TTS for notifications and quick responses."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text
    )
    response.stream_to_file(Path(output_file))
    return output_file

# High-quality TTS with StyleTTS2
def professional_speak(text: str, voice: str = "nova", output_file: str = "professional.wav"):
    """High-quality TTS for professional content."""
    response = client.audio.speech.create(
        model="styletts2",
        voice=voice,
        input=text
    )
    response.stream_to_file(Path(output_file))
    return output_file

# Generate all voices for comparison
def generate_voice_samples(text: str):
    """Generate samples of all available voices."""
    voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

    for voice in voices:
        response = client.audio.speech.create(
            model="tts-1",
            voice=voice,
            input=text
        )
        response.stream_to_file(Path(f"sample_{voice}.mp3"))
        print(f"Generated sample_{voice}.mp3")

# Usage
quick_speak("You have a new message.")
professional_speak("Welcome to our quarterly earnings call.")
generate_voice_samples("The quick brown fox jumps over the lazy dog.")

Audio Output Formats

| Format | Extension | Description | File Size |
|--------|-----------|-------------|-----------|
| mp3 | .mp3 | Most compatible, lossy | Medium |
| opus | .opus | Best compression, lossy | Small |
| aac | .aac | Apple devices, lossy | Medium |
| flac | .flac | Lossless compression | Large |
| wav | .wav | Uncompressed, lossless | Very Large |
| pcm | .pcm | Raw audio, lossless | Very Large |

# Generate in different formats
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "Testing different audio formats.",
    "voice": "alloy",
    "response_format": "opus"
  }' --output speech.opus

# High-quality WAV for editing
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "styletts2",
    "input": "Uncompressed audio for post-processing.",
    "voice": "nova",
    "response_format": "wav"
  }' --output speech.wav

Speed Control

Adjust speech speed with the speed parameter (0.25 to 4.0):

# Slow narration (0.75x speed)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "This is spoken slowly for clarity and emphasis.",
    "voice": "fable",
    "speed": 0.75
  }' --output slow.mp3

# Fast reading (1.5x speed)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "This is spoken quickly for time-sensitive information.",
    "voice": "alloy",
    "speed": 1.5
  }' --output fast.mp3

TTS Use Case Examples

from openai import OpenAI
from pathlib import Path

client = OpenAI(base_url="https://llm.haiven.local/v1", api_key="YOUR_API_KEY")

# 1. Notification Sound
def create_notification(message: str):
    response = client.audio.speech.create(
        model="tts-1",  # Fast
        voice="alloy",
        input=message,
        speed=1.2  # Slightly faster
    )
    response.stream_to_file(Path("notification.mp3"))

# 2. Podcast Intro
def create_podcast_intro(show_name: str, episode_title: str):
    script = f"Welcome to {show_name}. Today's episode: {episode_title}."
    response = client.audio.speech.create(
        model="styletts2",  # High quality
        voice="onyx",  # Authoritative
        input=script
    )
    response.stream_to_file(Path("podcast_intro.wav"))

# 3. Audiobook Chapter
def narrate_chapter(text: str, chapter_num: int):
    response = client.audio.speech.create(
        model="styletts2",
        voice="fable",  # British narrator
        input=text,
        response_format="flac",  # Lossless for editing
        speed=0.9  # Slightly slower for clarity
    )
    response.stream_to_file(Path(f"chapter_{chapter_num}.flac"))

# 4. Voice Assistant Response
def assistant_response(text: str):
    response = client.audio.speech.create(
        model="tts-1",  # Fast response
        voice="nova",  # Friendly female
        input=text
    )
    return response.content  # Return bytes for immediate playback

# 5. Multi-language Announcement (using different voices)
def announcement(en_text: str, output_prefix: str):
    voices_per_region = {
        "us": "alloy",
        "uk": "fable",
        "casual": "nova"
    }

    for region, voice in voices_per_region.items():
        response = client.audio.speech.create(
            model="tts-1",
            voice=voice,
            input=en_text
        )
        response.stream_to_file(Path(f"{output_prefix}_{region}.mp3"))

Batch TTS Generation

import asyncio
from openai import AsyncOpenAI
from pathlib import Path

async_client = AsyncOpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

async def batch_tts(texts: list[str], voice: str = "alloy"):
    """Generate TTS for multiple texts concurrently."""
    async def generate_one(idx: int, text: str):
        response = await async_client.audio.speech.create(
            model="tts-1",
            voice=voice,
            input=text
        )
        output_path = Path(f"output_{idx}.mp3")
        response.stream_to_file(output_path)
        return output_path

    tasks = [generate_one(i, text) for i, text in enumerate(texts)]
    results = await asyncio.gather(*tasks)
    return results

# Usage
texts = [
    "First notification message.",
    "Second notification message.",
    "Third notification message."
]
asyncio.run(batch_tts(texts))

Speech-to-Text (STT) Deep Dive

LiteLLM provides GPU-accelerated speech recognition using Faster-Whisper.

STT Models

| Model | Speed | Accuracy | Languages |
|-------|-------|----------|-----------|
| whisper-1 | Fast | High | 99+ languages |
| whisper-large-v3 | Medium | Highest | 99+ languages |

Both models run on GPU for fast transcription and support automatic language detection.
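
To see automatic language detection in action, omit the language parameter and request verbose_json, which includes the detected language. A sketch; the file name is illustrative:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

with open("mystery_language.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json"  # includes the detected language
    )

print(transcript.language)  # e.g. "english"
print(transcript.text)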

Basic Transcription

# Simple transcription
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-1"

Response:

{
  "text": "Hello, this is a test recording of the speech recognition system."
}

Transcription with Timestamps

# Get word-level timestamps
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@interview.wav" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"

Response:

{
  "task": "transcribe",
  "language": "english",
  "duration": 15.5,
  "text": "Welcome to the interview.",
  "words": [
    {"word": "Welcome", "start": 0.0, "end": 0.5},
    {"word": "to", "start": 0.5, "end": 0.7},
    {"word": "the", "start": 0.7, "end": 0.9},
    {"word": "interview", "start": 0.9, "end": 1.5}
  ]
}

Segment-Level Timestamps

# Get segment-level timestamps (sentences/phrases)
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@podcast.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=segment"

Response:

{
  "task": "transcribe",
  "language": "english",
  "duration": 120.5,
  "text": "Welcome to the show. Today we discuss AI.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Welcome to the show.",
      "tokens": [50364, 5765, 281, 264, 1656, 13],
      "temperature": 0.0,
      "avg_logprob": -0.25,
      "compression_ratio": 1.2,
      "no_speech_prob": 0.01
    },
    {
      "id": 1,
      "seek": 250,
      "start": 2.5,
      "end": 5.0,
      "text": "Today we discuss AI.",
      "tokens": [50364, 2692, 321, 2248, 7318, 13],
      "temperature": 0.0,
      "avg_logprob": -0.18,
      "compression_ratio": 1.1,
      "no_speech_prob": 0.02
    }
  ]
}

Language-Specific Transcription

# Transcribe Spanish audio
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@spanish_audio.mp3" \
  -F "model=whisper-large-v3" \
  -F "language=es"

# Transcribe Japanese audio
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@japanese_audio.wav" \
  -F "model=whisper-large-v3" \
  -F "language=ja"

Audio Translation (to English)

# Translate any language audio to English text
curl https://llm.haiven.local/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@french_speech.mp3" \
  -F "model=whisper-large-v3"

Response Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| json | Simple JSON with text | Default, simple integration |
| text | Plain text only | Minimal processing |
| srt | SubRip subtitle format | Video subtitles |
| verbose_json | Full details with timestamps | Analytics, editing |
| vtt | WebVTT subtitle format | Web video players |

# Generate SRT subtitles
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@video_audio.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" \
  --output subtitles.srt

# Generate VTT subtitles for web
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@webinar.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=vtt" \
  --output captions.vtt

Python STT Examples

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

# 1. Simple Transcription
def transcribe_audio(file_path: str) -> str:
    """Transcribe audio file to text."""
    with open(file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcript.text

# 2. Transcription with Timestamps
def transcribe_with_timestamps(file_path: str):
    """Transcribe with word-level timestamps."""
    with open(file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"]
        )
    return transcript

# 3. Generate Subtitles
def generate_subtitles(file_path: str, format: str = "srt") -> str:
    """Generate subtitle file from audio."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format=format
        )

    output_path = Path(file_path).stem + f".{format}"
    with open(output_path, "w") as f:
        f.write(result)
    return output_path

# 4. Translate Foreign Audio to English
def translate_to_english(file_path: str) -> str:
    """Translate foreign language audio to English text."""
    with open(file_path, "rb") as audio_file:
        translation = client.audio.translations.create(
            model="whisper-large-v3",
            file=audio_file
        )
    return translation.text

# 5. Transcribe with Specific Language
def transcribe_language(file_path: str, language: str) -> str:
    """Transcribe audio in a specific language."""
    with open(file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            language=language  # e.g., "es", "fr", "de", "ja", "zh"
        )
    return transcript.text

# 6. Meeting Transcription with Speaker Diarization Prep
def meeting_transcription(file_path: str) -> dict:
    """Transcribe meeting with segments for speaker identification."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"]
        )

    # Process segments for meeting minutes
    segments = []
    for seg in result.segments:
        # verbose_json returns segment objects; use attribute access, not dict keys
        segments.append({
            "start": seg.start,
            "end": seg.end,
            "text": seg.text,
            "duration": seg.end - seg.start
        })

    return {
        "full_text": result.text,
        "duration": result.duration,
        "segments": segments
    }

# Usage examples
text = transcribe_audio("recording.mp3")
print(f"Transcription: {text}")

detailed = transcribe_with_timestamps("interview.wav")
print(f"Duration: {detailed.duration}s")
for word in detailed.words[:5]:
    print(f"  {word.word} ({word.start:.2f}s - {word.end:.2f}s)")

subtitles = generate_subtitles("video.mp3", "srt")
print(f"Subtitles saved to: {subtitles}")

english = translate_to_english("spanish_podcast.mp3")
print(f"English translation: {english}")

Supported Audio Formats

| Format | Extension | Max Size |
|--------|-----------|----------|
| MP3 | .mp3 | 25MB |
| MP4 Audio | .mp4, .m4a | 25MB |
| WAV | .wav | 25MB |
| WebM | .webm | 25MB |
| MPEG | .mpeg, .mpga | 25MB |
| OGG | .ogg | 25MB |
| FLAC | .flac | 25MB |

Web Search Integration

LiteLLM integrates with SearXNG to enable AI models to search the web and provide up-to-date information.

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b-q8-abl",
    "messages": [
      {"role": "user", "content": "What are the latest developments in AI this week?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "searxng-search",
        "description": "Search the web for current information"
      }
    }],
    "tool_choice": "auto"
  }'

Python Web Search Example

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def search_and_answer(question: str) -> str:
    """Use AI with web search to answer questions."""
    response = client.chat.completions.create(
        model="qwen3-30b-a3b-q8-abl",  # Model with function calling support
        messages=[
            {"role": "system", "content": "You are a helpful assistant with web search capabilities. Use search when you need current information."},
            {"role": "user", "content": question}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "searxng-search",
                "description": "Search the web for information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query"
                        }
                    },
                    "required": ["query"]
                }
            }
        }],
        tool_choice="auto"
    )

    return response.choices[0].message.content

# Usage
answer = search_and_answer("What are the latest news about OpenAI?")
print(answer)

answer = search_and_answer("What's the current price of Bitcoin?")
print(answer)

Research Assistant Example

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def research_topic(topic: str, depth: str = "overview") -> dict:
    """Research a topic using AI and web search."""

    prompts = {
        "overview": f"Provide a brief overview of {topic} with current information.",
        "detailed": f"Provide a comprehensive analysis of {topic} including recent developments, key players, and future outlook.",
        "news": f"What are the latest news and developments about {topic}?"
    }

    response = client.chat.completions.create(
        model="qwen3-30b-a3b-q8-abl",
        messages=[
            {"role": "system", "content": "You are a research assistant. Search the web for current information and provide well-sourced responses."},
            {"role": "user", "content": prompts.get(depth, prompts["overview"])}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "searxng-search",
                "description": "Search the web"
            }
        }],
        tool_choice="auto"
    )

    return {
        "topic": topic,
        "depth": depth,
        "response": response.choices[0].message.content,
        "model": response.model,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens
        }
    }

# Usage
research = research_topic("quantum computing", "detailed")
print(research["response"])

Pass-through Endpoints

Pass-through endpoints provide direct access to backend services, bypassing LiteLLM's routing logic. They are useful for:
- Avoiding API key requirements for internal services
- Direct access when you know which backend to use
- Reduced latency (no routing overhead)

Available Pass-through Endpoints

| Endpoint | Backend | Purpose |
|----------|---------|---------|
| /tts/v1/audio/speech | openedai-speech | Direct Piper/XTTS TTS |
| /styletts2/v1/audio/speech | styletts2-openai | Direct StyleTTS2 TTS |
| /stt/v1/audio/transcriptions | faster-whisper | Direct Whisper STT |

Direct TTS Access

# Direct Piper TTS (no API key needed on internal network)
curl https://llm.haiven.local/tts/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Direct access to Piper TTS.",
    "voice": "alloy"
  }' --output direct_piper.mp3

# Direct StyleTTS2 (no API key needed on internal network)
curl https://llm.haiven.local/styletts2/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Direct access to StyleTTS2.",
    "voice": "nova"
  }' --output direct_styletts2.wav

Direct STT Access

# Direct Whisper STT (no API key needed on internal network)
curl https://llm.haiven.local/stt/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"

When to Use Pass-through

| Scenario | Use Pass-through | Use Standard API |
|----------|------------------|------------------|
| Internal automation | Yes | - |
| Usage tracking needed | - | Yes |
| API key management | - | Yes |
| Lowest latency | Yes | - |
| Budget limits | - | Yes |
| Langfuse tracing | - | Yes |
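
Because pass-through endpoints skip LiteLLM's routing and key checks, you can call them with plain HTTP. A minimal sketch using the requests library, assuming unauthenticated access from the internal network as described above:

import requests

# Direct Piper TTS via the pass-through endpoint
resp = requests.post(
    "https://llm.haiven.local/tts/v1/audio/speech",
    json={"model": "tts-1", "input": "Pass-through test.", "voice": "alloy"},
    timeout=60
)
resp.raise_for_status()

with open("passthrough.mp3", "wb") as f:
    f.write(resp.content)  # raw audio bytes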

API Key Management

Types of Keys

| Type | Purpose | Who Creates It |
|------|---------|----------------|
| Master Key | Full admin access | System admin |
| Virtual Key | Limited access | Created via API/UI |

Creating a Key via API

curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4", "gpt-3.5-turbo", "tts-1", "whisper-1"],
    "user_id": "my-project",
    "max_budget": 50.00,
    "duration": "30d",
    "metadata": {"project": "my-app", "team": "engineering"}
  }'

Creating Keys for Specific Use Cases

# TTS-only key
curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["tts-1", "tts-1-hd", "styletts2"],
    "user_id": "tts-service",
    "max_budget": 10.00,
    "duration": "7d"
  }'

# STT-only key
curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["whisper-1", "whisper-large-v3"],
    "user_id": "transcription-service",
    "max_budget": 20.00,
    "duration": "30d"
  }'

# Full access key for development
curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": [],
    "user_id": "developer",
    "max_budget": 100.00,
    "duration": "90d",
    "metadata": {"environment": "development"}
  }'

Checking Key Info

curl https://llm.haiven.local/key/info \
  -H "Authorization: Bearer $YOUR_KEY"

Deleting a Key

curl -X POST https://llm.haiven.local/key/delete \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-..."]}'
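
Scripting Key Management

The same key endpoints can be scripted. A sketch with the requests library; it assumes /key/generate returns the new key under a "key" field, which matches LiteLLM's documented response:

import requests

BASE_URL = "https://llm.haiven.local"
HEADERS = {"Authorization": "Bearer MASTER_KEY"}

def create_key(models: list[str], user_id: str, max_budget: float, duration: str) -> str:
    """Generate a virtual key and return the secret value."""
    resp = requests.post(
        f"{BASE_URL}/key/generate",
        headers=HEADERS,
        json={"models": models, "user_id": user_id,
              "max_budget": max_budget, "duration": duration}
    )
    resp.raise_for_status()
    return resp.json()["key"]  # assumption: the new key is returned as "key"

def delete_key(key: str) -> None:
    """Revoke a virtual key."""
    resp = requests.post(f"{BASE_URL}/key/delete", headers=HEADERS, json={"keys": [key]})
    resp.raise_for_status()

# Usage
key = create_key(["tts-1"], "tts-service", 10.00, "7d")
print(f"New key: {key}")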

Advanced Workflows

Voice-Enabled Chatbot

from openai import OpenAI
from pathlib import Path
import tempfile

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

class VoiceChatbot:
    def __init__(self, voice: str = "nova"):
        self.voice = voice
        self.messages = [
            {"role": "system", "content": "You are a friendly voice assistant. Keep responses concise and conversational."}
        ]

    def transcribe(self, audio_path: str) -> str:
        """Convert user speech to text."""
        with open(audio_path, "rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )
        return transcript.text

    def think(self, user_text: str) -> str:
        """Generate AI response."""
        self.messages.append({"role": "user", "content": user_text})

        response = client.chat.completions.create(
            model="gpt-4",
            messages=self.messages,
            max_tokens=150
        )

        assistant_response = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_response})

        return assistant_response

    def speak(self, text: str, output_path: str = None) -> str:
        """Convert AI response to speech."""
        response = client.audio.speech.create(
            model="tts-1",  # Fast for conversation
            voice=self.voice,
            input=text
        )

        if output_path is None:
            # NamedTemporaryFile replaces the deprecated, race-prone tempfile.mktemp
            with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
                output_path = tmp.name

        response.stream_to_file(Path(output_path))
        return output_path

    def chat(self, audio_input_path: str) -> tuple[str, str, str]:
        """Full voice chat cycle: listen -> think -> speak."""
        # 1. Transcribe user speech
        user_text = self.transcribe(audio_input_path)

        # 2. Generate response
        response_text = self.think(user_text)

        # 3. Convert to speech
        audio_output_path = self.speak(response_text)

        return user_text, response_text, audio_output_path

# Usage
bot = VoiceChatbot(voice="nova")
user_said, bot_said, audio_file = bot.chat("user_recording.mp3")
print(f"User: {user_said}")
print(f"Bot: {bot_said}")
print(f"Audio: {audio_file}")

Podcast Transcription and Summarization

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def process_podcast(audio_path: str) -> dict:
    """Transcribe podcast and generate summary with highlights."""

    # 1. Transcribe with timestamps
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"]
        )

    # 2. Generate summary
    summary_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert at summarizing podcast content. Create concise, informative summaries."},
            {"role": "user", "content": f"Summarize this podcast transcript in 3-5 bullet points:\n\n{transcript.text}"}
        ]
    )

    # 3. Extract key moments
    moments_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Identify the 3 most interesting or important moments in this transcript."},
            {"role": "user", "content": transcript.text}
        ]
    )

    return {
        "duration": transcript.duration,
        "full_transcript": transcript.text,
        "segments": transcript.segments,
        "summary": summary_response.choices[0].message.content,
        "key_moments": moments_response.choices[0].message.content
    }

# Usage
result = process_podcast("episode_42.mp3")
print(f"Duration: {result['duration']:.1f} seconds")
print(f"\nSummary:\n{result['summary']}")
print(f"\nKey Moments:\n{result['key_moments']}")

Content Creation Pipeline

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def create_audio_content(topic: str, style: str = "educational") -> dict:
    """Generate script and audio content on any topic."""

    style_prompts = {
        "educational": "Create an informative, clear explanation suitable for learning.",
        "entertaining": "Create an engaging, fun narrative that entertains while informing.",
        "professional": "Create a polished, business-appropriate presentation.",
        "conversational": "Create a casual, friendly discussion as if talking to a friend."
    }

    voice_for_style = {
        "educational": "fable",
        "entertaining": "nova",
        "professional": "onyx",
        "conversational": "alloy"
    }

    # 1. Generate script
    script_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are a content creator. {style_prompts.get(style, style_prompts['educational'])}"},
            {"role": "user", "content": f"Create a 1-minute script about: {topic}"}
        ],
        max_tokens=500
    )

    script = script_response.choices[0].message.content

    # 2. Generate audio
    voice = voice_for_style.get(style, "alloy")
    audio_response = client.audio.speech.create(
        model="styletts2",  # High quality for content
        voice=voice,
        input=script,
        response_format="wav"
    )

    output_path = Path(f"{topic.replace(' ', '_')}_{style}.wav")
    audio_response.stream_to_file(output_path)

    return {
        "topic": topic,
        "style": style,
        "script": script,
        "audio_file": str(output_path),
        "voice": voice
    }

# Usage
content = create_audio_content("quantum computing basics", "educational")
print(f"Script:\n{content['script']}")
print(f"\nAudio saved to: {content['audio_file']}")

Tips and Best Practices

1. Use System Prompts Effectively

messages = [
    {"role": "system", "content": """You are an expert Python programmer.
    - Write clean, readable code
    - Include docstrings and type hints
    - Handle errors gracefully"""},
    {"role": "user", "content": "Write a function to parse CSV files"}
]

2. Adjust Temperature for Your Use Case

| Temperature | Use Case |
|-------------|----------|
| 0.0 - 0.3 | Factual, deterministic (code, math) |
| 0.3 - 0.7 | Balanced (general chat) |
| 0.7 - 1.0 | Creative (stories, brainstorming) |
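
In code (a sketch; client is configured as in the earlier examples):

# Low temperature for deterministic output such as code
code = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a binary search function"}],
    temperature=0.2
)

# High temperature for creative brainstorming
ideas = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Brainstorm five product names"}],
    temperature=0.9
)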

3. Set Appropriate Max Tokens

# Short response expected
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    max_tokens=100  # Limit output length
)

4. Use Streaming for Long Responses

Streaming shows output as it's generated, improving perceived speed:

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a long essay"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

5. Handle Rate Limits Gracefully

import time
from openai import RateLimitError

def chat_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

6. TTS Best Practices

- Use tts-1 (Piper) for latency-sensitive audio like notifications; save styletts2 for final, high-quality output
- Punctuate input text properly; the engines rely on punctuation for phrasing and pauses
- Choose a lossless format (wav, flac) when the audio will be edited afterwards
- Match the voice to the content, e.g. onyx for authoritative narration, nova for friendly dialogue

7. STT Best Practices

- Use clean recordings; background noise is the most common cause of inaccurate transcripts
- Specify the language when you know it rather than relying on auto-detection
- Use whisper-1 for speed and whisper-large-v3 for maximum accuracy
- Keep files under the 25MB limit and split longer recordings
- Request verbose_json with timestamps when you need subtitles or downstream editing

Troubleshooting

Problem: "401 Unauthorized"

Cause: Invalid or missing API key

Solution:
1. Check your API key is correct
2. Ensure the Authorization header is set: Bearer YOUR_KEY
3. Verify the key hasn't expired

Problem: "Model not found"

Cause: Requested model doesn't exist or isn't available

Solution:
1. List available models: curl https://llm.haiven.local/v1/models
2. Check spelling of model name
3. Use a model alias like gpt-4

Problem: "Rate limit exceeded"

Cause: Too many requests too quickly

Solution:
1. Wait and retry (rate limit is 100 req/s average, 200 burst)
2. Implement exponential backoff
3. Request a higher rate limit from admin

Problem: "Connection refused"

Cause: Service might be down or unreachable

Solution:
1. Check service health: curl https://llm.haiven.local/health
2. Verify you're on the correct network
3. Contact system admin if issues persist

Problem: Slow responses

Cause: Model loading or high server load

Solution:
1. First request to a model is slower (model loading)
2. Use smaller models for faster responses
3. Set reasonable max_tokens limits

Problem: TTS audio quality is poor

Cause: Using wrong model or voice for content type

Solution:
1. Use styletts2 for high-quality output
2. Try different voices for your content type
3. Add proper punctuation to input text

Problem: STT transcription is inaccurate

Cause: Poor audio quality or wrong language setting

Solution:
1. Improve audio quality if possible
2. Specify the correct language in the request
3. Use whisper-large-v3 for better accuracy


FAQ

Q: Do I need an API key?

Yes, all requests require an API key. Contact your admin to get one, or create one yourself if you have master key access.

Q: Which model should I use?

For most tasks, gpt-4 (which maps to qwen3-30b-a3b) is a good default. For faster responses, try gpt-3.5-turbo.

Q: Is there a cost?

LiteLLM tracks usage but costs depend on your organization's policies. Check your usage in the Admin UI.

Q: Can I use this with ChatGPT clients?

Yes! Any OpenAI-compatible client works. Just change the base URL to https://llm.haiven.local/v1.
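
The official OpenAI Python SDK also reads its base URL and key from environment variables, so existing scripts can be redirected without code changes (a sketch; OPENAI_BASE_URL and OPENAI_API_KEY are the variable names used by the v1 Python SDK):

import os
from openai import OpenAI

os.environ["OPENAI_BASE_URL"] = "https://llm.haiven.local/v1"
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

client = OpenAI()  # picks up the base URL and key from the environment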

Q: How do I see my usage?

Visit https://litellm.haiven.local/ui and log in with your API key to see your usage dashboard.

Q: Are my conversations logged?

Yes, requests are logged to Langfuse for observability. Contact your admin about data retention policies.

Q: How do I use text-to-speech?

Use the /v1/audio/speech endpoint with model tts-1 (fast), tts-1-hd (better), or styletts2 (best). Choose a voice like alloy, nova, or echo.

Q: How do I transcribe audio?

Use the /v1/audio/transcriptions endpoint with model whisper-1 or whisper-large-v3. Upload your audio file as a multipart form.

Q: What audio formats are supported?

For TTS output: mp3, wav, opus, flac, aac, pcm. For STT input: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac.

Q: Can the AI search the web?

Yes! Models with function calling support can use the SearXNG search tool. Include tools in your request to enable web search.

Q: What's the difference between the TTS models?

tts-1 uses Piper on CPU and is the fastest, with good quality; tts-1-hd uses XTTS on CPU, trading speed for higher quality; styletts2 runs on GPU and delivers the highest quality but is the slowest. See the TTS Engines Comparison table for details.

Q: Can I use TTS/STT without going through LiteLLM?

Yes, use the pass-through endpoints (/tts/..., /styletts2/..., /stt/...) for direct access to the backends.


Quick Reference

Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/chat/completions | POST | Chat with AI |
| /v1/completions | POST | Text completion |
| /v1/models | GET | List models |
| /v1/embeddings | POST | Generate embeddings |
| /v1/audio/speech | POST | Text-to-speech |
| /v1/audio/transcriptions | POST | Speech-to-text |
| /v1/audio/translations | POST | Translate audio to English |
| /tts/v1/audio/speech | POST | Direct Piper TTS |
| /styletts2/v1/audio/speech | POST | Direct StyleTTS2 |
| /stt/v1/audio/transcriptions | POST | Direct Whisper |
| /health | GET | Health check |

TTS Models

| Model | Speed | Quality | Backend |
|-------|-------|---------|---------|
| tts-1 | Fast | Good | Piper (CPU) |
| tts-1-hd | Medium | High | XTTS (CPU) |
| styletts2 | Slow | Highest | StyleTTS2 (GPU) |

TTS Voices

| Voice | Description |
|-------|-------------|
| alloy | Neutral, balanced |
| echo | Male, warm |
| fable | British accent |
| onyx | Male, deep |
| nova | Female, friendly |
| shimmer | Female, expressive |

STT Models

| Model | Speed | Accuracy |
|-------|-------|----------|
| whisper-1 | Fast | High |
| whisper-large-v3 | Medium | Highest |

Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| model | string | Model to use |
| messages | array | Conversation history |
| temperature | float | Randomness (0-2) |
| max_tokens | int | Max response length |
| stream | bool | Enable streaming |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Reduce repetition |
| presence_penalty | float | Encourage new topics |

Message Roles

| Role | Purpose |
|------|---------|
| system | Set AI behavior/personality |
| user | Your input |
| assistant | AI's previous responses |

Example Request

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Need Help?

Check the Troubleshooting and FAQ sections above, or contact your system administrator.