LiteLLM Proxy User Guide

Your gateway to AI-powered conversations, speech synthesis, transcription, and web search on Haiven


Table of Contents

  1. Getting Started
  2. Accessing the API
  3. Using the Admin UI
  4. Common Use Cases
  5. Working with Models
  6. Text-to-Speech (TTS) Deep Dive
  7. Speech-to-Text (STT) Deep Dive
  8. Web Search Integration
  9. Pass-through Endpoints
  10. API Key Management
  11. Advanced Workflows
  12. Tips and Best Practices
  13. Troubleshooting
  14. FAQ
  15. Quick Reference

Getting Started

LiteLLM is your central hub for accessing all AI capabilities on Haiven. It provides a standard OpenAI-compatible API, so any tool or library that works with OpenAI will work with LiteLLM.

What You Can Do

- Chat with locally hosted LLMs through an OpenAI-compatible API
- Convert text to speech with three TTS engines (Piper, XTTS, StyleTTS2)
- Transcribe and translate audio with GPU-accelerated Whisper
- Let models search the web through the SearXNG tool integration
- Manage API keys, budgets, and usage from the Admin UI

Access URLs

| What | Where |
|------|-------|
| API Endpoint | https://llm.haiven.local/v1 |
| Admin Dashboard | https://litellm.haiven.local/ui |
| Health Status | https://llm.haiven.local/health |
| TTS Pass-through (Piper) | https://llm.haiven.local/tts/v1/audio/speech |
| TTS Pass-through (StyleTTS2) | https://llm.haiven.local/styletts2/v1/audio/speech |
| STT Pass-through | https://llm.haiven.local/stt/v1/audio/transcriptions |

Accessing the API

From Command Line (curl)

# Simple chat request
curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

From Python

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather like?"}
    ]
)

print(response.choices[0].message.content)

From JavaScript/Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://llm.haiven.local/v1',
  apiKey: 'YOUR_API_KEY',
});

const response = await openai.chat.completions.create({
  model: 'qwen3-30b-a3b',
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(response.choices[0].message.content);

Internal Docker Access

If you're running code inside the Haiven Docker network:

from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000/v1",
    api_key="YOUR_API_KEY"
)

Using the Admin UI

Accessing the Dashboard

  1. Open your browser to https://litellm.haiven.local/ui
  2. Log in with your API key (or master key for admin access)
  3. You'll see the main dashboard with usage statistics

Dashboard Features

| Tab | What It Shows |
|-----|---------------|
| Dashboard | Usage statistics, request counts, token usage |
| Keys | Create and manage API keys |
| Models | Available models and their configurations |
| Usage | Detailed spend and usage logs |
| Settings | Configuration options |

Creating a New API Key

  1. Click on Keys in the sidebar
  2. Click Create New Key
  3. Fill in the details:
    - Key Name: A descriptive name (e.g., "My Project")
    - Models: Which models this key can access
    - Budget: Optional spending limit
    - Duration: How long the key is valid
  4. Click Create
  5. Copy the generated key (you won't see it again!)

Common Use Cases

1. Simple Chat Conversation

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "How do I read a file in Python?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

2. Streaming Response

Get responses as they're generated:

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Write a short story"}],
    "stream": true
  }'

3. Code Generation

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [
      {"role": "system", "content": "You are an expert Python programmer. Write clean, efficient code."},
      {"role": "user", "content": "Write a function to calculate Fibonacci numbers with memoization"}
    ],
    "temperature": 0.2
  }'

4. Multi-turn Conversation

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

messages = [
    {"role": "system", "content": "You are a helpful tutor."}
]

# First turn
messages.append({"role": "user", "content": "Explain what a variable is in programming"})
response = client.chat.completions.create(model="gpt-4", messages=messages)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Second turn
messages.append({"role": "user", "content": "Can you give me an example?"})
response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)

5. JSON Mode Response

Get structured JSON output:

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant that responds in JSON format."},
      {"role": "user", "content": "List 3 programming languages with their key features"}
    ],
    "response_format": {"type": "json_object"}
  }'
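
The same request from Python. A minimal sketch; it assumes the deployment honors response_format for this model. The message content arrives as a JSON string, so parse it with json.loads:

from openai import OpenAI
import json

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in JSON format."},
        {"role": "user", "content": "List 3 programming languages with their key features"}
    ],
    response_format={"type": "json_object"}
)

# The content is a JSON string, not a parsed object
data = json.loads(response.choices[0].message.content)
print(data)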

Working with Models

Available LLM Models

| Model Name | Best For | Size |
|------------|----------|------|
| qwen3-30b-a3b | General purpose, coding | 30B params |
| qwen2.5-14b-instruct | Fast responses | 14B params |
| gemma3-27b | General purpose | 27B params |
| gpt-oss-120b | Complex reasoning | 120B params |

Model Aliases

For compatibility with OpenAI tools, you can use these aliases:

| Alias | Maps To |
|-------|---------|
| gpt-4 | qwen3-30b-a3b |
| gpt-4-turbo | qwen3-30b-a3b |
| gpt-3.5-turbo | qwen2.5-14b-instruct |
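
For example, a request for gpt-4 is transparently served by qwen3-30b-a3b (assuming the alias table above; client is an OpenAI client configured as in the earlier examples):

# The alias keeps OpenAI-based tooling working unchanged
response = client.chat.completions.create(
    model="gpt-4",  # routed to qwen3-30b-a3b by the proxy
    messages=[{"role": "user", "content": "Hello!"}]
)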

List All Models

curl https://llm.haiven.local/v1/models \
  -H "Authorization: Bearer $API_KEY" | jq '.data[].id'

Choosing the Right Model

- qwen3-30b-a3b (alias gpt-4): the best default for general tasks and coding
- qwen2.5-14b-instruct (alias gpt-3.5-turbo): the fastest option for simple tasks
- gemma3-27b: an alternative general-purpose model
- gpt-oss-120b: complex reasoning where quality matters more than speed

Text-to-Speech (TTS) Deep Dive

LiteLLM provides access to three TTS engines, each with different characteristics.

TTS Engines Comparison

| Model | Engine | Speed | Quality | GPU Required | Best For |
|-------|--------|-------|---------|--------------|----------|
| tts-1 | Piper (ONNX) | Very Fast | Good | No (CPU) | Quick responses, notifications |
| tts-1-hd | XTTS | Medium | High | No (CPU) | Voice cloning, professional audio |
| styletts2 | StyleTTS2 | Slow | Highest | Yes (RTX 4090) | Maximum quality, style transfer |

Available Voices

All three engines support these OpenAI-compatible voice names:

| Voice | Description | Character |
|-------|-------------|-----------|
| alloy | Neutral, balanced | Professional, clear |
| echo | Male, warm | Approachable, friendly |
| fable | British accent | Storyteller, narrator |
| onyx | Male, deep | Authoritative, commanding |
| nova | Female, friendly | Conversational, warm |
| shimmer | Female, expressive | Energetic, enthusiastic |

Basic TTS Usage

# Using Piper TTS (fastest)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "Hello! Welcome to Haiven.",
    "voice": "alloy"
  }' --output hello.mp3

High-Quality TTS with StyleTTS2

# Using StyleTTS2 (highest quality)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "styletts2",
    "input": "This is professional-quality neural speech synthesis.",
    "voice": "nova"
  }' --output professional.wav

Python TTS Examples

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

# Quick TTS with Piper
def quick_speak(text: str, output_file: str = "output.mp3"):
    """Fast TTS for notifications and quick responses."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text
    )
    response.stream_to_file(Path(output_file))
    return output_file

# High-quality TTS with StyleTTS2
def professional_speak(text: str, voice: str = "nova", output_file: str = "professional.wav"):
    """High-quality TTS for professional content."""
    response = client.audio.speech.create(
        model="styletts2",
        voice=voice,
        input=text
    )
    response.stream_to_file(Path(output_file))
    return output_file

# Generate all voices for comparison
def generate_voice_samples(text: str):
    """Generate samples of all available voices."""
    voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

    for voice in voices:
        response = client.audio.speech.create(
            model="tts-1",
            voice=voice,
            input=text
        )
        response.stream_to_file(Path(f"sample_{voice}.mp3"))
        print(f"Generated sample_{voice}.mp3")

# Usage
quick_speak("You have a new message.")
professional_speak("Welcome to our quarterly earnings call.")
generate_voice_samples("The quick brown fox jumps over the lazy dog.")

Audio Output Formats

| Format | Extension | Description | File Size |
|--------|-----------|-------------|-----------|
| mp3 | .mp3 | Most compatible, lossy | Medium |
| opus | .opus | Best compression, lossy | Small |
| aac | .aac | Apple devices, lossy | Medium |
| flac | .flac | Lossless compression | Large |
| wav | .wav | Uncompressed, lossless | Very Large |
| pcm | .pcm | Raw audio, lossless | Very Large |

# Generate in different formats
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "Testing different audio formats.",
    "voice": "alloy",
    "response_format": "opus"
  }' --output speech.opus

# High-quality WAV for editing
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "styletts2",
    "input": "Uncompressed audio for post-processing.",
    "voice": "nova",
    "response_format": "wav"
  }' --output speech.wav

Speed Control

Adjust speech speed with the speed parameter (0.25 to 4.0):

# Slow narration (0.75x speed)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "This is spoken slowly for clarity and emphasis.",
    "voice": "fable",
    "speed": 0.75
  }' --output slow.mp3

# Fast reading (1.5x speed)
curl https://llm.haiven.local/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "tts-1",
    "input": "This is spoken quickly for time-sensitive information.",
    "voice": "alloy",
    "speed": 1.5
  }' --output fast.mp3

TTS Use Case Examples

from openai import OpenAI
from pathlib import Path

client = OpenAI(base_url="https://llm.haiven.local/v1", api_key="YOUR_API_KEY")

# 1. Notification Sound
def create_notification(message: str):
    response = client.audio.speech.create(
        model="tts-1",  # Fast
        voice="alloy",
        input=message,
        speed=1.2  # Slightly faster
    )
    response.stream_to_file(Path("notification.mp3"))

# 2. Podcast Intro
def create_podcast_intro(show_name: str, episode_title: str):
    script = f"Welcome to {show_name}. Today's episode: {episode_title}."
    response = client.audio.speech.create(
        model="styletts2",  # High quality
        voice="onyx",  # Authoritative
        input=script
    )
    response.stream_to_file(Path("podcast_intro.wav"))

# 3. Audiobook Chapter
def narrate_chapter(text: str, chapter_num: int):
    response = client.audio.speech.create(
        model="styletts2",
        voice="fable",  # British narrator
        input=text,
        response_format="flac",  # Lossless for editing
        speed=0.9  # Slightly slower for clarity
    )
    response.stream_to_file(Path(f"chapter_{chapter_num}.flac"))

# 4. Voice Assistant Response
def assistant_response(text: str):
    response = client.audio.speech.create(
        model="tts-1",  # Fast response
        voice="nova",  # Friendly female
        input=text
    )
    return response.content  # Return bytes for immediate playback

# 5. Multi-language Announcement (using different voices)
def announcement(en_text: str, output_prefix: str):
    voices_per_region = {
        "us": "alloy",
        "uk": "fable",
        "casual": "nova"
    }

    for region, voice in voices_per_region.items():
        response = client.audio.speech.create(
            model="tts-1",
            voice=voice,
            input=en_text
        )
        response.stream_to_file(Path(f"{output_prefix}_{region}.mp3"))

Batch TTS Generation

import asyncio
from openai import AsyncOpenAI
from pathlib import Path

async_client = AsyncOpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

async def batch_tts(texts: list[str], voice: str = "alloy"):
    """Generate TTS for multiple texts concurrently."""
    async def generate_one(idx: int, text: str):
        response = await async_client.audio.speech.create(
            model="tts-1",
            voice=voice,
            input=text
        )
        output_path = Path(f"output_{idx}.mp3")
        response.stream_to_file(output_path)
        return output_path

    tasks = [generate_one(i, text) for i, text in enumerate(texts)]
    results = await asyncio.gather(*tasks)
    return results

# Usage
texts = [
    "First notification message.",
    "Second notification message.",
    "Third notification message."
]
asyncio.run(batch_tts(texts))

Speech-to-Text (STT) Deep Dive

LiteLLM provides GPU-accelerated speech recognition using Faster-Whisper.

STT Models

| Model | Speed | Accuracy | Languages |
|-------|-------|----------|-----------|
| whisper-1 | Fast | High | 99+ languages |
| whisper-large-v3 | Medium | Highest | 99+ languages |

Both models run on GPU for fast transcription and support automatic language detection.
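
To see automatic language detection in action, omit the language parameter and request verbose_json, which includes the detected language. A sketch; the file name is illustrative:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

with open("mystery_language.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json"  # includes the detected language
    )

print(transcript.language)  # e.g. "english"
print(transcript.text)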

Basic Transcription

# Simple transcription
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-1"

Response:

{
  "text": "Hello, this is a test recording of the speech recognition system."
}

Transcription with Timestamps

# Get word-level timestamps
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@interview.wav" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"

Response:

{
  "task": "transcribe",
  "language": "english",
  "duration": 15.5,
  "text": "Welcome to the interview.",
  "words": [
    {"word": "Welcome", "start": 0.0, "end": 0.5},
    {"word": "to", "start": 0.5, "end": 0.7},
    {"word": "the", "start": 0.7, "end": 0.9},
    {"word": "interview", "start": 0.9, "end": 1.5}
  ]
}

Segment-Level Timestamps

# Get segment-level timestamps (sentences/phrases)
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@podcast.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=segment"

Response:

{
  "task": "transcribe",
  "language": "english",
  "duration": 120.5,
  "text": "Welcome to the show. Today we discuss AI.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Welcome to the show.",
      "tokens": [50364, 5765, 281, 264, 1656, 13],
      "temperature": 0.0,
      "avg_logprob": -0.25,
      "compression_ratio": 1.2,
      "no_speech_prob": 0.01
    },
    {
      "id": 1,
      "seek": 250,
      "start": 2.5,
      "end": 5.0,
      "text": "Today we discuss AI.",
      "tokens": [50364, 2692, 321, 2248, 7318, 13],
      "temperature": 0.0,
      "avg_logprob": -0.18,
      "compression_ratio": 1.1,
      "no_speech_prob": 0.02
    }
  ]
}

Language-Specific Transcription

# Transcribe Spanish audio
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@spanish_audio.mp3" \
  -F "model=whisper-large-v3" \
  -F "language=es"

# Transcribe Japanese audio
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@japanese_audio.wav" \
  -F "model=whisper-large-v3" \
  -F "language=ja"

Audio Translation (to English)

# Translate any language audio to English text
curl https://llm.haiven.local/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@french_speech.mp3" \
  -F "model=whisper-large-v3"

Response Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| json | Simple JSON with text | Default, simple integration |
| text | Plain text only | Minimal processing |
| srt | SubRip subtitle format | Video subtitles |
| verbose_json | Full details with timestamps | Analytics, editing |
| vtt | WebVTT subtitle format | Web video players |

# Generate SRT subtitles
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@video_audio.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" \
  --output subtitles.srt

# Generate VTT subtitles for web
curl https://llm.haiven.local/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@webinar.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=vtt" \
  --output captions.vtt

Python STT Examples

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

# 1. Simple Transcription
def transcribe_audio(file_path: str) -> str:
    """Transcribe audio file to text."""
    with open(file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcript.text

# 2. Transcription with Timestamps
def transcribe_with_timestamps(file_path: str):
    """Transcribe with word-level timestamps."""
    with open(file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"]
        )
    return transcript

# 3. Generate Subtitles
def generate_subtitles(file_path: str, format: str = "srt") -> str:
    """Generate subtitle file from audio."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format=format
        )

    output_path = Path(file_path).stem + f".{format}"
    with open(output_path, "w") as f:
        f.write(result)
    return output_path

# 4. Translate Foreign Audio to English
def translate_to_english(file_path: str) -> str:
    """Translate foreign language audio to English text."""
    with open(file_path, "rb") as audio_file:
        translation = client.audio.translations.create(
            model="whisper-large-v3",
            file=audio_file
        )
    return translation.text

# 5. Transcribe with Specific Language
def transcribe_language(file_path: str, language: str) -> str:
    """Transcribe audio in a specific language."""
    with open(file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            language=language  # e.g., "es", "fr", "de", "ja", "zh"
        )
    return transcript.text

# 6. Meeting Transcription with Speaker Diarization Prep
def meeting_transcription(file_path: str) -> dict:
    """Transcribe meeting with segments for speaker identification."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"]
        )

    # Process segments for meeting minutes
    segments = []
    for seg in result.segments:
        # verbose_json returns segment objects; use attribute access, not dict keys
        segments.append({
            "start": seg.start,
            "end": seg.end,
            "text": seg.text,
            "duration": seg.end - seg.start
        })

    return {
        "full_text": result.text,
        "duration": result.duration,
        "segments": segments
    }

# Usage examples
text = transcribe_audio("recording.mp3")
print(f"Transcription: {text}")

detailed = transcribe_with_timestamps("interview.wav")
print(f"Duration: {detailed.duration}s")
for word in detailed.words[:5]:
    print(f"  {word.word} ({word.start:.2f}s - {word.end:.2f}s)")

subtitles = generate_subtitles("video.mp3", "srt")
print(f"Subtitles saved to: {subtitles}")

english = translate_to_english("spanish_podcast.mp3")
print(f"English translation: {english}")

Supported Audio Formats

| Format | Extension | Max Size |
|--------|-----------|----------|
| MP3 | .mp3 | 25MB |
| MP4 Audio | .mp4, .m4a | 25MB |
| WAV | .wav | 25MB |
| WebM | .webm | 25MB |
| MPEG | .mpeg, .mpga | 25MB |
| OGG | .ogg | 25MB |
| FLAC | .flac | 25MB |

Web Search Integration

LiteLLM integrates with SearXNG to enable AI models to search the web and provide up-to-date information.

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "qwen3-30b-a3b-q8-abl",
    "messages": [
      {"role": "user", "content": "What are the latest developments in AI this week?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "searxng-search",
        "description": "Search the web for current information"
      }
    }],
    "tool_choice": "auto"
  }'

Python Web Search Example

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def search_and_answer(question: str) -> str:
    """Use AI with web search to answer questions."""
    response = client.chat.completions.create(
        model="qwen3-30b-a3b-q8-abl",  # Model with function calling support
        messages=[
            {"role": "system", "content": "You are a helpful assistant with web search capabilities. Use search when you need current information."},
            {"role": "user", "content": question}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "searxng-search",
                "description": "Search the web for information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query"
                        }
                    },
                    "required": ["query"]
                }
            }
        }],
        tool_choice="auto"
    )

    return response.choices[0].message.content

# Usage
answer = search_and_answer("What are the latest news about OpenAI?")
print(answer)

answer = search_and_answer("What's the current price of Bitcoin?")
print(answer)

Research Assistant Example

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def research_topic(topic: str, depth: str = "overview") -> dict:
    """Research a topic using AI and web search."""

    prompts = {
        "overview": f"Provide a brief overview of {topic} with current information.",
        "detailed": f"Provide a comprehensive analysis of {topic} including recent developments, key players, and future outlook.",
        "news": f"What are the latest news and developments about {topic}?"
    }

    response = client.chat.completions.create(
        model="qwen3-30b-a3b-q8-abl",
        messages=[
            {"role": "system", "content": "You are a research assistant. Search the web for current information and provide well-sourced responses."},
            {"role": "user", "content": prompts.get(depth, prompts["overview"])}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "searxng-search",
                "description": "Search the web"
            }
        }],
        tool_choice="auto"
    )

    return {
        "topic": topic,
        "depth": depth,
        "response": response.choices[0].message.content,
        "model": response.model,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens
        }
    }

# Usage
research = research_topic("quantum computing", "detailed")
print(research["response"])

Pass-through Endpoints

Pass-through endpoints provide direct access to backend services, bypassing LiteLLM's routing logic. They are useful for:
- Avoiding API key requirements for internal services
- Direct access when you know which backend to use
- Reduced latency (no routing overhead)

Available Pass-through Endpoints

| Endpoint | Backend | Purpose |
|----------|---------|---------|
| /tts/v1/audio/speech | openedai-speech | Direct Piper/XTTS TTS |
| /styletts2/v1/audio/speech | styletts2-openai | Direct StyleTTS2 TTS |
| /stt/v1/audio/transcriptions | faster-whisper | Direct Whisper STT |

Direct TTS Access

# Direct Piper TTS (no API key needed on internal network)
curl https://llm.haiven.local/tts/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Direct access to Piper TTS.",
    "voice": "alloy"
  }' --output direct_piper.mp3

# Direct StyleTTS2 (no API key needed on internal network)
curl https://llm.haiven.local/styletts2/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Direct access to StyleTTS2.",
    "voice": "nova"
  }' --output direct_styletts2.wav

Direct STT Access

# Direct Whisper STT (no API key needed on internal network)
curl https://llm.haiven.local/stt/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"

When to Use Pass-through

| Scenario | Use Pass-through | Use Standard API |
|----------|------------------|------------------|
| Internal automation | Yes | - |
| Usage tracking needed | - | Yes |
| API key management | - | Yes |
| Lowest latency | Yes | - |
| Budget limits | - | Yes |
| Langfuse tracing | - | Yes |
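
Because pass-through endpoints skip LiteLLM's routing and key checks, you can call them with plain HTTP. A minimal sketch using the requests library, assuming unauthenticated access from the internal network as described above:

import requests

# Direct Piper TTS via the pass-through endpoint
resp = requests.post(
    "https://llm.haiven.local/tts/v1/audio/speech",
    json={"model": "tts-1", "input": "Pass-through test.", "voice": "alloy"},
    timeout=60
)
resp.raise_for_status()

with open("passthrough.mp3", "wb") as f:
    f.write(resp.content)  # raw audio bytes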

API Key Management

Types of Keys

| Type | Purpose | Who Creates It |
|------|---------|----------------|
| Master Key | Full admin access | System admin |
| Virtual Key | Limited access | Created via API/UI |

Creating a Key via API

curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4", "gpt-3.5-turbo", "tts-1", "whisper-1"],
    "user_id": "my-project",
    "max_budget": 50.00,
    "duration": "30d",
    "metadata": {"project": "my-app", "team": "engineering"}
  }'

Creating Keys for Specific Use Cases

# TTS-only key
curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["tts-1", "tts-1-hd", "styletts2"],
    "user_id": "tts-service",
    "max_budget": 10.00,
    "duration": "7d"
  }'

# STT-only key
curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["whisper-1", "whisper-large-v3"],
    "user_id": "transcription-service",
    "max_budget": 20.00,
    "duration": "30d"
  }'

# Full access key for development
curl https://llm.haiven.local/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": [],
    "user_id": "developer",
    "max_budget": 100.00,
    "duration": "90d",
    "metadata": {"environment": "development"}
  }'

Checking Key Info

curl https://llm.haiven.local/key/info \
  -H "Authorization: Bearer $YOUR_KEY"

Deleting a Key

curl -X POST https://llm.haiven.local/key/delete \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-..."]}'
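
Scripting Key Management

The same key endpoints can be scripted. A sketch with the requests library; it assumes /key/generate returns the new key under a "key" field, which matches LiteLLM's documented response:

import requests

BASE_URL = "https://llm.haiven.local"
HEADERS = {"Authorization": "Bearer MASTER_KEY"}

def create_key(models: list[str], user_id: str, max_budget: float, duration: str) -> str:
    """Generate a virtual key and return the secret value."""
    resp = requests.post(
        f"{BASE_URL}/key/generate",
        headers=HEADERS,
        json={"models": models, "user_id": user_id,
              "max_budget": max_budget, "duration": duration}
    )
    resp.raise_for_status()
    return resp.json()["key"]  # assumption: the new key is returned as "key"

def delete_key(key: str) -> None:
    """Revoke a virtual key."""
    resp = requests.post(f"{BASE_URL}/key/delete", headers=HEADERS, json={"keys": [key]})
    resp.raise_for_status()

# Usage
key = create_key(["tts-1"], "tts-service", 10.00, "7d")
print(f"New key: {key}")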

Advanced Workflows

Voice-Enabled Chatbot

from openai import OpenAI
from pathlib import Path
import tempfile

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

class VoiceChatbot:
    def __init__(self, voice: str = "nova"):
        self.voice = voice
        self.messages = [
            {"role": "system", "content": "You are a friendly voice assistant. Keep responses concise and conversational."}
        ]

    def transcribe(self, audio_path: str) -> str:
        """Convert user speech to text."""
        with open(audio_path, "rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )
        return transcript.text

    def think(self, user_text: str) -> str:
        """Generate AI response."""
        self.messages.append({"role": "user", "content": user_text})

        response = client.chat.completions.create(
            model="gpt-4",
            messages=self.messages,
            max_tokens=150
        )

        assistant_response = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_response})

        return assistant_response

    def speak(self, text: str, output_path: str = None) -> str:
        """Convert AI response to speech."""
        response = client.audio.speech.create(
            model="tts-1",  # Fast for conversation
            voice=self.voice,
            input=text
        )

        if output_path is None:
            # NamedTemporaryFile replaces the deprecated, race-prone tempfile.mktemp
            with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
                output_path = tmp.name

        response.stream_to_file(Path(output_path))
        return output_path

    def chat(self, audio_input_path: str) -> tuple[str, str, str]:
        """Full voice chat cycle: listen -> think -> speak."""
        # 1. Transcribe user speech
        user_text = self.transcribe(audio_input_path)

        # 2. Generate response
        response_text = self.think(user_text)

        # 3. Convert to speech
        audio_output_path = self.speak(response_text)

        return user_text, response_text, audio_output_path

# Usage
bot = VoiceChatbot(voice="nova")
user_said, bot_said, audio_file = bot.chat("user_recording.mp3")
print(f"User: {user_said}")
print(f"Bot: {bot_said}")
print(f"Audio: {audio_file}")

Podcast Transcription and Summarization

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def process_podcast(audio_path: str) -> dict:
    """Transcribe podcast and generate summary with highlights."""

    # 1. Transcribe with timestamps
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"]
        )

    # 2. Generate summary
    summary_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert at summarizing podcast content. Create concise, informative summaries."},
            {"role": "user", "content": f"Summarize this podcast transcript in 3-5 bullet points:\n\n{transcript.text}"}
        ]
    )

    # 3. Extract key moments
    moments_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Identify the 3 most interesting or important moments in this transcript."},
            {"role": "user", "content": transcript.text}
        ]
    )

    return {
        "duration": transcript.duration,
        "full_transcript": transcript.text,
        "segments": transcript.segments,
        "summary": summary_response.choices[0].message.content,
        "key_moments": moments_response.choices[0].message.content
    }

# Usage
result = process_podcast("episode_42.mp3")
print(f"Duration: {result['duration']:.1f} seconds")
print(f"\nSummary:\n{result['summary']}")
print(f"\nKey Moments:\n{result['key_moments']}")

Content Creation Pipeline

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    base_url="https://llm.haiven.local/v1",
    api_key="YOUR_API_KEY"
)

def create_audio_content(topic: str, style: str = "educational") -> dict:
    """Generate script and audio content on any topic."""

    style_prompts = {
        "educational": "Create an informative, clear explanation suitable for learning.",
        "entertaining": "Create an engaging, fun narrative that entertains while informing.",
        "professional": "Create a polished, business-appropriate presentation.",
        "conversational": "Create a casual, friendly discussion as if talking to a friend."
    }

    voice_for_style = {
        "educational": "fable",
        "entertaining": "nova",
        "professional": "onyx",
        "conversational": "alloy"
    }

    # 1. Generate script
    script_response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are a content creator. {style_prompts.get(style, style_prompts['educational'])}"},
            {"role": "user", "content": f"Create a 1-minute script about: {topic}"}
        ],
        max_tokens=500
    )

    script = script_response.choices[0].message.content

    # 2. Generate audio
    voice = voice_for_style.get(style, "alloy")
    audio_response = client.audio.speech.create(
        model="styletts2",  # High quality for content
        voice=voice,
        input=script,
        response_format="wav"
    )

    output_path = Path(f"{topic.replace(' ', '_')}_{style}.wav")
    audio_response.stream_to_file(output_path)

    return {
        "topic": topic,
        "style": style,
        "script": script,
        "audio_file": str(output_path),
        "voice": voice
    }

# Usage
content = create_audio_content("quantum computing basics", "educational")
print(f"Script:\n{content['script']}")
print(f"\nAudio saved to: {content['audio_file']}")

Tips and Best Practices

1. Use System Prompts Effectively

messages = [
    {"role": "system", "content": """You are an expert Python programmer.
    - Write clean, readable code
    - Include docstrings and type hints
    - Handle errors gracefully"""},
    {"role": "user", "content": "Write a function to parse CSV files"}
]

2. Adjust Temperature for Your Use Case

| Temperature | Use Case |
|-------------|----------|
| 0.0 - 0.3 | Factual, deterministic (code, math) |
| 0.3 - 0.7 | Balanced (general chat) |
| 0.7 - 1.0 | Creative (stories, brainstorming) |
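
In code (a sketch; client is configured as in the earlier examples):

# Low temperature for deterministic output such as code
code = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a binary search function"}],
    temperature=0.2
)

# High temperature for creative brainstorming
ideas = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Brainstorm five product names"}],
    temperature=0.9
)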

3. Set Appropriate Max Tokens

# Short response expected
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    max_tokens=100  # Limit output length
)

4. Use Streaming for Long Responses

Streaming shows output as it's generated, improving perceived speed:

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a long essay"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

5. Handle Rate Limits Gracefully

import time
from openai import RateLimitError

def chat_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

6. TTS Best Practices

- Use tts-1 (Piper) for latency-sensitive audio like notifications; save styletts2 for final, high-quality output
- Punctuate input text properly; the engines rely on punctuation for phrasing and pauses
- Choose a lossless format (wav, flac) when the audio will be edited afterwards
- Match the voice to the content, e.g. onyx for authoritative narration, nova for friendly dialogue

7. STT Best Practices

- Use clean recordings; background noise is the most common cause of inaccurate transcripts
- Specify the language when you know it rather than relying on auto-detection
- Use whisper-1 for speed and whisper-large-v3 for maximum accuracy
- Keep files under the 25MB limit and split longer recordings
- Request verbose_json with timestamps when you need subtitles or downstream editing

Troubleshooting

Problem: "401 Unauthorized"

Cause: Invalid or missing API key

Solution:
1. Check your API key is correct
2. Ensure the Authorization header is set: Bearer YOUR_KEY
3. Verify the key hasn't expired

Problem: "Model not found"

Cause: Requested model doesn't exist or isn't available

Solution:
1. List available models: curl https://llm.haiven.local/v1/models
2. Check spelling of model name
3. Use a model alias like gpt-4

Problem: "Rate limit exceeded"

Cause: Too many requests too quickly

Solution:
1. Wait and retry (rate limit is 100 req/s average, 200 burst)
2. Implement exponential backoff
3. Request a higher rate limit from admin

Problem: "Connection refused"

Cause: Service might be down or unreachable

Solution:
1. Check service health: curl https://llm.haiven.local/health
2. Verify you're on the correct network
3. Contact system admin if issues persist

Problem: Slow responses

Cause: Model loading or high server load

Solution:
1. First request to a model is slower (model loading)
2. Use smaller models for faster responses
3. Set reasonable max_tokens limits

Problem: TTS audio quality is poor

Cause: Using wrong model or voice for content type

Solution:
1. Use styletts2 for high-quality output
2. Try different voices for your content type
3. Add proper punctuation to input text

Problem: STT transcription is inaccurate

Cause: Poor audio quality or wrong language setting

Solution:
1. Improve audio quality if possible
2. Specify the correct language in the request
3. Use whisper-large-v3 for better accuracy


FAQ

Q: Do I need an API key?

Yes, all requests require an API key. Contact your admin to get one, or create one yourself if you have master key access.

Q: Which model should I use?

For most tasks, gpt-4 (which maps to qwen3-30b-a3b) is a good default. For faster responses, try gpt-3.5-turbo.

Q: Is there a cost?

LiteLLM tracks usage but costs depend on your organization's policies. Check your usage in the Admin UI.

Q: Can I use this with ChatGPT clients?

Yes! Any OpenAI-compatible client works. Just change the base URL to https://llm.haiven.local/v1.
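
The official OpenAI Python SDK also reads its base URL and key from environment variables, so existing scripts can be redirected without code changes (a sketch; OPENAI_BASE_URL and OPENAI_API_KEY are the variable names used by the v1 Python SDK):

import os
from openai import OpenAI

os.environ["OPENAI_BASE_URL"] = "https://llm.haiven.local/v1"
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

client = OpenAI()  # picks up the base URL and key from the environment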

Q: How do I see my usage?

Visit https://litellm.haiven.local/ui and log in with your API key to see your usage dashboard.

Q: Are my conversations logged?

Yes, requests are logged to Langfuse for observability. Contact your admin about data retention policies.

Q: How do I use text-to-speech?

Use the /v1/audio/speech endpoint with model tts-1 (fast), tts-1-hd (better), or styletts2 (best). Choose a voice like alloy, nova, or echo.

Q: How do I transcribe audio?

Use the /v1/audio/transcriptions endpoint with model whisper-1 or whisper-large-v3. Upload your audio file as a multipart form.

Q: What audio formats are supported?

For TTS output: mp3, wav, opus, flac, aac, pcm. For STT input: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac.

Q: Can the AI search the web?

Yes! Models with function calling support can use the SearXNG search tool. Include tools in your request to enable web search.

Q: What's the difference between the TTS models?

tts-1 uses Piper on CPU and is the fastest, with good quality; tts-1-hd uses XTTS on CPU, trading speed for higher quality; styletts2 runs on GPU and delivers the highest quality but is the slowest. See the TTS Engines Comparison table for details.

Q: Can I use TTS/STT without going through LiteLLM?

Yes, use the pass-through endpoints (/tts/..., /styletts2/..., /stt/...) for direct access to the backends.


Quick Reference

Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/chat/completions | POST | Chat with AI |
| /v1/completions | POST | Text completion |
| /v1/models | GET | List models |
| /v1/embeddings | POST | Generate embeddings |
| /v1/audio/speech | POST | Text-to-speech |
| /v1/audio/transcriptions | POST | Speech-to-text |
| /v1/audio/translations | POST | Translate audio to English |
| /tts/v1/audio/speech | POST | Direct Piper TTS |
| /styletts2/v1/audio/speech | POST | Direct StyleTTS2 |
| /stt/v1/audio/transcriptions | POST | Direct Whisper |
| /health | GET | Health check |

TTS Models

| Model | Speed | Quality | Backend |
|-------|-------|---------|---------|
| tts-1 | Fast | Good | Piper (CPU) |
| tts-1-hd | Medium | High | XTTS (CPU) |
| styletts2 | Slow | Highest | StyleTTS2 (GPU) |

TTS Voices

| Voice | Description |
|-------|-------------|
| alloy | Neutral, balanced |
| echo | Male, warm |
| fable | British accent |
| onyx | Male, deep |
| nova | Female, friendly |
| shimmer | Female, expressive |

STT Models

| Model | Speed | Accuracy |
|-------|-------|----------|
| whisper-1 | Fast | High |
| whisper-large-v3 | Medium | Highest |

Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| model | string | Model to use |
| messages | array | Conversation history |
| temperature | float | Randomness (0-2) |
| max_tokens | int | Max response length |
| stream | bool | Enable streaming |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Reduce repetition |
| presence_penalty | float | Encourage new topics |

Message Roles

| Role | Purpose |
|------|---------|
| system | Set AI behavior/personality |
| user | Your input |
| assistant | AI's previous responses |

Example Request

curl https://llm.haiven.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Need Help?

Check the Troubleshooting and FAQ sections above, or contact your system administrator.