{"openapi":"3.1.0","info":{"title":"StyleTTS2 API","description":"High-quality text-to-speech synthesis with voice cloning capabilities.\n\nStyleTTS2 uses a diffusion-based approach to generate human-level speech\nwith fine-grained control over prosody, timbre, and emotional expression.\n\n## Features\n- Human-level speech synthesis quality\n- Voice cloning from short audio samples (3-10 seconds)\n- Fine-grained control via alpha (timbre) and beta (prosody) parameters\n- Adjustable synthesis quality via diffusion steps\n- GPU-accelerated inference on NVIDIA RTX PRO 6000 Blackwell\n\n## Access\n- Direct: `https://tts.haiven.site`\n- Unified Gateway: `https://ai.haiven.site/v1/audio/speech?engine=styletts2`\n","version":"1.0.0","contact":{"name":"Haiven Infrastructure"},"license":{"name":"MIT","url":"https://github.com/yl4579/StyleTTS2/blob/main/LICENSE"}},"servers":[{"url":"https://tts.haiven.site","description":"Primary StyleTTS2 endpoint"},{"url":"https://ai.haiven.site","description":"Unified AI gateway (use ?engine=styletts2)"},{"url":"http://styletts2:5001","description":"Internal Docker network access"}],"tags":[{"name":"Speech","description":"Text-to-speech generation endpoints"},{"name":"Voices","description":"Reference voice management"},{"name":"Health","description":"Service health and monitoring"}],"paths":{"/":{"get":{"tags":["Health"],"summary":"Web UI","description":"Interactive web interface for testing StyleTTS2 synthesis","operationId":"getWebUI","responses":{"200":{"description":"HTML web interface","content":{"text/html":{"schema":{"type":"string"}}}}}}},"/api":{"get":{"tags":["Health"],"summary":"API Information","description":"Returns information about the API and available endpoints","operationId":"getApiInfo","responses":{"200":{"description":"API information","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ApiInfo"},"example":{"service":"StyleTTS2 API","version":"1.0.0","endpoints":{"/health":"Health 
check","/v1/voices":"List available reference voices","/v1/voices/upload":"Upload new reference voice (POST)","/v1/voices/<filename>":"Delete reference voice (DELETE)","/v1/audio/speech":"Generate speech (POST)"}}}}}}}},"/health":{"get":{"tags":["Health"],"summary":"Health Check","description":"Returns service health status including GPU and model information","operationId":"getHealth","responses":{"200":{"description":"Health status","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HealthResponse"},"example":{"status":"healthy","model_loaded":true,"device":"cuda","gpu_available":true,"gpu_name":"NVIDIA RTX PRO 6000 Blackwell","gpu_memory":{"allocated":"0.72 GB","reserved":"0.74 GB"}}}}}}}},"/metrics":{"get":{"tags":["Health"],"summary":"Prometheus Metrics","description":"Returns Prometheus-format metrics for monitoring","operationId":"getMetrics","responses":{"200":{"description":"Prometheus metrics","content":{"text/plain":{"schema":{"type":"string"},"example":"# HELP tts_requests_total Total TTS requests\n# TYPE tts_requests_total counter\ntts_requests_total{voice=\"default\",status=\"success\"} 42\n"}}}}}},"/v1/audio/speech":{"post":{"tags":["Speech"],"summary":"Generate Speech","description":"Synthesizes speech from text using StyleTTS2 with optional voice cloning.\n\n## Parameters\n- **alpha**: Controls timbre (voice characteristics). Higher values make output sound more like reference.\n- **beta**: Controls prosody (rhythm, intonation). Higher values add more expression.\n- **diffusion_steps**: Quality vs speed tradeoff. 
More steps = higher quality but slower.\n- **embedding_scale**: Emotional intensity multiplier.\n- **speed**: Speech rate adjustment (post-processing).\n\n## Voice Cloning\nTo clone a voice, first upload a reference audio file via `/v1/voices/upload`,\nthen specify the filename in the `reference_audio` parameter.\n","operationId":"generateSpeech","requestBody":{"required":true,"content":{"application/json":{"schema":{"$ref":"#/components/schemas/SpeechRequest"},"examples":{"basic":{"summary":"Basic speech generation","value":{"input":"Hello, this is a test of the StyleTTS2 system."}},"voiceClone":{"summary":"Voice cloning with custom parameters","value":{"input":"This should sound like the reference voice.","reference_audio":"narrator.wav","alpha":0.5,"beta":0.7,"diffusion_steps":10}},"highQuality":{"summary":"High quality production settings","value":{"input":"Professional narration with maximum quality.","reference_audio":"narrator.wav","alpha":0.6,"beta":0.8,"diffusion_steps":15,"embedding_scale":1.1}}}}}},"responses":{"200":{"description":"Generated audio file","content":{"audio/wav":{"schema":{"type":"string","format":"binary"}}},"headers":{"Content-Disposition":{"schema":{"type":"string"},"example":"attachment; filename=speech.wav"}}},"400":{"description":"Invalid request parameters","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"},"examples":{"missingInput":{"value":{"error":"Missing required field: input"}},"invalidAlpha":{"value":{"error":"alpha must be between 0.0 and 1.0"}}}}}},"500":{"description":"Internal server error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"},"example":{"error":"Speech synthesis failed: CUDA out of memory"}}}}}}},"/v1/voices":{"get":{"tags":["Voices"],"summary":"List Voices","description":"Returns a list of available reference voices for voice cloning","operationId":"listVoices","responses":{"200":{"description":"List of 
voices","content":{"application/json":{"schema":{"$ref":"#/components/schemas/VoiceListResponse"},"example":{"voices":[{"name":"narrator","filename":"narrator.wav","size":240000},{"name":"assistant","filename":"assistant.wav","size":185000}]}}}},"500":{"description":"Internal server error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/v1/voices/upload":{"post":{"tags":["Voices"],"summary":"Upload Voice","description":"Upload a new reference audio file for voice cloning.\n\n## Requirements\n- **Duration**: 5-7 seconds optimal\n- **Format**: WAV, MP3, FLAC, or OGG\n- **Quality**: Clear recording with minimal background noise\n- **Content**: Natural speech with varied intonation\n","operationId":"uploadVoice","requestBody":{"required":true,"content":{"multipart/form-data":{"schema":{"type":"object","properties":{"audio":{"type":"string","format":"binary","description":"Audio file (WAV, MP3, FLAC, or OGG)"},"name":{"type":"string","description":"Custom name for the voice (optional, defaults to filename)"}},"required":["audio"]}}}},"responses":{"200":{"description":"Voice uploaded successfully","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Voice"},"example":{"name":"my_voice","filename":"my_voice.wav","size":240000}}}},"400":{"description":"Invalid request","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"},"examples":{"noFile":{"value":{"error":"No audio file provided"}},"invalidFormat":{"value":{"error":"Invalid file format. 
Allowed: wav, mp3, flac, ogg"}}}}}},"500":{"description":"Internal server error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}},"/v1/voices/{filename}":{"delete":{"tags":["Voices"],"summary":"Delete Voice","description":"Delete a reference audio file","operationId":"deleteVoice","parameters":[{"name":"filename","in":"path","required":true,"description":"Filename of the voice to delete (including extension)","schema":{"type":"string"},"example":"my_voice.wav"}],"responses":{"200":{"description":"Voice deleted successfully","content":{"application/json":{"schema":{"type":"object","properties":{"message":{"type":"string"},"filename":{"type":"string"}}},"example":{"message":"Voice deleted","filename":"my_voice.wav"}}}},"404":{"description":"Voice not found","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"},"example":{"error":"Voice not found"}}}},"500":{"description":"Internal server error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/Error"}}}}}}}},"components":{"schemas":{"SpeechRequest":{"type":"object","required":["input"],"properties":{"input":{"type":"string","description":"Text to synthesize (up to 5000 characters accepted; about 1000 or fewer recommended)","maxLength":5000,"example":"Hello, this is a test of the StyleTTS2 system."},"reference_audio":{"type":["string","null"],"description":"Filename of reference voice for cloning (from /v1/voices)","example":"narrator.wav"},"alpha":{"type":"number","format":"float","minimum":0.0,"maximum":1.0,"default":0.3,"description":"Timbre control (0.0-1.0). Higher values make output sound more like the reference voice.\n- 0.0-0.2: Minimal reference influence\n- 0.3-0.4: Subtle cloning (default)\n- 0.5-0.6: Moderate cloning\n- 0.7-1.0: Strong voice matching\n"},"beta":{"type":"number","format":"float","minimum":0.0,"maximum":1.0,"default":0.7,"description":"Prosody control (0.0-1.0). 
Higher values add more expression and intonation variation.\n- 0.0-0.3: Flat, monotone delivery\n- 0.4-0.6: Neutral, measured speech\n- 0.7-0.8: Natural, varied (default)\n- 0.9-1.0: Highly expressive\n"},"diffusion_steps":{"type":"integer","minimum":1,"maximum":20,"default":5,"description":"Number of diffusion steps. More steps = higher quality but slower synthesis.\n- 1-2: Fast preview (~0.5s)\n- 3-5: Balanced (default, ~1.5s)\n- 6-10: High quality (~3s)\n- 11-20: Maximum quality (~5-6s)\n"},"embedding_scale":{"type":"number","format":"float","minimum":0.5,"maximum":2.0,"default":1.0,"description":"Emotional intensity multiplier.\n- 0.5-0.8: Subdued, calm\n- 0.9-1.1: Natural (default)\n- 1.2-2.0: Enhanced emotion\n"},"speed":{"type":"number","format":"float","minimum":0.5,"maximum":2.0,"default":1.0,"description":"Speech rate adjustment (post-processing time stretch)"},"response_format":{"type":"string","enum":["wav"],"default":"wav","description":"Output audio format (currently only WAV supported)"}}},"VoiceListResponse":{"type":"object","properties":{"voices":{"type":"array","items":{"$ref":"#/components/schemas/Voice"}}}},"Voice":{"type":"object","properties":{"name":{"type":"string","description":"Voice name (filename without extension)","example":"narrator"},"filename":{"type":"string","description":"Full filename with extension","example":"narrator.wav"},"size":{"type":"integer","description":"File size in bytes","example":240000}}},"HealthResponse":{"type":"object","properties":{"status":{"type":"string","enum":["healthy","unhealthy"],"example":"healthy"},"model_loaded":{"type":"boolean","description":"Whether the StyleTTS2 model is loaded and ready","example":true},"device":{"type":"string","description":"Compute device (cuda or cpu)","example":"cuda"},"gpu_available":{"type":"boolean","description":"Whether GPU is available","example":true},"gpu_name":{"type":"string","description":"GPU model name (if available)","example":"NVIDIA RTX PRO 6000 
Blackwell"},"gpu_memory":{"type":"object","properties":{"allocated":{"type":"string","description":"Currently allocated GPU memory","example":"0.72 GB"},"reserved":{"type":"string","description":"Reserved GPU memory","example":"0.74 GB"}}}}},"ApiInfo":{"type":"object","properties":{"service":{"type":"string","example":"StyleTTS2 API"},"version":{"type":"string","example":"1.0.0"},"endpoints":{"type":"object","additionalProperties":{"type":"string"}}}},"Error":{"type":"object","properties":{"error":{"type":"string","description":"Error message","example":"Missing required field: input"}}}}}}