Full-duplex voice pipeline gateway for the Haiven AI platform. Routes audio through the STT → orchestrator → TTS pipeline and returns spoken responses. Provides three interaction modes (voice-to-voice, text-to-voice, voice note capture) plus a confirm flow for voice-driven approval of pending actions such as email sends and calendar creates.
| Property | Value |
|---|---|
| Container | haven-voice-gateway |
| Image | haiven/haven-voice-gateway:latest |
| Host Port | 8490 |
| Container Port | 8000 |
| Domain | voice.haiven.site (HTTPS, SSO-protected) |
| Auth | Authentik SSO (authentik-secure-chain@file) |
| Networks | web, backend |
| GPU | None (CPU-only orchestration) |
| Memory | 256M limit / 64M reservation |
| CPU | 1 core limit / 0.25 reservation |
| User | 1000:1000 |
| Source | /mnt/apps/src/haven-voice-gateway |
| Log Rotation | 20MB × 3 files |
flowchart LR
A([User Audio]) --> GW[haven-voice-gateway :8490]
GW -->|POST /transcribe| STT[haiven-transcribe :8000]
STT -->|transcript| GW
GW -->|POST /orchestrate| ORCH[haiven-orchestrator :8000]
ORCH -->|response text + intent| GW
GW -->|POST /tts| TTS[haven-tts-gateway :8000]
TTS -->|WAV stream| GW
GW -->|Streaming WAV| B([User Speaker])
GW -->|GET/POST /api/v1/...| WH[work-hub :8030]
The gateway is a thin orchestration layer — it holds no state, performs no ML inference, and touches no audio storage. Its job is sequencing upstream calls and streaming the result back to the caller.
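The sequencing described above can be sketched in a few lines. This is a minimal illustration, not the gateway's actual code: `run_pipeline`, `NoSpeechDetected`, and the three callables are hypothetical stand-ins for the real HTTP clients, and the result is returned as whole bytes rather than streamed.

```python
import time
from typing import Callable, Dict, Tuple


class NoSpeechDetected(Exception):
    """Hypothetical error for an empty transcript (the API reports this as 422)."""


def run_pipeline(
    audio: bytes,
    stt: Callable[[bytes], str],
    orchestrate: Callable[[str], Tuple[str, str]],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, str, Dict[str, float]]:
    """Call STT, orchestrator, and TTS in order, timing each stage in ms."""
    timings: Dict[str, float] = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        result = fn(arg)
        timings[name] = (time.perf_counter() - start) * 1000.0
        return result

    transcript = timed("stt", stt, audio)
    if not transcript.strip():
        raise NoSpeechDetected("No speech detected in audio")
    reply_text, intent = timed("orch", orchestrate, transcript)
    wav = timed("tts", tts, reply_text)
    return wav, intent, timings


# Stand-in stages so the sketch runs without any upstream service.
wav, intent, timings = run_pipeline(
    b"fake-audio",
    stt=lambda audio: "what is on my calendar today",
    orchestrate=lambda text: ("You have two meetings.", "calendar_query"),
    tts=lambda text: b"RIFF" + text.encode("utf-8"),
)
```

Passing the stages in as callables keeps the sequencing logic testable without network access, which fits a layer that holds no state of its own.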
| Mode | Endpoint | Input | Output |
|---|---|---|---|
| Voice-to-voice | POST /voice | Audio file (multipart) | Streaming WAV |
| Text-to-voice | POST /voice/text | JSON {text} | Streaming WAV |
| Voice note | POST /voice/note | Audio file (multipart) | JSON confirmation |
| Confirm preview | GET /confirm/pending/{task_id} | Task ID | JSON TTS preview |
| Confirm action | POST /confirm/action | JSON {task_id, action} | JSON result |
| Stage | Typical | Target |
|---|---|---|
| STT (haiven-transcribe) | ~800ms | — |
| Orchestrator (haiven-orchestrator) | ~1500ms | — |
| TTS (haven-tts-gateway) | ~120ms | — |
| Total end-to-end | ~2420ms | 3200ms |
Cold path (model loading) is excluded from the 3.2s target; warm path assumes all upstream services already have models loaded.
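The warm-path budget can be checked with simple arithmetic. The sketch below only restates the typical figures from the table; the variable names are illustrative.

```python
# Typical warm-path stage latencies from the table above, in milliseconds.
TYPICAL_MS = {"stt": 800, "orchestrator": 1500, "tts": 120}
TARGET_MS = 3200

total_ms = sum(TYPICAL_MS.values())      # 2420
headroom_ms = TARGET_MS - total_ms       # 780
share = {stage: ms / total_ms for stage, ms in TYPICAL_MS.items()}

print(f"total={total_ms}ms headroom={headroom_ms}ms")
for stage, fraction in share.items():
    print(f"{stage}: {fraction:.0%}")
```

On these figures the orchestrator accounts for roughly 62% of the typical total, and about 780ms of headroom remains before the 3200ms target is exceeded.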
Application source lives at /mnt/apps/src/haven-voice-gateway. Inside the container it is mounted at /app/app/.
/app/app/
├── main.py # FastAPI app: /voice, /voice/text, /voice/note, /health
├── config.py # Pydantic Settings (VOICE_ prefix)
├── confirm_flow.py # Confirm flow router: /confirm/pending, /confirm/action
├── stt_client.py # HTTP client for haiven-transcribe
├── tts_client.py # HTTP client for haven-tts-gateway
└── orchestrator_client.py # HTTP client for haiven-orchestrator
All variables use the VOICE_ prefix. Defaults are set in docker-compose.yml; sensitive overrides go in .env.
| Variable | Default | Description |
|---|---|---|
| VOICE_STT_URL | http://haiven-transcribe:8000 | STT backend base URL |
| VOICE_TTS_URL | http://haven-tts-gateway:8000 | TTS backend base URL |
| VOICE_ORCHESTRATOR_URL | http://haiven-orchestrator:8000 | Orchestrator base URL |
| VOICE_WORKHUB_URL | http://work-hub:8030 | work-hub base URL (confirm flow) |
| VOICE_TTS_STYLE | fast | Default TTS rendering style passed to haven-tts-gateway |
| VOICE_LOG_LEVEL | INFO | Python logging level (DEBUG, INFO, WARNING, ERROR) |
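To illustrate how the VOICE_ prefix and defaults fit together, here is a standard-library sketch. The service itself uses Pydantic Settings in config.py; `VoiceSettings`, `load_settings`, and the log-level validation below are assumptions made for illustration, not the real implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Defaults copied from the table above.
DEFAULTS = {
    "VOICE_STT_URL": "http://haiven-transcribe:8000",
    "VOICE_TTS_URL": "http://haven-tts-gateway:8000",
    "VOICE_ORCHESTRATOR_URL": "http://haiven-orchestrator:8000",
    "VOICE_WORKHUB_URL": "http://work-hub:8030",
    "VOICE_TTS_STYLE": "fast",
    "VOICE_LOG_LEVEL": "INFO",
}
LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


@dataclass(frozen=True)
class VoiceSettings:
    stt_url: str
    tts_url: str
    orchestrator_url: str
    workhub_url: str
    tts_style: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> VoiceSettings:
    """Read VOICE_-prefixed variables, falling back to the documented defaults."""
    def get(name: str) -> str:
        return env.get(name, DEFAULTS[name])

    log_level = get("VOICE_LOG_LEVEL")
    if log_level not in LOG_LEVELS:  # illustrative validation only
        raise ValueError(f"Unsupported VOICE_LOG_LEVEL: {log_level!r}")
    return VoiceSettings(
        stt_url=get("VOICE_STT_URL"),
        tts_url=get("VOICE_TTS_URL"),
        orchestrator_url=get("VOICE_ORCHESTRATOR_URL"),
        workhub_url=get("VOICE_WORKHUB_URL"),
        tts_style=get("VOICE_TTS_STYLE"),
        log_level=log_level,
    )


settings = load_settings({"VOICE_LOG_LEVEL": "DEBUG"})
```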
Additional inherited environment variables (set in compose):
| Variable | Value | Purpose |
|---|---|---|
| PYTHONUNBUFFERED | 1 | Real-time log streaming |
| DO_NOT_TRACK | 1 | Disable telemetry in frameworks |
| TZ | America/New_York | Container timezone |
Full voice pipeline. Accepts uploaded audio, transcribes it, routes the transcript through the orchestrator, synthesizes the response, and streams back WAV audio.
Request: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary (audio) | Yes | Audio file in any format accepted by haiven-transcribe (WAV, MP3, FLAC, OGG, etc.) |
| session_id | string | No | UUID to maintain conversation context across turns; generated if omitted |
Response: audio/wav (streaming)
Response headers:
| Header | Description |
|---|---|
| X-Request-Id | Unique ID for this request, for log correlation |
| X-Total-Latency-Ms | Wall-clock time for the entire pipeline (ms) |
| X-STT-Latency-Ms | Time spent in haiven-transcribe (ms) |
| X-Orch-Latency-Ms | Time spent in haiven-orchestrator (ms) |
| X-TTS-Latency-Ms | Time spent in haven-tts-gateway (ms) |
| X-Intent | Intent label returned by the orchestrator (e.g. calendar_query, general_chat) |
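A client can use these headers to see where time went on a given request. The sketch below is a hypothetical helper (`parse_latency` and `slowest_stage` are not part of the service), the sample values are made up, and it assumes the header values parse as plain numbers.

```python
from typing import Dict, Mapping

STAGE_HEADERS = {
    "stt": "X-STT-Latency-Ms",
    "orch": "X-Orch-Latency-Ms",
    "tts": "X-TTS-Latency-Ms",
}


def parse_latency(headers: Mapping[str, str]) -> Dict[str, float]:
    """Extract per-stage latencies and the gateway's own overhead, in ms."""
    lowered = {key.lower(): value for key, value in headers.items()}
    stages: Dict[str, float] = {}
    for stage, header in STAGE_HEADERS.items():
        raw = lowered.get(header.lower())
        if raw is not None:  # e.g. no STT header on POST /voice/text
            stages[stage] = float(raw)
    report = dict(stages)
    total = lowered.get("x-total-latency-ms")
    if total is not None:
        report["total"] = float(total)
        # Whatever is left after the upstream stages is gateway overhead.
        report["overhead"] = float(total) - sum(stages.values())
    return report


def slowest_stage(report: Mapping[str, float]) -> str:
    """Return the name of the upstream stage with the highest latency."""
    stages = {k: v for k, v in report.items() if k in STAGE_HEADERS}
    return max(stages, key=stages.get)


# Illustrative header values, not measurements.
sample = {
    "X-STT-Latency-Ms": "812",
    "X-Orch-Latency-Ms": "1490",
    "X-TTS-Latency-Ms": "118",
    "X-Total-Latency-Ms": "2455",
}
report = parse_latency(sample)
```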
Privacy note: Audio bytes are zeroed in memory (b"\x00" * len) and the reference deleted immediately after STT processing completes. No audio data is written to disk at any point.
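As a rough illustration of the zeroing step, the sketch below overwrites a mutable buffer before dropping the reference. It assumes the audio is held in a `bytearray`; an immutable `bytes` object cannot be overwritten, so `b"\x00" * len(data)` on its own would only build a new object. The `scrub` helper is hypothetical, and copies made elsewhere (for example by an HTTP client) are not affected by it.

```python
def scrub(buffer: bytearray) -> None:
    """Overwrite every byte of the buffer with zero, keeping its length."""
    buffer[:] = b"\x00" * len(buffer)


audio = bytearray(b"RIFF fake audio payload")
original_length = len(audio)

# ... STT call would happen here ...

scrub(audio)
zeroed_snapshot = bytes(audio)  # kept only so the result can be inspected
del audio                       # drop the reference once scrubbed
```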
Error responses:
| Status | Condition |
|---|---|
| 422 | No speech detected in audio |
| 502 | STT, orchestrator, or TTS upstream failed |
Example:
curl -s -X POST https://voice.haiven.site/voice \
-F "file=@recording.wav" \
-F "session_id=abc-123" \
--output response.wav \
-D -
Text-to-voice pipeline. Skips STT; sends text directly to the orchestrator and returns spoken audio.
Request: application/json
{
"text": "What's on my calendar today?",
"session_id": "optional-uuid"
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to route through the orchestrator |
| session_id | string | No | Conversation session UUID |
Response: audio/wav (streaming) with the same latency headers as POST /voice (minus X-STT-Latency-Ms).
Example:
curl -s -X POST https://voice.haiven.site/voice/text \
-H "Content-Type: application/json" \
-d '{"text": "Set a timer for 10 minutes"}' \
--output response.wav
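The same request can be built from Python with the standard library. This is a sketch under assumptions: `build_text_request` is a hypothetical helper, and actually sending the request would additionally need a valid Authentik session, which is not shown here.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://voice.haiven.site"


def build_text_request(
    text: str,
    session_id: Optional[str] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST /voice/text request."""
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = {"text": text}
    if session_id is not None:
        payload["session_id"] = session_id  # optional per the field table
    return urllib.request.Request(
        url=f"{base_url}/voice/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_text_request("Set a timer for 10 minutes")
# To send it: urllib.request.urlopen(request), then write the body to a .wav file.
```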
Voice note capture. Transcribes uploaded audio and routes it to the orchestrator under the voice_note intent (by prepending "Note to self: " to the transcript). Returns a JSON confirmation rather than audio — suitable for quick capture flows where playback is not needed.
Request: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary (audio) | Yes | Audio recording of the note |
Response: application/json
{
"transcript": "The vendor agreed to 30 days net terms",
"ingested": true,
"message": "Note recorded."
}
| Field | Type | Description |
|---|---|---|
| transcript | string | Raw STT output |
| ingested | boolean | true when intent resolved to voice_note and no clarification was needed |
| message | string | Human-readable status or orchestrator response content |
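The note flow can be sketched as two small functions. The prefix and the `ingested` rule come from the description above; the function names, the `needs_clarification` flag, and the choice of when `message` falls back to the orchestrator's text are assumptions for illustration.

```python
NOTE_PREFIX = "Note to self: "


def build_note_prompt(transcript: str) -> str:
    """Prefix the transcript so the orchestrator treats it as a voice note."""
    return NOTE_PREFIX + transcript.strip()


def note_response(
    transcript: str,
    intent: str,
    needs_clarification: bool,
    orchestrator_message: str,
) -> dict:
    """Shape the JSON body returned by the voice note endpoint."""
    ingested = intent == "voice_note" and not needs_clarification
    return {
        "transcript": transcript,
        "ingested": ingested,
        # Assumed: a fixed status on success, orchestrator text otherwise.
        "message": "Note recorded." if ingested else orchestrator_message,
    }


prompt = build_note_prompt("The vendor agreed to 30 days net terms")
body = note_response(
    "The vendor agreed to 30 days net terms", "voice_note", False, ""
)
```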
Example:
curl -s -X POST https://voice.haiven.site/voice/note \
-F "file=@note.wav" | jq .
Fetches a pending artifact from work-hub and returns a TTS-friendly preview string. Used by the voice pipeline to read a draft aloud before asking the user to confirm or cancel.
Path parameter: task_id — work-hub task UUID
Response: application/json
{
"task_id": "abc-123",
"artifact_type": "email",
"tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: Here is the summary you requested.",
"status": "pending_review"
}
| Field | Type | Description |
|---|---|---|
| task_id | string | Task UUID echoed back |
| artifact_type | string | email, calendar, or draft |
| tts_preview | string | Short human-readable preview suitable for TTS readback |
| status | string | Artifact status from work-hub (pending_review or review) |
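A preview string in the shape of the example above could be assembled as follows. The artifact field names (`to`, `subject`, `body`), the `email_preview` helper, and the 120-character cutoff are all hypothetical; only the output format mirrors the documented example.

```python
def email_preview(to: str, subject: str, body: str, max_chars: int = 120) -> str:
    """Build a short, speakable summary of an email draft."""
    snippet = " ".join(body.split())  # collapse newlines and repeated spaces
    if len(snippet) > max_chars:
        snippet = snippet[:max_chars].rstrip() + "..."
    return f"Email to {to}. Subject: {subject}. Preview: {snippet}"


preview = email_preview(
    "alice@example.com",
    "Q2 report",
    "Here is the summary you requested.",
)
```

Keeping the preview short matters because the string is read aloud before the user is asked to confirm or cancel.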
Error responses:
| Status | Condition |
|---|---|
| 404 | Task not found, or task has no pending artifacts |
| 409 | Artifact status is not pending_review or review (already executed or cancelled) |
Example:
curl -s https://voice.haiven.site/confirm/pending/abc-123 | jq .
Executes or cancels a pending action. The confirm flow state machine transitions:
pending_review → (confirm) → executed
pending_review → (cancel) → cancelled
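The two transitions above map naturally onto a lookup table. This sketch covers only the `pending_review` transitions listed here; the exception classes and `next_status` are hypothetical names, and how the gateway treats other starting states is not shown.

```python
# (current status, action) -> new status, as listed in the state machine above.
TRANSITIONS = {
    ("pending_review", "confirm"): "executed",
    ("pending_review", "cancel"): "cancelled",
}
VALID_ACTIONS = {"confirm", "cancel"}


class InvalidAction(ValueError):
    """Action is neither confirm nor cancel."""


class InvalidState(Exception):
    """Artifact is not in a state that allows the requested action."""


def next_status(status: str, action: str) -> str:
    """Return the status an artifact moves to, or raise if not allowed."""
    if action not in VALID_ACTIONS:
        raise InvalidAction(f"Unknown action: {action!r}")
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise InvalidState(f"Cannot {action} an artifact in state {status!r}")


after_confirm = next_status("pending_review", "confirm")
after_cancel = next_status("pending_review", "cancel")
```

Because `executed` and `cancelled` have no outgoing transitions in the table, a second confirm or cancel on the same artifact is rejected.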
Request: application/json
{
"task_id": "abc-123",
"action": "confirm"
}
| Field | Type | Required | Description |
|---|---|---|---|
| task_id | string | Yes | work-hub task UUID |
| action | string | Yes | confirm to execute, cancel to discard |
Response: application/json
{
"status": "executed",
"action": "confirm",
"detail": "Email sent. Message ID: msg_xyz",
"tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: ..."
}
Confirm behaviour by artifact type:
| Artifact type | Action taken |
|---|---|
| email | Calls POST /api/v1/email/send on work-hub |
| calendar | Calls POST /api/v1/calendar/events on work-hub |
| draft | Marks artifact as approved (no external call) |
After execution, work-hub task status is patched to done. On cancel, task status is patched to cancelled.
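The per-type behaviour and the final task status can be summarised as a small planning function. The work-hub paths and status values are taken from the text above; `plan_action` and the shape of its return value are illustrative assumptions.

```python
from typing import Optional, Tuple

# work-hub call made on confirm, per artifact type (None = no external call).
CONFIRM_CALLS = {
    "email": ("POST", "/api/v1/email/send"),
    "calendar": ("POST", "/api/v1/calendar/events"),
    "draft": None,
}
# Status the work-hub task is patched to after each action.
TASK_STATUS = {"confirm": "done", "cancel": "cancelled"}


def plan_action(artifact_type: str, action: str) -> dict:
    """Describe what a confirm or cancel would do, without doing it."""
    if action not in TASK_STATUS:
        raise ValueError(f"Unknown action: {action!r}")
    if artifact_type not in CONFIRM_CALLS:
        raise ValueError(f"Unknown artifact type: {artifact_type!r}")
    call: Optional[Tuple[str, str]] = (
        CONFIRM_CALLS[artifact_type] if action == "confirm" else None
    )
    return {"workhub_call": call, "task_status": TASK_STATUS[action]}


email_plan = plan_action("email", "confirm")
draft_plan = plan_action("draft", "confirm")
cancel_plan = plan_action("calendar", "cancel")
```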
Error responses:
| Status | Condition |
|---|---|
| 400 | action is not confirm or cancel |
| 404 | Task not found or has no artifacts |
| 409 | Calendar time slot conflict (calendar artifact only) |
| 429 | Email rate limit exceeded |
| 502 | Upstream execution failure (work-hub unreachable or returned error) |
Example:
# Confirm (send email / create event)
curl -s -X POST https://voice.haiven.site/confirm/action \
-H "Content-Type: application/json" \
-d '{"task_id": "abc-123", "action": "confirm"}' | jq .
# Cancel
curl -s -X POST https://voice.haiven.site/confirm/action \
-H "Content-Type: application/json" \
-d '{"task_id": "abc-123", "action": "cancel"}' | jq .
Checks connectivity to all three upstream services and reports per-service status.
Response: application/json
{
"status": "healthy",
"stt": "up",
"tts": "up",
"orchestrator": "up"
}
status is "healthy" when all three upstreams respond. It is "degraded" if any upstream is unreachable. Individual fields reflect per-service state ("up" or "down").
Note: The health endpoint does not probe work-hub — confirm flow availability is not reflected here.
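The aggregation rule is simple enough to state in code. In this sketch `aggregate_health` is a hypothetical helper and the boolean inputs stand in for the real upstream probes.

```python
from typing import Dict, Mapping


def aggregate_health(checks: Mapping[str, bool]) -> Dict[str, str]:
    """Combine per-service probe results into the /health response shape."""
    body = {name: ("up" if ok else "down") for name, ok in checks.items()}
    status = "healthy" if all(checks.values()) else "degraded"
    return {"status": status, **body}


all_up = aggregate_health({"stt": True, "tts": True, "orchestrator": True})
one_down = aggregate_health({"stt": True, "tts": False, "orchestrator": True})
```

work-hub is deliberately absent from the inputs, matching the note that confirm flow availability is not reflected in this endpoint.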
Example:
curl -s https://voice.haiven.site/health | jq .
Prometheus metrics endpoint. Scraped automatically by Prometheus via the prometheus.scrape=true Docker label.
| Service | Role | Internal Address | Host Port |
|---|---|---|---|
| haiven-transcribe | Speech-to-text (tri-engine: Canary, Parakeet, Whisper Turbo + pyannote diarizer) | http://haiven-transcribe:8000 | — |
| haiven-orchestrator | Intent classification and agent dispatch | http://haiven-orchestrator:8000 | 8500 |
| haven-tts-gateway | Text-to-speech synthesis | http://haven-tts-gateway:8000 | 8485 |
| work-hub | Task store for confirm flow (email send, calendar create) | http://work-hub:8030 | 8030 |
All services must be reachable on the backend Docker network. The /health endpoint reflects STT, TTS, and orchestrator status. work-hub failures surface as 502 errors on confirm flow endpoints.
HTTPS: voice.haiven.site → websecure entrypoint
middleware: authentik-secure-chain@file (SSO)
backend: haven-voice-gateway:8000
HTTP: voice.haiven.site → web entrypoint
middleware: voice-gateway-redirect (301 → HTTPS)
The service is protected by Authentik SSO. All requests must carry a valid session cookie or be forwarded from a trusted internal caller with an Authentik forward-auth token.
# Follow live logs
docker logs -f haven-voice-gateway
# Last 100 lines
docker logs --tail 100 haven-voice-gateway
# With timestamps
docker logs -f -t haven-voice-gateway
Log rotation: 20MB per file, 3 files retained (driver: json-file).
To increase verbosity, set VOICE_LOG_LEVEL=DEBUG in the .env file and restart the container.
Prometheus scrapes /metrics at haven-voice-gateway:8000/metrics. Latency headers on each response (X-*-Latency-Ms) can be used to derive per-stage histogram data in Grafana.
Docker runs the health check every 30 seconds with a 10-second timeout. The container is marked unhealthy after 3 consecutive failures. start_period is 15 seconds to allow startup time before health checks begin.
Configuration (from docker-compose.yml):
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 15s
Manual health check:
docker inspect haven-voice-gateway --format '{{.State.Health.Status}}'
# Or test the endpoint directly
docker exec haven-voice-gateway curl -f http://localhost:8000/health
cd /mnt/apps/docker/ai/haven-voice-gateway
# Start
docker compose up -d
# Stop
docker compose down
# Restart (not rolling: expect a brief interruption, since this is a single-instance service)
docker compose restart haven-voice-gateway
# Force recreate (picks up compose changes)
docker compose up -d --force-recreate haven-voice-gateway
cd /mnt/apps/docker/ai/haven-voice-gateway
docker compose build --no-cache haven-voice-gateway
docker compose up -d --force-recreate haven-voice-gateway
# From inside the container
docker exec haven-voice-gateway curl -s http://haiven-transcribe:8000/health
docker exec haven-voice-gateway curl -s http://haven-tts-gateway:8000/health
docker exec haven-voice-gateway curl -s http://haiven-orchestrator:8000/health
docker exec haven-voice-gateway curl -s http://work-hub:8030/health
502 Bad Gateway from Traefik
The container is unhealthy or hasn't passed its start_period. Check: docker ps (look for (unhealthy) or (starting)), then docker logs haven-voice-gateway.
"status": "degraded" on /health
One or more upstreams are unreachable. Check each service is running and on the backend network:
docker inspect haven-voice-gateway --format '{{range $name, $net := .NetworkSettings.Networks}}{{$name}} {{end}}'
docker network inspect backend --format '{{range .Containers}}{{.Name}} {{end}}'
422 "No speech detected in audio"
The STT engine received audio but found no speech content. Check microphone levels, background noise, or try a higher sample rate (16kHz+).
High latency / timeouts
STT and orchestrator are the dominant latency contributors. If X-Orch-Latency-Ms is very high, the orchestrator's LLM backend (llama-swap or vLLM) may be under load or loading a model cold. If X-STT-Latency-Ms is high, check haiven-transcribe GPU utilization.
Audio response is silent or malformed
Enable VOICE_LOG_LEVEL=DEBUG and re-send the request. Look for TTS errors in the logs. Verify haven-tts-gateway is responding correctly:
docker exec haven-voice-gateway curl -s http://haven-tts-gateway:8000/health
Confirm flow returns 404 for task_id
The task does not exist in work-hub, or has no artifacts. Verify the task ID is correct and work-hub is reachable:
docker exec haven-voice-gateway curl -s http://work-hub:8030/health
Confirm flow returns 409 (wrong status)
The artifact has already been executed or cancelled. Each task can only be confirmed or cancelled once.
Audio bytes are zeroed in memory (b"\x00" * len) after STT use and immediately dereferenced.