haven-voice-gateway

Full-duplex voice pipeline gateway for the Haiven AI platform. Routes audio through the STT → orchestrator → TTS pipeline and returns spoken responses. Provides three interaction modes (voice-to-voice, text-to-voice, voice note capture) plus a confirm flow for voice-driven approval of pending actions such as email sends and calendar creates.

Quick Reference

Property Value
Container haven-voice-gateway
Image haiven/haven-voice-gateway:latest
Host Port 8490
Container Port 8000
Domain voice.haiven.site (HTTPS, SSO-protected)
Auth Authentik SSO (authentik-secure-chain@file)
Networks web, backend
GPU None (CPU-only orchestration)
Memory 256M limit / 64M reservation
CPU 1 core limit / 0.25 reservation
User 1000:1000
Source /mnt/apps/src/haven-voice-gateway
Log Rotation 20MB × 3 files

Architecture

flowchart LR
    A([User Audio]) --> GW[haven-voice-gateway :8490]
    GW -->|POST /transcribe| STT[haiven-transcribe :8000]
    STT -->|transcript| GW
    GW -->|POST /orchestrate| ORCH[haiven-orchestrator :8000]
    ORCH -->|response text + intent| GW
    GW -->|POST /tts| TTS[haven-tts-gateway :8000]
    TTS -->|WAV stream| GW
    GW -->|Streaming WAV| B([User Speaker])
    GW -->|GET/POST /api/v1/...| WH[work-hub :8030]

The gateway is a thin orchestration layer — it holds no state, performs no ML inference, and touches no audio storage. Its job is sequencing upstream calls and streaming the result back to the caller.
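The sequencing described above can be sketched as a single function that takes each upstream stage as an injected callable. This is a minimal illustration and not the gateway's actual code: `run_voice_pipeline`, `PipelineResult`, and the callable signatures are hypothetical names chosen for the example, and real upstream calls would be HTTP requests rather than in-process functions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class PipelineResult:
    transcript: str
    reply_text: str
    intent: str
    audio: bytes


def run_voice_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    orchestrate: Callable[[str], Tuple[str, str]],
    synthesize: Callable[[str], bytes],
) -> PipelineResult:
    """Sequence STT -> orchestrator -> TTS; each stage is an injected callable."""
    transcript = transcribe(audio)
    if not transcript.strip():
        # Corresponds to the documented 422 "No speech detected in audio" case.
        raise ValueError("No speech detected in audio")
    reply_text, intent = orchestrate(transcript)
    wav = synthesize(reply_text)
    return PipelineResult(transcript, reply_text, intent, wav)


if __name__ == "__main__":
    # Stand-in stages; a real deployment would call the upstream services.
    result = run_voice_pipeline(
        b"fake-audio",
        transcribe=lambda _audio: "what time is it",
        orchestrate=lambda text: (f"You asked: {text}", "general_chat"),
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(result.intent, len(result.audio))
```

Injecting the stages keeps the function stateless, which matches the description of the gateway as a layer that only sequences calls.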

Pipeline Modes

Mode Endpoint Input Output
Voice-to-voice POST /voice Audio file (multipart) Streaming WAV
Text-to-voice POST /voice/text JSON {text} Streaming WAV
Voice note POST /voice/note Audio file (multipart) JSON confirmation
Confirm preview GET /confirm/pending/{task_id} Task ID JSON TTS preview
Confirm action POST /confirm/action JSON {task_id, action} JSON result

Latency Budget (warm path)

Stage                               Typical   Target
STT (haiven-transcribe)             ~800ms
Orchestrator (haiven-orchestrator)  ~1500ms
TTS (haven-tts-gateway)             ~120ms
Total end-to-end                    ~2420ms   3200ms

Cold path (model loading) is excluded from the 3.2s target; warm path assumes all upstream services already have models loaded.
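The budget arithmetic can be checked with a few lines of Python. The stage values and the 3200 ms target are taken from the table above; the function name `budget_headroom` is only an illustrative choice.

```python
# Typical warm-path stage latencies (ms) and the end-to-end target from the table.
TYPICAL_MS = {"stt": 800, "orchestrator": 1500, "tts": 120}
TARGET_TOTAL_MS = 3200


def budget_headroom(stage_ms, target_ms=TARGET_TOTAL_MS):
    """Return (total, headroom); headroom is negative when over budget."""
    total = sum(stage_ms.values())
    return total, target_ms - total


total, headroom = budget_headroom(TYPICAL_MS)
print(total, headroom)  # 2420 780
```

With the typical figures, the warm path leaves roughly 780 ms of slack for network hops and gateway overhead before the target is exceeded.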

Source Layout

Application source lives at /mnt/apps/src/haven-voice-gateway. Inside the container it is mounted at /app/app/.

/app/app/
├── main.py                 # FastAPI app: /voice, /voice/text, /voice/note, /health
├── config.py               # Pydantic Settings (VOICE_ prefix)
├── confirm_flow.py         # Confirm flow router: /confirm/pending, /confirm/action
├── stt_client.py           # HTTP client for haiven-transcribe
├── tts_client.py           # HTTP client for haven-tts-gateway
└── orchestrator_client.py  # HTTP client for haiven-orchestrator

Configuration

All variables use the VOICE_ prefix. Defaults are set in docker-compose.yml; sensitive overrides go in .env.

Variable Default Description
VOICE_STT_URL http://haiven-transcribe:8000 STT backend base URL
VOICE_TTS_URL http://haven-tts-gateway:8000 TTS backend base URL
VOICE_ORCHESTRATOR_URL http://haiven-orchestrator:8000 Orchestrator base URL
VOICE_WORKHUB_URL http://work-hub:8030 work-hub base URL (confirm flow)
VOICE_TTS_STYLE fast Default TTS rendering style passed to haven-tts-gateway
VOICE_LOG_LEVEL INFO Python logging level (DEBUG, INFO, WARNING, ERROR)
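To illustrate how the VOICE_ prefix maps environment variables onto settings, here is a standard-library stand-in. The real config.py uses Pydantic Settings; the `VoiceSettings` dataclass and `load_settings` helper below are hypothetical and only mirror the defaults listed in the table.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceSettings:
    # Defaults copied from the configuration table.
    stt_url: str = "http://haiven-transcribe:8000"
    tts_url: str = "http://haven-tts-gateway:8000"
    orchestrator_url: str = "http://haiven-orchestrator:8000"
    workhub_url: str = "http://work-hub:8030"
    tts_style: str = "fast"
    log_level: str = "INFO"


def load_settings(env: Optional[Mapping[str, str]] = None) -> VoiceSettings:
    """Build settings from VOICE_-prefixed variables, falling back to defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(VoiceSettings):
        key = "VOICE_" + field.name.upper()  # e.g. stt_url -> VOICE_STT_URL
        if key in env:
            overrides[field.name] = env[key]
    return VoiceSettings(**overrides)


if __name__ == "__main__":
    print(load_settings({"VOICE_LOG_LEVEL": "DEBUG"}))
```

Variables without the prefix are ignored, so unrelated entries in the environment cannot change gateway settings.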

Additional inherited environment variables (set in compose):

Variable Value Purpose
PYTHONUNBUFFERED 1 Real-time log streaming
DO_NOT_TRACK 1 Disable telemetry in frameworks
TZ America/New_York Container timezone

API Reference

POST /voice

Full voice pipeline. Accepts uploaded audio, transcribes it, routes the transcript through the orchestrator, synthesizes the response, and streams back WAV audio.

Request: multipart/form-data

Field Type Required Description
file binary (audio) Yes Audio file in any format accepted by haiven-transcribe (WAV, MP3, FLAC, OGG, etc.)
session_id string No UUID to maintain conversation context across turns; generated if omitted

Response: audio/wav (streaming)

Response headers:

Header Description
X-Request-Id Unique ID for this request, for log correlation
X-Total-Latency-Ms Wall-clock time for the entire pipeline (ms)
X-STT-Latency-Ms Time spent in haiven-transcribe (ms)
X-Orch-Latency-Ms Time spent in haiven-orchestrator (ms)
X-TTS-Latency-Ms Time spent in haven-tts-gateway (ms)
X-Intent Intent label returned by the orchestrator (e.g. calendar_query, general_chat)
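A client can read the latency headers into a dictionary for logging or alerting. The sketch below assumes the header values are plain numeric strings (the table gives the unit but not the exact format) and that header names should be matched case-insensitively; `parse_latency_headers` is a hypothetical helper name.

```python
from typing import Dict, Mapping

# Header names taken from the table above.
STAGE_HEADERS = {
    "stt": "X-STT-Latency-Ms",
    "orchestrator": "X-Orch-Latency-Ms",
    "tts": "X-TTS-Latency-Ms",
    "total": "X-Total-Latency-Ms",
}


def parse_latency_headers(headers: Mapping[str, str]) -> Dict[str, float]:
    """Extract per-stage latencies (ms); missing or non-numeric headers are skipped."""
    lowered = {name.lower(): value for name, value in headers.items()}
    latencies = {}
    for stage, header in STAGE_HEADERS.items():
        raw = lowered.get(header.lower())
        if raw is None:
            continue  # e.g. POST /voice/text returns no STT header
        try:
            latencies[stage] = float(raw)
        except ValueError:
            continue
    return latencies


if __name__ == "__main__":
    sample = {"x-stt-latency-ms": "790", "X-Orch-Latency-Ms": "1480.5"}
    print(parse_latency_headers(sample))
```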

Privacy note: Audio bytes are zeroed in memory (b"\x00" * len) and the reference deleted immediately after STT processing completes. No audio data is written to disk at any point.
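One way to overwrite an audio buffer in place is sketched below. It assumes the audio is held in a mutable `bytearray`; this document does not state which buffer type the gateway uses. In Python, `bytes` objects are immutable, so rebinding a name to `b"\x00" * len(...)` creates a new object rather than clearing the original. The `scrub` helper is hypothetical.

```python
def scrub(buf: bytearray) -> None:
    """Overwrite every byte of buf with zero, in place."""
    buf[:] = bytes(len(buf))  # bytes(n) is n zero bytes; slice assignment mutates buf


if __name__ == "__main__":
    audio = bytearray(b"RIFF fake audio payload")
    alias = audio          # second reference to the same buffer
    scrub(audio)
    del audio              # drop the reference once STT has finished
    print(alias)           # the shared buffer now contains only zero bytes
```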

Error responses:

Status Condition
422 No speech detected in audio
502 STT, orchestrator, or TTS upstream failed

Example:

curl -s -X POST https://voice.haiven.site/voice \
  -F "file=@recording.wav" \
  -F "session_id=abc-123" \
  --output response.wav \
  -D -

POST /voice/text

Text-to-voice pipeline. Skips STT; sends text directly to the orchestrator and returns spoken audio.

Request: application/json

{
  "text": "What's on my calendar today?",
  "session_id": "optional-uuid"
}

Field Type Required Description
text string Yes Text to route through the orchestrator
session_id string No Conversation session UUID

Response: audio/wav (streaming) with the same latency headers as POST /voice (minus X-STT-Latency-Ms).

Example:

curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d '{"text": "Set a timer for 10 minutes"}' \
  --output response.wav

POST /voice/note

Voice note capture. Transcribes uploaded audio and routes it to the orchestrator under the voice_note intent (by prepending "Note to self: " to the transcript). Returns a JSON confirmation rather than audio — suitable for quick capture flows where playback is not needed.

Request: multipart/form-data

Field Type Required Description
file binary (audio) Yes Audio recording of the note

Response: application/json

{
  "transcript": "The vendor agreed to 30 days net terms",
  "ingested": true,
  "message": "Note recorded."
}

Field Type Description
transcript string Raw STT output
ingested boolean true when intent resolved to voice_note and no clarification was needed
message string Human-readable status or orchestrator response content
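The note handling described above can be sketched as follows. The prefix string and the `ingested` rule come from this section; the function names and the `needs_clarification` flag are hypothetical, since the orchestrator's response schema is not shown here.

```python
NOTE_PREFIX = "Note to self: "


def build_note_prompt(transcript: str) -> str:
    """Prepend the prefix that steers the orchestrator to the voice_note intent."""
    return NOTE_PREFIX + transcript.strip()


def note_response(transcript: str, intent: str, needs_clarification: bool,
                  orchestrator_message: str = "") -> dict:
    """Build the JSON body; ingested is true only for a clean voice_note result."""
    ingested = intent == "voice_note" and not needs_clarification
    message = "Note recorded." if ingested else orchestrator_message
    return {"transcript": transcript, "ingested": ingested, "message": message}


if __name__ == "__main__":
    text = "The vendor agreed to 30 days net terms"
    print(build_note_prompt(text))
    print(note_response(text, "voice_note", False))
```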

Example:

curl -s -X POST https://voice.haiven.site/voice/note \
  -F "file=@note.wav" | jq .

GET /confirm/pending/{task_id}

Fetches a pending artifact from work-hub and returns a TTS-friendly preview string. Used by the voice pipeline to read a draft aloud before asking the user to confirm or cancel.

Path parameter: task_id — work-hub task UUID

Response: application/json

{
  "task_id": "abc-123",
  "artifact_type": "email",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: Here is the summary you requested.",
  "status": "pending_review"
}

Field Type Description
task_id string Task UUID echoed back
artifact_type string email, calendar, or draft
tts_preview string Short human-readable preview suitable for TTS readback
status string Artifact status from work-hub (pending_review or review)

Error responses:

Status Condition
404 Task not found, or task has no pending artifacts
409 Artifact status is not pending_review or review (already executed or cancelled)

Example:

curl -s https://voice.haiven.site/confirm/pending/abc-123 | jq .
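The status rules and the email preview format for this endpoint can be sketched as below. The preview wording follows the example response above; the task dictionary layout (`artifacts`, `status`), the choice of the first artifact, and the 80-character truncation limit are assumptions made for illustration.

```python
from typing import Optional

PREVIEWABLE_STATUSES = {"pending_review", "review"}


def preview_status_code(task: Optional[dict]) -> int:
    """Map a work-hub task (hypothetical shape) to the documented HTTP status."""
    if task is None or not task.get("artifacts"):
        return 404  # task not found, or no pending artifacts
    if task["artifacts"][0].get("status") not in PREVIEWABLE_STATUSES:
        return 409  # already executed or cancelled
    return 200


def email_preview(to: str, subject: str, body: str, max_chars: int = 80) -> str:
    """Build a short readback string in the format shown in the example."""
    snippet = " ".join(body.split())  # collapse newlines and repeated spaces
    if len(snippet) > max_chars:
        snippet = snippet[:max_chars].rstrip() + "..."
    return f"Email to {to}. Subject: {subject}. Preview: {snippet}"


if __name__ == "__main__":
    print(email_preview("alice@example.com", "Q2 report",
                        "Here is the summary you requested."))
```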

POST /confirm/action

Executes or cancels a pending action. The confirm flow state machine transitions:

pending_review → (confirm) → executed
pending_review → (cancel)  → cancelled
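The two transitions above can be expressed as a lookup table. This sketch models only the transitions shown in the diagram; the `InvalidTransition` exception and the `next_status` helper are hypothetical names.

```python
class InvalidTransition(Exception):
    """Raised when an action is not allowed from the current status."""


# (current status, action) -> resulting status, as listed in the diagram.
TRANSITIONS = {
    ("pending_review", "confirm"): "executed",
    ("pending_review", "cancel"): "cancelled",
}


def next_status(current: str, action: str) -> str:
    if action not in ("confirm", "cancel"):
        raise ValueError("action must be 'confirm' or 'cancel'")
    try:
        return TRANSITIONS[(current, action)]
    except KeyError:
        raise InvalidTransition(f"cannot {action} from {current}") from None


if __name__ == "__main__":
    print(next_status("pending_review", "confirm"))
```

Because `executed` and `cancelled` have no outgoing entries, a second confirm or cancel on the same artifact is rejected.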

Request: application/json

{
  "task_id": "abc-123",
  "action": "confirm"
}

Field Type Required Description
task_id string Yes work-hub task UUID
action string Yes confirm to execute, cancel to discard

Response: application/json

{
  "status": "executed",
  "action": "confirm",
  "detail": "Email sent. Message ID: msg_xyz",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: ..."
}

Confirm behaviour by artifact type:

Artifact type Action taken
email Calls POST /api/v1/email/send on work-hub
calendar Calls POST /api/v1/calendar/events on work-hub
draft Marks artifact as approved (no external call)

After execution, work-hub task status is patched to done. On cancel, task status is patched to cancelled.
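The per-type behaviour can be summarised as a small planning function. The work-hub paths and the resulting task statuses come from this section; the `plan_action` name, the returned dictionary shape, and the assumption that a cancel makes no execution call are illustrative only.

```python
# Execution call per artifact type on confirm; None means no external call.
EXECUTE_CALLS = {
    "email": ("POST", "/api/v1/email/send"),
    "calendar": ("POST", "/api/v1/calendar/events"),
    "draft": None,
}


def plan_action(artifact_type: str, action: str) -> dict:
    """Return which work-hub call to make and the task status to patch afterwards."""
    if action not in ("confirm", "cancel"):
        raise ValueError("action must be 'confirm' or 'cancel'")  # documented 400
    if artifact_type not in EXECUTE_CALLS:
        raise KeyError(artifact_type)
    if action == "cancel":
        # Assumption: cancelling skips the execution call entirely.
        return {"call": None, "task_status": "cancelled"}
    return {"call": EXECUTE_CALLS[artifact_type], "task_status": "done"}


if __name__ == "__main__":
    print(plan_action("email", "confirm"))
    print(plan_action("calendar", "cancel"))
```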

Error responses:

Status Condition
400 action is not confirm or cancel
404 Task not found or has no artifacts
409 Calendar time slot conflict (calendar artifact only)
429 Email rate limit exceeded
502 Upstream execution failure (work-hub unreachable or returned error)

Example:

# Confirm (send email / create event)
curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "confirm"}' | jq .

# Cancel
curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "cancel"}' | jq .

GET /health

Checks connectivity to all three upstream services and reports per-service status.

Response: application/json

{
  "status": "healthy",
  "stt": "up",
  "tts": "up",
  "orchestrator": "up"
}

status is "healthy" when all three upstreams respond. It is "degraded" if any upstream is unreachable. Individual fields reflect per-service state ("up" or "down").

Note: The health endpoint does not probe work-hub — confirm flow availability is not reflected here.
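The aggregation rule can be sketched in a few lines. It assumes the caller has already probed each upstream and recorded a boolean per service; `aggregate_health` is a hypothetical helper name.

```python
from typing import Dict, Mapping


def aggregate_health(reachable: Mapping[str, bool]) -> Dict[str, str]:
    """Combine per-service probe results into the documented response shape."""
    body = {name: ("up" if ok else "down") for name, ok in reachable.items()}
    status = "healthy" if all(reachable.values()) else "degraded"
    return {"status": status, **body}


if __name__ == "__main__":
    print(aggregate_health({"stt": True, "tts": True, "orchestrator": False}))
```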

Example:

curl -s https://voice.haiven.site/health | jq .

GET /metrics

Prometheus metrics endpoint. Scraped automatically by Prometheus via the prometheus.scrape=true Docker label.


Upstream Dependencies

Service Role Internal Address Host Port
haiven-transcribe Speech-to-text (tri-engine: Canary, Parakeet, Whisper Turbo + pyannote diarizer) http://haiven-transcribe:8000
haiven-orchestrator Intent classification and agent dispatch http://haiven-orchestrator:8000 8500
haven-tts-gateway Text-to-speech synthesis http://haven-tts-gateway:8000 8485
work-hub Task store for confirm flow (email send, calendar create) http://work-hub:8030 8030

All services must be reachable on the backend Docker network. The /health endpoint reflects STT, TTS, and orchestrator status. work-hub failures surface as 502 errors on confirm flow endpoints.

Traefik Routing

HTTPS: voice.haiven.site → websecure entrypoint
       middleware: authentik-secure-chain@file (SSO)
       backend: haven-voice-gateway:8000

HTTP:  voice.haiven.site → web entrypoint
       middleware: voice-gateway-redirect (301 → HTTPS)

The service is protected by Authentik SSO. All requests must carry a valid session cookie or be forwarded from a trusted internal caller with an Authentik forward-auth token.

Observability

Logs

# Follow live logs
docker logs -f haven-voice-gateway

# Last 100 lines
docker logs --tail 100 haven-voice-gateway

# With timestamps
docker logs -f -t haven-voice-gateway

Log rotation: 20MB per file, 3 files retained (driver: json-file).

To increase verbosity, set VOICE_LOG_LEVEL=DEBUG in the .env file and restart the container.

Metrics

Prometheus scrapes /metrics at haven-voice-gateway:8000/metrics. Latency headers on each response (X-*-Latency-Ms) can be used to derive per-stage histogram data in Grafana.
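As a rough illustration of deriving histogram data from collected header values, the sketch below counts samples into cumulative buckets in the style of a Prometheus histogram. The bucket boundaries are hypothetical and should be chosen to suit the latencies actually observed.

```python
import bisect
from typing import Dict, List

# Hypothetical upper bounds in milliseconds.
BUCKETS_MS: List[float] = [100, 250, 500, 1000, 2000, 4000]


def cumulative_histogram(samples_ms: List[float],
                         buckets: List[float] = BUCKETS_MS) -> Dict[float, int]:
    """Count samples with value <= each upper bound, plus a final +Inf bucket."""
    counts = [0] * (len(buckets) + 1)
    for sample in samples_ms:
        # Index of the first bound that is >= sample; len(buckets) means +Inf.
        counts[bisect.bisect_left(buckets, sample)] += 1
    result = {}
    running = 0
    for bound, count in zip(buckets + [float("inf")], counts):
        running += count
        result[bound] = running
    return result


if __name__ == "__main__":
    print(cumulative_histogram([120, 800, 1500, 5000]))
```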

Health Check

Docker runs the health check every 30 seconds with a 10-second timeout. The container is marked unhealthy after 3 consecutive failures. start_period is 15 seconds to allow startup time before health checks begin.

Configuration (from docker-compose.yml):

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 15s

Manual health check:

docker inspect haven-voice-gateway --format '{{.State.Health.Status}}'

# Or test the endpoint directly
docker exec haven-voice-gateway curl -f http://localhost:8000/health

Operations

Start / Stop / Restart

cd /mnt/apps/docker/ai/haven-voice-gateway

# Start
docker compose up -d

# Stop
docker compose down

# Restart (brief downtime: this is a single-instance service, so the restart is not rolling)
docker compose restart haven-voice-gateway

# Force recreate (picks up compose changes)
docker compose up -d --force-recreate haven-voice-gateway

Rebuild After Source Changes

cd /mnt/apps/docker/ai/haven-voice-gateway
docker compose build --no-cache haven-voice-gateway
docker compose up -d --force-recreate haven-voice-gateway

Verify Upstream Connectivity

# From inside the container
docker exec haven-voice-gateway curl -s http://haiven-transcribe:8000/health
docker exec haven-voice-gateway curl -s http://haven-tts-gateway:8000/health
docker exec haven-voice-gateway curl -s http://haiven-orchestrator:8000/health
docker exec haven-voice-gateway curl -s http://work-hub:8030/health

Common Issues

502 Bad Gateway from Traefik
The container is unhealthy or hasn't passed its start_period. Check: docker ps (look for (unhealthy) or (starting)), then docker logs haven-voice-gateway.

"status": "degraded" on /health
One or more upstreams are unreachable. Check each service is running and on the backend network:

docker inspect haven-voice-gateway --format '{{range $name, $net := .NetworkSettings.Networks}}{{$name}} {{end}}'
docker network inspect backend --format '{{range .Containers}}{{.Name}} {{end}}'

422 "No speech detected in audio"
The STT engine received audio but found no speech content. Check microphone levels and background noise, or re-record at a sample rate of 16 kHz or higher.

High latency / timeouts
STT and orchestrator are the dominant latency contributors. If X-Orch-Latency-Ms is very high, the orchestrator's LLM backend (llama-swap or vLLM) may be under load or loading a model cold. If X-STT-Latency-Ms is high, check haiven-transcribe GPU utilization.

Audio response is silent or malformed
Enable VOICE_LOG_LEVEL=DEBUG and re-send the request. Look for TTS errors in the logs. Verify haven-tts-gateway is responding correctly:

docker exec haven-voice-gateway curl -s http://haven-tts-gateway:8000/health

Confirm flow returns 404 for task_id
The task does not exist in work-hub, or has no artifacts. Verify the task ID is correct and work-hub is reachable:

docker exec haven-voice-gateway curl -s http://work-hub:8030/health

Confirm flow returns 409 (wrong status)
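On GET /confirm/pending/{task_id}, a 409 means the artifact has already been executed or cancelled; each task can only be confirmed or cancelled once. On POST /confirm/action, a 409 indicates a calendar time slot conflict.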

Security