haven-voice-gateway

Full-duplex voice pipeline gateway for the Haiven AI platform. Routes audio through the STT → orchestrator → TTS pipeline and returns spoken responses. Provides three interaction modes (voice-to-voice, text-to-voice, voice note capture) plus a confirm flow for voice-driven approval of pending actions such as email sends and calendar creates.

Quick Reference

Property	Value
Container	`haven-voice-gateway`
Image	`haiven/haven-voice-gateway:latest`
Host Port	`8490`
Container Port	`8000`
Domain	`voice.haiven.site` (HTTPS, SSO-protected)
Auth	Authentik SSO (`authentik-secure-chain@file`)
Networks	`web`, `backend`
GPU	None (CPU-only orchestration)
Memory	256M limit / 64M reservation
CPU	1 core limit / 0.25 reservation
User	`1000:1000`
Source	`/mnt/apps/src/haven-voice-gateway`
Log Rotation	20MB × 3 files

Architecture

flowchart LR
    A([User Audio]) --> GW[haven-voice-gateway :8490]
    GW -->|POST /transcribe| STT[haiven-transcribe :8000]
    STT -->|transcript| GW
    GW -->|POST /orchestrate| ORCH[haiven-orchestrator :8000]
    ORCH -->|response text + intent| GW
    GW -->|POST /tts| TTS[haven-tts-gateway :8000]
    TTS -->|WAV stream| GW
    GW -->|Streaming WAV| B([User Speaker])
    GW -->|GET/POST /api/v1/...| WH[work-hub :8030]

The gateway is a thin orchestration layer — it holds no state, performs no ML inference, and touches no audio storage. Its job is sequencing upstream calls and streaming the result back to the caller.
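The sequencing described above can be sketched in a few lines. This is an illustrative, synchronous simplification, not the gateway's actual code: the real service is a FastAPI app that streams audio, while this sketch buffers it. The names `run_voice_pipeline`, `PipelineResult`, and `NoSpeechDetected` are hypothetical, and the three callables stand in for the STT, orchestrator, and TTS HTTP clients.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class NoSpeechDetected(Exception):
    """Raised when STT returns an empty transcript (maps to HTTP 422)."""


@dataclass
class PipelineResult:
    transcript: str
    reply_text: str
    intent: str
    audio: bytes
    latency_ms: Dict[str, float]


def _timed(fn, *args):
    """Call fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def run_voice_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    orchestrate: Callable[[str], Tuple[str, str]],
    synthesize: Callable[[str], bytes],
) -> PipelineResult:
    """Sequence STT -> orchestrator -> TTS, timing each stage."""
    transcript, stt_ms = _timed(transcribe, audio)
    if not transcript.strip():
        raise NoSpeechDetected("No speech detected in audio")
    (reply_text, intent), orch_ms = _timed(orchestrate, transcript)
    wav, tts_ms = _timed(synthesize, reply_text)
    return PipelineResult(
        transcript, reply_text, intent, wav,
        {"stt": stt_ms, "orch": orch_ms, "tts": tts_ms},
    )
```

Passing the upstream clients in as callables keeps the sequencing logic free of state and easy to exercise with stubs, which matches the "thin orchestration layer" role described above.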

Pipeline Modes

Mode	Endpoint	Input	Output
Voice-to-voice	`POST /voice`	Audio file (multipart)	Streaming WAV
Text-to-voice	`POST /voice/text`	JSON `{text}`	Streaming WAV
Voice note	`POST /voice/note`	Audio file (multipart)	JSON confirmation
Confirm preview	`GET /confirm/pending/{task_id}`	Task ID	JSON TTS preview
Confirm action	`POST /confirm/action`	JSON `{task_id, action}`	JSON result

Latency Budget (warm path)

Stage	Typical	Target
STT (haiven-transcribe)	~800ms	—
Orchestrator (haiven-orchestrator)	~1500ms	—
TTS (haven-tts-gateway)	~120ms	—
Total end-to-end	~2420ms	3200ms

Cold path (model loading) is excluded from the 3.2s target; warm path assumes all upstream services already have models loaded.
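The arithmetic behind the table can be checked with a small helper. The stage figures and the 3200 ms target come from the table above; the function name `budget_report` is made up for this sketch.

```python
# Typical warm-path stage latencies (ms) from the latency budget table.
WARM_PATH_MS = {"stt": 800, "orchestrator": 1500, "tts": 120}
TARGET_MS = 3200


def budget_report(stages_ms, target_ms):
    """Sum the per-stage latencies and compare the total to the target."""
    total = sum(stages_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": target_ms - total,
        "within_target": total <= target_ms,
    }


report = budget_report(WARM_PATH_MS, TARGET_MS)
print(report)
```

With the typical figures, the total is 2420 ms, which leaves 780 ms of headroom under the target for network hops and gateway overhead.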

Source Layout

Application source lives at /mnt/apps/src/haven-voice-gateway. Inside the container it is mounted at /app/app/.

/app/app/
├── main.py                 # FastAPI app: /voice, /voice/text, /voice/note, /health
├── config.py               # Pydantic Settings (VOICE_ prefix)
├── confirm_flow.py         # Confirm flow router: /confirm/pending, /confirm/action
├── stt_client.py           # HTTP client for haiven-transcribe
├── tts_client.py           # HTTP client for haven-tts-gateway
└── orchestrator_client.py  # HTTP client for haiven-orchestrator

Configuration

All variables use the VOICE_ prefix. Defaults are set in docker-compose.yml; sensitive overrides go in .env.

Variable	Default	Description
`VOICE_STT_URL`	`http://haiven-transcribe:8000`	STT backend base URL
`VOICE_TTS_URL`	`http://haven-tts-gateway:8000`	TTS backend base URL
`VOICE_ORCHESTRATOR_URL`	`http://haiven-orchestrator:8000`	Orchestrator base URL
`VOICE_WORKHUB_URL`	`http://work-hub:8030`	work-hub base URL (confirm flow)
`VOICE_TTS_STYLE`	`fast`	Default TTS rendering style passed to haven-tts-gateway
`VOICE_LOG_LEVEL`	`INFO`	Python logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`)
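As a rough illustration of how the VOICE_ prefix maps environment variables to settings, here is a stdlib-only sketch. The actual service uses Pydantic Settings in config.py; the `VoiceSettings` dataclass, its field names, and `load_settings` are assumptions made for this example, with defaults copied from the table above.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceSettings:
    # Defaults mirror the configuration table.
    stt_url: str = "http://haiven-transcribe:8000"
    tts_url: str = "http://haven-tts-gateway:8000"
    orchestrator_url: str = "http://haiven-orchestrator:8000"
    workhub_url: str = "http://work-hub:8030"
    tts_style: str = "fast"
    log_level: str = "INFO"


def load_settings(
    env: Optional[Mapping[str, str]] = None, prefix: str = "VOICE_"
) -> VoiceSettings:
    """Build settings from env vars named <prefix><FIELD_NAME_UPPERCASE>."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(VoiceSettings):
        key = prefix + field.name.upper()
        if key in env:
            overrides[field.name] = env[key]
    return VoiceSettings(**overrides)


settings = load_settings({"VOICE_LOG_LEVEL": "DEBUG"})
print(settings.log_level, settings.stt_url)
```

Any variable not present in the environment falls back to its default, so a .env file only needs to list the values that differ.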

Additional inherited environment variables (set in compose):

Variable	Value	Purpose
`PYTHONUNBUFFERED`	`1`	Real-time log streaming
`DO_NOT_TRACK`	`1`	Disable telemetry in frameworks
`TZ`	`America/New_York`	Container timezone

API Reference

POST /voice

Full voice pipeline. Accepts uploaded audio, transcribes it, routes the transcript through the orchestrator, synthesizes the response, and streams back WAV audio.

Request: multipart/form-data

Field	Type	Required	Description
`file`	binary (audio)	Yes	Audio file in any format accepted by haiven-transcribe (WAV, MP3, FLAC, OGG, etc.)
`session_id`	string	No	UUID to maintain conversation context across turns; generated if omitted

Response: audio/wav (streaming)

Response headers:

Header	Description
`X-Request-Id`	Unique ID for this request, for log correlation
`X-Total-Latency-Ms`	Wall-clock time for the entire pipeline (ms)
`X-STT-Latency-Ms`	Time spent in haiven-transcribe (ms)
`X-Orch-Latency-Ms`	Time spent in haiven-orchestrator (ms)
`X-TTS-Latency-Ms`	Time spent in haven-tts-gateway (ms)
`X-Intent`	Intent label returned by the orchestrator (e.g. `calendar_query`, `general_chat`)
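A caller can use these headers to see where time went on a given request. The sketch below assumes the header values are plain numeric strings; the helper names `parse_latency_headers` and `unattributed_ms` are hypothetical.

```python
# Header name -> short stage label used in the returned dict.
_LATENCY_HEADERS = (
    ("total", "x-total-latency-ms"),
    ("stt", "x-stt-latency-ms"),
    ("orch", "x-orch-latency-ms"),
    ("tts", "x-tts-latency-ms"),
)


def parse_latency_headers(headers):
    """Extract the latency headers that are present as floats (ms).

    Header names are matched case-insensitively. Missing headers are
    skipped, e.g. X-STT-Latency-Ms on POST /voice/text responses.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        stage: float(lowered[key])
        for stage, key in _LATENCY_HEADERS
        if key in lowered
    }


def unattributed_ms(stages):
    """Total time not accounted for by any upstream stage."""
    upstream = sum(v for k, v in stages.items() if k != "total")
    return stages["total"] - upstream


example = parse_latency_headers({
    "X-Total-Latency-Ms": "2450",
    "X-STT-Latency-Ms": "800",
    "X-Orch-Latency-Ms": "1500",
    "X-TTS-Latency-Ms": "120",
})
print(example, unattributed_ms(example))
```

The difference between the total and the sum of the stages approximates time spent in the gateway itself and on the network between services.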

Privacy note: Audio bytes are zeroed in memory (b"\x00" * len) and the reference deleted immediately after STT processing completes. No audio data is written to disk at any point.
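One way to zero audio in place is sketched below. It assumes the audio is held in a mutable `bytearray`: Python `bytes` objects are immutable and cannot be overwritten, so the technique only applies to mutable buffers. The helper name `zero_buffer` is hypothetical, and this is not necessarily how the gateway implements it.

```python
def zero_buffer(buf: bytearray) -> None:
    """Overwrite every byte of buf with 0x00, in place."""
    buf[:] = b"\x00" * len(buf)


audio = bytearray(b"RIFF\x24\x00\x00\x00WAVEfmt ")
size = len(audio)
zero_buffer(audio)
print(audio == bytearray(size))
```

After zeroing, dropping the last reference (`del audio`) lets the interpreter reclaim the buffer. Copies made elsewhere, for example by an HTTP client while uploading to the STT service, are not affected by zeroing this buffer.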

Error responses:

Status	Condition
422	No speech detected in audio
502	STT, orchestrator, or TTS upstream failed

Example:

curl -s -X POST https://voice.haiven.site/voice \
  -F "file=@recording.wav" \
  -F "session_id=abc-123" \
  --output response.wav \
  -D -

POST /voice/text

Text-to-voice pipeline. Skips STT; sends text directly to the orchestrator and returns spoken audio.

Request: application/json

{
  "text": "What's on my calendar today?",
  "session_id": "optional-uuid"
}

Field	Type	Required	Description
`text`	string	Yes	Text to route through the orchestrator
`session_id`	string	No	Conversation session UUID

Response: audio/wav (streaming) with the same latency headers as POST /voice (minus X-STT-Latency-Ms).

Example:

curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d '{"text": "Set a timer for 10 minutes"}' \
  --output response.wav
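The same request can be built from Python with the standard library. This sketch only constructs the request; the helper name `build_text_request` is hypothetical, and an actual call through voice.haiven.site would also need a valid Authentik session, which is not shown here.

```python
import json
import urllib.request


def build_text_request(text, session_id=None,
                       base_url="https://voice.haiven.site"):
    """Build a POST /voice/text request with a JSON body."""
    payload = {"text": text}
    if session_id is not None:
        payload["session_id"] = session_id
    return urllib.request.Request(
        base_url + "/voice/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_text_request("Set a timer for 10 minutes")
print(req.get_method(), req.full_url)
# To send it and save the audio (requires network access and auth):
# with urllib.request.urlopen(req) as resp, open("response.wav", "wb") as f:
#     f.write(resp.read())
```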

POST /voice/note

Voice note capture. Transcribes uploaded audio and routes it to the orchestrator under the voice_note intent (by prepending "Note to self: " to the transcript). Returns a JSON confirmation rather than audio — suitable for quick capture flows where playback is not needed.

Request: multipart/form-data

Field	Type	Required	Description
`file`	binary (audio)	Yes	Audio recording of the note

Response: application/json

{
  "transcript": "The vendor agreed to 30 days net terms",
  "ingested": true,
  "message": "Note recorded."
}

Field	Type	Description
`transcript`	string	Raw STT output
`ingested`	boolean	`true` when intent resolved to `voice_note` and no clarification was needed
`message`	string	Human-readable status or orchestrator response content

Example:

curl -s -X POST https://voice.haiven.site/voice/note \
  -F "file=@note.wav" | jq .
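The voice note logic described above (prefix the transcript, then decide `ingested` from the orchestrator's reply) might look roughly like this. The orchestrator reply keys `intent`, `needs_clarification`, and `content` are assumptions for illustration; the source does not specify the reply schema, nor exactly when `message` carries the orchestrator's content.

```python
NOTE_PREFIX = "Note to self: "


def note_prompt(transcript: str) -> str:
    """Text sent to the orchestrator for a voice note."""
    return NOTE_PREFIX + transcript


def note_response(transcript: str, orch_reply: dict) -> dict:
    """Build the JSON body returned by POST /voice/note.

    ingested is True only when the intent resolved to voice_note and
    no clarification was requested.
    """
    ingested = (
        orch_reply.get("intent") == "voice_note"
        and not orch_reply.get("needs_clarification", False)
    )
    message = "Note recorded." if ingested else orch_reply.get("content", "")
    return {"transcript": transcript, "ingested": ingested, "message": message}


transcript = "The vendor agreed to 30 days net terms"
print(note_prompt(transcript))
print(note_response(transcript, {"intent": "voice_note"}))
```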

GET /confirm/pending/{task_id}

Fetches a pending artifact from work-hub and returns a TTS-friendly preview string. Used by the voice pipeline to read a draft aloud before asking the user to confirm or cancel.

Path parameter: task_id — work-hub task UUID

Response: application/json

{
  "task_id": "abc-123",
  "artifact_type": "email",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: Here is the summary you requested.",
  "status": "pending_review"
}

Field	Type	Description
`task_id`	string	Task UUID echoed back
`artifact_type`	string	`email`, `calendar`, or `draft`
`tts_preview`	string	Short human-readable preview suitable for TTS readback
`status`	string	Artifact status from work-hub (`pending_review` or `review`)

Error responses:

Status	Condition
404	Task not found, or task has no pending artifacts
409	Artifact status is not `pending_review` or `review` (already executed or cancelled)

Example:

curl -s https://voice.haiven.site/confirm/pending/abc-123 | jq .
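For an email artifact, the preview string in the sample response could be produced by a formatter along these lines. The function name, its parameters, and the 80-character truncation are assumptions; only the output shape is taken from the example above, and the formats used for `calendar` and `draft` artifacts are not documented here.

```python
def email_preview(to: str, subject: str, body: str, max_chars: int = 80) -> str:
    """Format a short, speakable preview of an email artifact."""
    if len(body) <= max_chars:
        snippet = body
    else:
        # Truncate long bodies so the TTS readback stays brief.
        snippet = body[:max_chars].rstrip() + "..."
    return f"Email to {to}. Subject: {subject}. Preview: {snippet}"


preview = email_preview(
    "alice@example.com", "Q2 report", "Here is the summary you requested."
)
print(preview)
```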

POST /confirm/action

Executes or cancels a pending action. The confirm flow state machine transitions:

pending_review → (confirm) → executed
pending_review → (cancel)  → cancelled
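The two transitions above can be expressed as a lookup table. This is a minimal sketch with hypothetical names (`TRANSITIONS`, `next_status`, `InvalidTransition`). It models only the `pending_review` transitions shown in the diagram; how the `review` status is handled is not specified there, so it is left out.

```python
# (current status, action) -> new status, as drawn in the diagram.
TRANSITIONS = {
    ("pending_review", "confirm"): "executed",
    ("pending_review", "cancel"): "cancelled",
}
VALID_ACTIONS = ("confirm", "cancel")


class InvalidTransition(Exception):
    """The artifact is not in a state that allows the action."""


def next_status(status: str, action: str) -> str:
    """Return the status that follows applying action to status."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {VALID_ACTIONS}")
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise InvalidTransition(f"cannot {action} from {status}") from None


print(next_status("pending_review", "confirm"))
```

Because `executed` and `cancelled` have no outgoing transitions, a second confirm or cancel on the same artifact is rejected rather than repeated.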

Request: application/json

{
  "task_id": "abc-123",
  "action": "confirm"
}

Field	Type	Required	Description
`task_id`	string	Yes	work-hub task UUID
`action`	string	Yes	`confirm` to execute, `cancel` to discard

Response: application/json

{
  "status": "executed",
  "action": "confirm",
  "detail": "Email sent. Message ID: msg_xyz",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: ..."
}

Confirm behaviour by artifact type:

Artifact type	Action taken
`email`	Calls `POST /api/v1/email/send` on work-hub
`calendar`	Calls `POST /api/v1/calendar/events` on work-hub
`draft`	Marks artifact as approved (no external call)

After execution, work-hub task status is patched to done. On cancel, task status is patched to cancelled.
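The dispatch rules in the table and the task-status patching can be summarized as data. This is a planning sketch only: it returns what would be called rather than calling work-hub, and `plan_confirm` and the dictionary names are hypothetical.

```python
# Artifact type -> (method, work-hub path) called on confirm.
# None means no external call is made (draft is only marked approved).
CONFIRM_TARGETS = {
    "email": ("POST", "/api/v1/email/send"),
    "calendar": ("POST", "/api/v1/calendar/events"),
    "draft": None,
}

# Action -> task status patched on work-hub afterwards.
TASK_STATUS_AFTER = {"confirm": "done", "cancel": "cancelled"}


def plan_confirm(artifact_type: str, action: str) -> dict:
    """Describe the work-hub call and task status for an action."""
    if action not in TASK_STATUS_AFTER:
        raise ValueError("action must be 'confirm' or 'cancel'")
    if artifact_type not in CONFIRM_TARGETS:
        raise ValueError(f"unknown artifact type: {artifact_type}")
    call = CONFIRM_TARGETS[artifact_type] if action == "confirm" else None
    return {"workhub_call": call, "task_status": TASK_STATUS_AFTER[action]}


print(plan_confirm("email", "confirm"))
```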

Error responses:

Status	Condition
400	`action` is not `confirm` or `cancel`
404	Task not found or has no artifacts
409	Calendar time slot conflict (calendar artifact only)
429	Email rate limit exceeded
502	Upstream execution failure (work-hub unreachable or returned error)

Example:

# Confirm (send email / create event)
curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "confirm"}' | jq .

# Cancel
curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "cancel"}' | jq .

GET /health

Checks connectivity to all three upstream services and reports per-service status.

Response: application/json

{
  "status": "healthy",
  "stt": "up",
  "tts": "up",
  "orchestrator": "up"
}

status is "healthy" when all three upstreams respond. It is "degraded" if any upstream is unreachable. Individual fields reflect per-service state ("up" or "down").

Note: The health endpoint does not probe work-hub — confirm flow availability is not reflected here.
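The aggregation rule is simple enough to state in code. In this sketch the per-service reachability results are passed in as booleans; the function name `aggregate_health` is hypothetical, and how each upstream is actually probed is not shown.

```python
def aggregate_health(reachable: dict) -> dict:
    """Combine per-service reachability into the /health response body.

    reachable maps a service name ("stt", "tts", "orchestrator") to
    True if the upstream responded and False otherwise.
    """
    status = "healthy" if all(reachable.values()) else "degraded"
    body = {"status": status}
    for name, ok in reachable.items():
        body[name] = "up" if ok else "down"
    return body


print(aggregate_health({"stt": True, "tts": True, "orchestrator": True}))
print(aggregate_health({"stt": True, "tts": False, "orchestrator": True}))
```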

Example:

curl -s https://voice.haiven.site/health | jq .

GET /metrics

Prometheus metrics endpoint. Scraped automatically by Prometheus via the prometheus.scrape=true Docker label.

Upstream Dependencies

Service	Role	Internal Address	Host Port
`haiven-transcribe`	Speech-to-text (tri-engine: Canary, Parakeet, Whisper Turbo + pyannote diarizer)	`http://haiven-transcribe:8000`	—
`haiven-orchestrator`	Intent classification and agent dispatch	`http://haiven-orchestrator:8000`	8500
`haven-tts-gateway`	Text-to-speech synthesis	`http://haven-tts-gateway:8000`	8485
`work-hub`	Task store for confirm flow (email send, calendar create)	`http://work-hub:8030`	8030

All services must be reachable on the backend Docker network. The /health endpoint reflects STT, TTS, and orchestrator status. work-hub failures surface as 502 errors on confirm flow endpoints.

Traefik Routing

HTTPS: voice.haiven.site → websecure entrypoint
       middleware: authentik-secure-chain@file (SSO)
       backend: haven-voice-gateway:8000

HTTP:  voice.haiven.site → web entrypoint
       middleware: voice-gateway-redirect (301 → HTTPS)

The service is protected by Authentik SSO. All requests must carry a valid session cookie or be forwarded from a trusted internal caller with an Authentik forward-auth token.

Observability

Logs

# Follow live logs
docker logs -f haven-voice-gateway

# Last 100 lines
docker logs --tail 100 haven-voice-gateway

# With timestamps
docker logs -f -t haven-voice-gateway

Log rotation: 20MB per file, 3 files retained (driver: json-file).

To increase verbosity, set VOICE_LOG_LEVEL=DEBUG in the .env file and restart the container.

Metrics

Prometheus scrapes /metrics at haven-voice-gateway:8000/metrics. The latency headers on each response (X-*-Latency-Ms) expose per-stage timings that a caller can record to build per-stage histograms in Grafana.

Health Check

Docker runs the health check every 30 seconds with a 10-second timeout. The container is marked unhealthy after 3 consecutive failures. start_period is 15 seconds to allow startup time before health checks begin.

Configuration (from docker-compose.yml):

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 15s

Manual health check:

docker inspect haven-voice-gateway --format '{{.State.Health.Status}}'

# Or test the endpoint directly
docker exec haven-voice-gateway curl -f http://localhost:8000/health

Operations

Start / Stop / Restart

cd /mnt/apps/docker/ai/haven-voice-gateway

# Start
docker compose up -d

# Stop
docker compose down

# Restart (stops and starts the container; expect a brief interruption on this single-instance service)
docker compose restart haven-voice-gateway

# Force recreate (picks up compose changes)
docker compose up -d --force-recreate haven-voice-gateway

Rebuild After Source Changes

cd /mnt/apps/docker/ai/haven-voice-gateway
docker compose build --no-cache haven-voice-gateway
docker compose up -d --force-recreate haven-voice-gateway

Verify Upstream Connectivity

# From inside the container
docker exec haven-voice-gateway curl -s http://haiven-transcribe:8000/health
docker exec haven-voice-gateway curl -s http://haven-tts-gateway:8000/health
docker exec haven-voice-gateway curl -s http://haiven-orchestrator:8000/health
docker exec haven-voice-gateway curl -s http://work-hub:8030/health

Common Issues

502 Bad Gateway from Traefik
The container is unhealthy or hasn't passed its start_period. Check: docker ps (look for (unhealthy) or (starting)), then docker logs haven-voice-gateway.

"status": "degraded" on /health
One or more upstreams are unreachable. Check each service is running and on the backend network:

docker inspect haven-voice-gateway --format '{{range $name, $net := .NetworkSettings.Networks}}{{$name}} {{end}}'
docker network inspect backend --format '{{range .Containers}}{{.Name}} {{end}}'

422 "No speech detected in audio"
The STT engine received audio but found no speech content. Check microphone levels and background noise, and confirm the recording is sampled at 16 kHz or higher.

High latency / timeouts
STT and orchestrator are the dominant latency contributors. If X-Orch-Latency-Ms is very high, the orchestrator's LLM backend (llama-swap or vLLM) may be under load or loading a model cold. If X-STT-Latency-Ms is high, check haiven-transcribe GPU utilization.

Audio response is silent or malformed
Enable VOICE_LOG_LEVEL=DEBUG and re-send the request. Look for TTS errors in the logs. Verify haven-tts-gateway is responding correctly:

docker exec haven-voice-gateway curl -s http://haven-tts-gateway:8000/health

Confirm flow returns 404 for task_id
The task does not exist in work-hub, or has no artifacts. Verify the task ID is correct and work-hub is reachable:

docker exec haven-voice-gateway curl -s http://work-hub:8030/health

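Confirm flow returns 409
On GET /confirm/pending/{task_id}, a 409 means the artifact has already been executed or cancelled; each task can only be confirmed or cancelled once. On POST /confirm/action, a 409 means a calendar time slot conflict (calendar artifacts only).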

Security

All external traffic is HTTPS (TLS via Traefik).
HTTP is permanently redirected to HTTPS (301).
Access requires a valid Authentik SSO session.
Audio data is never written to disk — zeroed in memory (b"\x00" * len) after STT use and immediately dereferenced.
The container runs as UID/GID 1000 (non-root).
No GPU access, so the container exposes no CUDA or GPU driver attack surface.
Confirm flow actions are safe against repeated submission: work-hub enforces single execution via status transitions.