haven-voice-gateway — User Guide

This guide covers how to use the haven-voice-gateway service: sending voice to the AI, capturing voice notes, getting spoken responses, and confirming or cancelling pending actions such as email sends and calendar events. It is written for users of the Haiven platform, not for operators managing the service.

What Is This?

haven-voice-gateway is the voice interface for the Haiven AI. You speak (or type), and Haiven speaks back. When the AI proposes an action — sending an email, creating a calendar event — you can confirm or cancel it by voice before anything is sent.

Under the hood it connects four pieces: speech recognition (to hear you), the AI orchestrator (to think and act), speech synthesis (to reply), and work-hub (to execute confirmed actions). From your perspective, you send audio in and get audio or a confirmation back.

Getting Access

The gateway lives at https://voice.haiven.site. It is protected by Authentik SSO — you must be signed in to the Haiven platform to use it. If you hit the URL in a browser without a session, you will be redirected to the Authentik login page.

Connection details:
- URL: https://voice.haiven.site (HTTPS required)
- Certificate: Self-signed (internal TLS)
- Authentication: Authentik SSO (middleware: authentik-secure-chain@file)
- HTTP redirect: Plain HTTP requests receive a permanent (301) redirect to HTTPS

All API requests from code also require a valid session cookie or a service token from Authentik.

When using curl or HTTP clients with self-signed certificates, you may need to add --insecure (or equivalent) to skip certificate verification, or import the Haiven internal CA certificate.
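
For scripted access, both points above (the self-signed certificate and the Authentik credential) can be set up once and reused. The sketch below uses only the Python standard library and builds a request without sending it. The helper names, the CA bundle path, and the Bearer scheme for service tokens are assumptions for illustration; ask your Haiven administrator for the actual token format and cookie name.

```python
import ssl
import urllib.request
from typing import Optional

BASE = "https://voice.haiven.site"


def make_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Trust the internal CA bundle if a path is given, else the system store."""
    # Importing the CA keeps verification on, unlike curl's --insecure.
    return ssl.create_default_context(cafile=ca_file)


def build_request(path: str, token: Optional[str] = None,
                  cookie: Optional[str] = None) -> urllib.request.Request:
    """Build an unsent request carrying a service token or a session cookie."""
    headers = {}
    if token:
        # Assumption: service tokens are presented as a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    if cookie:
        # Assumption: the caller supplies a complete "name=value" cookie string.
        headers["Cookie"] = cookie
    return urllib.request.Request(f"{BASE}{path}", headers=headers)
```

A request built this way could then be sent with urllib.request.urlopen(req, context=make_tls_context("/path/to/ca.pem")), where the path is a placeholder.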

Interaction Modes

There are five ways to interact. The last two together make up the confirm flow (Mode 4):

Mode              When to Use
Voice-to-voice    You have a recording and want a spoken reply
Text-to-voice     You want to type something and hear the response
Voice note        You want to capture a spoken note without needing an audio reply
Confirm preview   You want to hear what a pending action contains before deciding
Confirm action    You want to approve or cancel a pending email send or calendar create

Mode 1: Voice-to-Voice

Send an audio recording. Get spoken audio back.

POST https://voice.haiven.site/voice
Content-Type: multipart/form-data

Fields:
  file       — your audio recording (WAV, MP3, FLAC, OGG, or M4A)
  session_id — (optional) a UUID to keep conversation context across turns

Using curl:

curl -X POST https://voice.haiven.site/voice \
  -F "file=@my_question.wav" \
  --output haiven_reply.wav

With a session ID (to continue a conversation):

curl -X POST https://voice.haiven.site/voice \
  -F "file=@followup.wav" \
  -F "session_id=550e8400-e29b-41d4-a716-446655440000" \
  --output reply2.wav

What comes back:

A streaming WAV file. Play it with any audio player (aplay, ffplay, VLC, etc.):

curl -X POST https://voice.haiven.site/voice \
  -F "file=@question.wav" | aplay -

The response headers tell you how long each stage took:

Header               Meaning
X-Total-Latency-Ms   Full round-trip time
X-STT-Latency-Ms     Time to transcribe your audio
X-Orch-Latency-Ms    Time for the AI to produce a response
X-TTS-Latency-Ms     Time to synthesize the spoken reply
X-Intent             What kind of request the AI detected
X-Request-Id         Unique ID for this request (useful for troubleshooting)

To see the headers:

curl -X POST https://voice.haiven.site/voice \
  -F "file=@question.wav" \
  --output reply.wav \
  -D /dev/stderr

Typical latency: Around 2.5 seconds end-to-end on a warm system. First request of the day may be slower if upstream services are loading models.
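
If you collect the latency headers in a script, a small helper can show where the time went. This is a sketch that assumes the header values are plain numbers in milliseconds; the function name and the "other" bucket (time not attributed to any stage) are illustrative, not part of the gateway.

```python
STAGE_HEADERS = {
    "stt": "X-STT-Latency-Ms",
    "orchestrator": "X-Orch-Latency-Ms",
    "tts": "X-TTS-Latency-Ms",
}


def summarize_latency(headers: dict) -> dict:
    """Return per-stage latency in ms, the total, and the unattributed remainder."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {k.lower(): v for k, v in headers.items()}
    stages = {}
    for name, header in STAGE_HEADERS.items():
        raw = lowered.get(header.lower())
        if raw is not None:  # a stage header may be absent, e.g. no STT for typed text
            stages[name] = float(raw)
    total = float(lowered["x-total-latency-ms"])
    accounted = sum(stages.values())
    return {**stages, "total": total, "other": total - accounted}
```

With httpx, dict(r.headers) can be passed in directly.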

Your audio is private: The gateway zeroes your audio bytes in memory immediately after transcription. Nothing is saved to disk.


Mode 2: Text-to-Voice

Type something and get a spoken response — no microphone needed.

POST https://voice.haiven.site/voice/text
Content-Type: application/json

{
  "text": "What's the weather like in New York today?",
  "session_id": "optional-uuid"
}

Using curl:

curl -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d '{"text": "Summarize my tasks for today"}' \
  --output reply.wav

Play it immediately:

curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d '{"text": "Tell me something interesting"}' | aplay -

The response is the same streaming WAV format as voice-to-voice, with the same latency headers (minus the STT stage, since there is no audio to transcribe).

Maintaining conversation context:

SESSION="$(python3 -c 'import uuid; print(uuid.uuid4())')"

# First turn
curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"What's on my calendar?\", \"session_id\": \"$SESSION\"}" \
  | aplay -

# Follow-up in the same session
curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Move the 3pm meeting to 4pm\", \"session_id\": \"$SESSION\"}" \
  | aplay -
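
The same pattern in Python avoids the shell quoting seen above, because json.dumps escapes apostrophes and quotes for you. The Conversation class here is a hypothetical helper, not something the gateway provides; it only holds one session_id and builds request bodies for /voice/text.

```python
import json
import uuid


class Conversation:
    """Holds one session_id and builds /voice/text request bodies for it."""

    def __init__(self) -> None:
        # One UUID per conversation, reused on every turn.
        self.session_id = str(uuid.uuid4())

    def body(self, text: str) -> bytes:
        """Return the JSON request body for one turn, encoded as UTF-8."""
        payload = {"text": text, "session_id": self.session_id}
        return json.dumps(payload).encode("utf-8")
```

Each call to body() on the same object carries the same session_id, so the orchestrator can treat the turns as one conversation.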

Mode 3: Voice Note

Capture a spoken note. The AI transcribes it and stores it — no spoken reply is sent back.

This mode is useful when you want to quickly dictate something (a task, a decision, an observation) and have it ingested without sitting through audio playback.

POST https://voice.haiven.site/voice/note
Content-Type: multipart/form-data

Fields:
  file — your audio recording

Using curl:

curl -X POST https://voice.haiven.site/voice/note \
  -F "file=@note.wav"

Example response:

{
  "transcript": "The vendor agreed to 30 days net terms",
  "ingested": true,
  "message": "Note recorded."
}

Field        Meaning
transcript   Exactly what the AI heard
ingested     true if the note was successfully stored
message      Human-readable confirmation or orchestrator response

If ingested is false, the transcript is still returned but something prevented the note from being stored — check with your Haiven administrator.
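
In a script, it is worth checking ingested explicitly so that a failed store is not mistaken for success. A minimal sketch, using only the three fields shown above; the function name and the choice to raise an exception are illustrative.

```python
def transcript_or_raise(note: dict) -> str:
    """Return the transcript if the note was stored, else raise with context."""
    if not note.get("ingested", False):
        # The transcript is still returned by the gateway, so surface it
        # in the error rather than losing what was dictated.
        raise RuntimeError(
            f"Note not stored ({note.get('message', '')!r}); "
            f"transcript was: {note.get('transcript', '')!r}"
        )
    return note["transcript"]
```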

Quick capture with jq:

curl -s -X POST https://voice.haiven.site/voice/note \
  -F "file=@note.wav" | jq '.transcript'

Mode 4: Confirm Flow

When the AI orchestrator prepares an action that requires your explicit approval — such as sending an email or creating a calendar event — it creates a pending task in work-hub and returns the task ID. The confirm flow lets you review that pending action and approve or cancel it.

This two-step process (preview, then act) ensures you always know exactly what will be sent or created before it happens.

Step 1: Get a Preview

Fetch a human-readable summary of the pending action. The gateway formats it specifically for TTS readback so you can hear it aloud.

GET https://voice.haiven.site/confirm/pending/{task_id}

Using curl:

curl -s https://voice.haiven.site/confirm/pending/abc-123 | jq .

Example response (email):

{
  "task_id": "abc-123",
  "artifact_type": "email",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: Here is the summary you requested.",
  "status": "pending_review"
}

Example response (calendar event):

{
  "task_id": "abc-123",
  "artifact_type": "calendar",
  "tts_preview": "Calendar event: Team sync at 2026-03-02T14:00:00 with alice@example.com, bob@example.com.",
  "status": "pending_review"
}

The tts_preview field is what the voice interface will read aloud. For emails it includes the recipient, subject, and the first two sentences of the body. For calendar events it includes the title, time, and up to three attendees.
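
To get a feel for what the preview leaves out, the truncation rules described above can be imitated locally. This is an illustration only, not the gateway's code: the function names are made up, the output wording is copied from the example responses, and the sentence splitter (a break after ., ! or ? followed by whitespace) is an assumption.

```python
import re


def email_preview(to: str, subject: str, body: str) -> str:
    """Recipient, subject, and the first two sentences of the body."""
    # Naive sentence split; the real gateway may split differently.
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    return f"Email to {to}. Subject: {subject}. Preview: {' '.join(sentences[:2])}"


def calendar_preview(title: str, start: str, attendees: list) -> str:
    """Title, start time, and at most three attendees."""
    return f"Calendar event: {title} at {start} with {', '.join(attendees[:3])}."
```

Anything beyond the second sentence or the third attendee is not read aloud, so check the full draft elsewhere if those details matter.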

Step 2: Confirm or Cancel

Once you've reviewed the preview, send your decision. Set action to "confirm" or "cancel":

POST https://voice.haiven.site/confirm/action
Content-Type: application/json

{
  "task_id": "abc-123",
  "action": "confirm"
}

Confirm (execute the action):

curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "confirm"}' | jq .

Example confirm response:

{
  "status": "executed",
  "action": "confirm",
  "detail": "Email sent. Message ID: msg_xyz",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: ..."
}

Cancel (discard the action):

curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "cancel"}' | jq .

Example cancel response:

{
  "status": "cancelled",
  "action": "cancel",
  "detail": "Email cancelled.",
  "tts_preview": ""
}

What happens on confirm:

Pending type     Action taken
Email draft      Sends the email via work-hub
Calendar event   Creates the event via work-hub
Generic draft    Marks as approved (no external action)

Important: Each task can only be confirmed or cancelled once. If you try to act on a task that's already been executed or cancelled, you'll get a 409 error.
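
A client can avoid most 409 responses by checking the preview before acting. The sketch below builds the /confirm/action body only when the preview still reports pending_review and the action is one of the two accepted values. The function name is hypothetical, and the check is a convenience on the client side; the gateway remains the authority on task state.

```python
VALID_ACTIONS = {"confirm", "cancel"}


def build_action(preview: dict, action: str) -> dict:
    """Return the /confirm/action body, refusing tasks that are no longer pending."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be 'confirm' or 'cancel', got {action!r}")
    if preview.get("status") != "pending_review":
        # Acting on an executed or cancelled task would be rejected with 409.
        raise RuntimeError(
            f"Task {preview.get('task_id')} is {preview.get('status')!r}, not pending_review"
        )
    return {"task_id": preview["task_id"], "action": action}
```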


Checking Service Status

Before sending requests, you can verify the pipeline is fully healthy:

curl -s https://voice.haiven.site/health | jq .

Expected response when everything is running:

{
  "status": "healthy",
  "stt": "up",
  "tts": "up",
  "orchestrator": "up"
}

If you see "status": "degraded", one of the upstream AI services (transcription, orchestrator, or TTS) is temporarily unavailable. The individual fields (stt, tts, orchestrator) identify which one. Try again in a minute, or contact your Haiven administrator.

Note: the health check does not include work-hub. If confirm flow endpoints return errors but /health shows healthy, work-hub may be the issue.
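
To pick out the failing upstream programmatically, compare each field against "up". A small sketch based on the response shape above; the function name is illustrative, and it treats any value other than "up" (including a missing field) as down.

```python
UPSTREAMS = ("stt", "tts", "orchestrator")


def down_services(health: dict) -> list:
    """List the upstream fields whose value is anything other than 'up'."""
    return [name for name in UPSTREAMS if health.get(name) != "up"]
```

An empty list means all three upstreams report up; as noted, work-hub is not covered by this check.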


Audio Format Notes

The gateway accepts any audio format that haiven-transcribe can handle. Supported formats include WAV, MP3, FLAC, OGG, and M4A.

Recommended recording settings for best transcription accuracy:
- Sample rate: 16kHz or higher
- Channels: Mono (stereo works but adds no accuracy benefit)
- Bit depth: 16-bit PCM (for WAV)

If transcription results are poor, try recording at a higher sample rate or with less background noise.
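
For WAV files, the recommended settings can be checked before upload with Python's standard wave module. This is a sketch: the function name and warning texts are illustrative, and it only reads uncompressed PCM WAV files, not the other accepted formats.

```python
import wave


def check_wav(path_or_file) -> list:
    """Return warnings for a WAV file; an empty list means it matches the recommendations."""
    warnings = []
    with wave.open(path_or_file, "rb") as w:
        if w.getframerate() < 16000:
            warnings.append(f"sample rate {w.getframerate()} Hz is below 16 kHz")
        if w.getnchannels() != 1:
            warnings.append(f"{w.getnchannels()} channels; mono is sufficient")
        if w.getsampwidth() != 2:
            # getsampwidth() is in bytes, so multiply by 8 for bits.
            warnings.append(f"{8 * w.getsampwidth()}-bit samples; 16-bit PCM is recommended")
    return warnings
```

The argument can be a file path or an open binary file object.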


Building a Simple Client

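Here is a minimal Python example using httpx. It needs Python 3.10 or later (for the str | None annotations) and leaves out authentication and certificate handling, which are covered under Getting Access: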

import httpx

BASE = "https://voice.haiven.site"


def ask_voice(audio_path: str, session_id: str | None = None) -> bytes:
    """Send an audio file and return the WAV response bytes."""
    with open(audio_path, "rb") as f:
        data = {"session_id": session_id} if session_id else {}
        r = httpx.post(
            f"{BASE}/voice",
            files={"file": f},
            data=data,
            timeout=30.0,
        )
        r.raise_for_status()
        print("Intent:", r.headers.get("X-Intent"))
        print("Total latency:", r.headers.get("X-Total-Latency-Ms"), "ms")
        return r.content


def ask_text(text: str, session_id: str | None = None) -> bytes:
    """Send text and return the WAV response bytes."""
    payload = {"text": text}
    if session_id:
        payload["session_id"] = session_id
    r = httpx.post(f"{BASE}/voice/text", json=payload, timeout=30.0)
    r.raise_for_status()
    return r.content


def capture_note(audio_path: str) -> dict:
    """Capture a voice note and return the JSON confirmation."""
    with open(audio_path, "rb") as f:
        r = httpx.post(f"{BASE}/voice/note", files={"file": f}, timeout=30.0)
        r.raise_for_status()
        return r.json()


def confirm_preview(task_id: str) -> dict:
    """Fetch a TTS-friendly preview of a pending action."""
    r = httpx.get(f"{BASE}/confirm/pending/{task_id}", timeout=10.0)
    r.raise_for_status()
    return r.json()


def confirm_action(task_id: str, action: str = "confirm") -> dict:
    """Confirm or cancel a pending action. action must be 'confirm' or 'cancel'."""
    r = httpx.post(
        f"{BASE}/confirm/action",
        json={"task_id": task_id, "action": action},
        timeout=30.0,
    )
    r.raise_for_status()
    return r.json()


# Example usage
wav_bytes = ask_text("What are my tasks today?")
with open("reply.wav", "wb") as f:
    f.write(wav_bytes)

note = capture_note("meeting_note.wav")
print(note["transcript"])

# Confirm flow
preview = confirm_preview("abc-123")
print("Preview:", preview["tts_preview"])
result = confirm_action("abc-123", action="confirm")
print("Result:", result["detail"])

Frequently Asked Questions

How long does a response take?

On a warm system (all services running with models loaded), expect around 2–3 seconds end-to-end for a typical short question. The speech recognition stage takes roughly 800ms, the AI orchestrator around 1.5 seconds, and TTS synthesis around 120ms. Longer inputs or complex queries may take more time.

Is my voice data stored?

No. Audio bytes are cleared from memory immediately after transcription. The transcript (the text version of what you said) may be passed to the orchestrator and logged for tracing purposes, consistent with how all other Haiven AI interactions are handled.

Can I use it without a microphone?

Yes. Use POST /voice/text to send typed text and get a spoken response back.

What languages are supported?

Language support depends on the haiven-transcribe backend. The default configuration uses Whisper Turbo, which supports a broad set of languages. Contact your administrator for specifics about your deployment.

What if the AI doesn't understand me?

Check the X-Intent response header to see what the orchestrator classified your request as. If transcription was poor, try recording in a quieter environment at 16kHz or higher. If the intent is wrong, rephrase more explicitly.

Can I keep a conversation going across multiple requests?

Yes. Generate a UUID for your session and pass it as session_id in each request. The orchestrator uses this to maintain context across turns. Sessions do not expire on a fixed timer, but the context window of the underlying LLM eventually limits how far back the AI can "remember."

The service shows "degraded" — what do I do?

Check which upstream is down (stt, tts, or orchestrator) in the health response. If it does not recover within a minute or two, contact your Haiven administrator.

I got a 409 error on the confirm flow.

There are two causes: the task status is not pending_review (it may have already been executed or cancelled), or a calendar event conflicts with an existing slot. The error detail field explains which.

The confirm flow returns 502 — what happened?

work-hub or the downstream service (email, calendar) returned an error. The action was not executed. Check with your administrator. The task remains in execution_failed state and may be retriable.

Can I cancel after confirming?

No. Once an action reaches executed state it cannot be undone through the gateway. For email, contact your email admin. For calendar, delete the event through your calendar client.