This guide covers how to use the haven-voice-gateway service: sending voice to the AI, capturing voice notes, getting spoken responses, and confirming or cancelling pending actions such as email sends and calendar events. It is written for users of the Haiven platform, not for operators managing the service.

What Is This?

haven-voice-gateway is the voice interface for the Haiven AI. You speak (or type), and Haiven speaks back. When the AI proposes an action — sending an email, creating a calendar event — you can confirm or cancel it by voice before anything is sent.

Under the hood it connects four pieces: speech recognition (to hear you), the AI orchestrator (to think and act), speech synthesis (to reply), and work-hub (to execute confirmed actions). From your perspective, you send audio in and get audio or a confirmation back.

Getting Access

The gateway lives at https://voice.haiven.site. It is protected by Authentik SSO — you must be signed in to the Haiven platform to use it. If you hit the URL in a browser without a session, you will be redirected to the Authentik login page.

Connection details:
- URL: https://voice.haiven.site (HTTPS required)
- Certificate: Self-signed (internal TLS)
- Authentication: Authentik SSO (middleware: authentik-secure-chain@file)
- HTTP redirect: Plain HTTP requests are automatically redirected to HTTPS with a permanent 301 redirect

All API requests from code also require a valid session cookie or a service token from Authentik.

When using curl or HTTP clients with self-signed certificates, you may need to add --insecure (or equivalent) to skip certificate verification, or import the Haiven internal CA certificate.

Interaction Modes

Mode	When to Use
Voice-to-voice	You have a recording and want a spoken reply
Text-to-voice	You want to type something and hear the response
Voice note	You want to capture a spoken note without needing an audio reply
Confirm preview	You want to hear what a pending action contains before deciding
Confirm action	You want to approve or cancel a pending email send or calendar create

Mode 1: Voice-to-Voice

Send an audio recording and receive a spoken reply.

POST https://voice.haiven.site/voice
Content-Type: multipart/form-data

Fields:
  file       — your audio recording (WAV, MP3, FLAC, OGG, or M4A)
  session_id — (optional) a UUID to keep conversation context across turns

curl -X POST https://voice.haiven.site/voice \
  -F "file=@my_question.wav" \
  --output haiven_reply.wav

curl -X POST https://voice.haiven.site/voice \
  -F "file=@followup.wav" \
  -F "session_id=550e8400-e29b-41d4-a716-446655440000" \
  --output reply2.wav

The response is a streaming WAV file. Play it with any audio player (aplay, ffplay, VLC, etc.):

curl -X POST https://voice.haiven.site/voice \
  -F "file=@question.wav" | aplay -

Header	Meaning
`X-Total-Latency-Ms`	Full round-trip time
`X-STT-Latency-Ms`	Time to transcribe your audio
`X-Orch-Latency-Ms`	Time for the AI to produce a response
`X-TTS-Latency-Ms`	Time to synthesize the spoken reply
`X-Intent`	What kind of request the AI detected
`X-Request-Id`	Unique ID for this request (useful for troubleshooting)

curl -X POST https://voice.haiven.site/voice \
  -F "file=@question.wav" \
  --output reply.wav \
  -D /dev/stderr
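The latency headers can also be read programmatically to see where the time went. The following is a minimal sketch, assuming each header value is a plain integer number of milliseconds and that the headers arrive in a mapping keyed by the exact names listed above. The function name `latency_breakdown` and the `other` field (time not attributed to any stage) are illustrative, not part of the gateway.

```python
from collections.abc import Mapping

STAGE_HEADERS = {
    "stt": "X-STT-Latency-Ms",
    "orchestrator": "X-Orch-Latency-Ms",
    "tts": "X-TTS-Latency-Ms",
}


def latency_breakdown(headers: Mapping[str, str]) -> dict[str, int]:
    """Return per-stage latency in ms plus time not attributed to any stage."""
    stages = {
        name: int(headers[header])
        for name, header in STAGE_HEADERS.items()
        if header in headers  # a stage header may be absent, e.g. no STT stage
    }
    total = int(headers["X-Total-Latency-Ms"])
    staged = sum(stages.values())  # sum before adding the derived fields
    stages["total"] = total
    stages["other"] = total - staged
    return stages
```

A large `other` value would suggest time spent outside the three named stages, for example in network transfer.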

Typical latency: Around 2.5 seconds end-to-end on a warm system. First request of the day may be slower if upstream services are loading models.

Your audio is private: The gateway zeroes your audio bytes in memory immediately after transcription. Nothing is saved to disk.

Mode 2: Text-to-Voice

POST https://voice.haiven.site/voice/text
Content-Type: application/json

{
  "text": "What's the weather like in New York today?",
  "session_id": "optional-uuid"
}

curl -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d '{"text": "Summarize my tasks for today"}' \
  --output reply.wav

curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d '{"text": "Tell me something interesting"}' | aplay -

The response is the same streaming WAV format as voice-to-voice, with the same latency headers (minus the STT stage, since there is no audio to transcribe).

SESSION="$(python3 -c 'import uuid; print(uuid.uuid4())')"

# First turn
curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"What's on my calendar?\", \"session_id\": \"$SESSION\"}" \
  | aplay -

# Follow-up in the same session
curl -s -X POST https://voice.haiven.site/voice/text \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Move the 3pm meeting to 4pm\", \"session_id\": \"$SESSION\"}" \
  | aplay -

Mode 3: Voice Note

Capture a spoken note. The AI transcribes it and stores it — no spoken reply is sent back.

This mode is useful when you want to quickly dictate something (a task, a decision, an observation) and have it ingested without sitting through audio playback.

POST https://voice.haiven.site/voice/note
Content-Type: multipart/form-data

Fields:
  file — your audio recording

curl -X POST https://voice.haiven.site/voice/note \
  -F "file=@note.wav"

{
  "transcript": "The vendor agreed to 30 days net terms",
  "ingested": true,
  "message": "Note recorded."
}

Field	Meaning
`transcript`	Exactly what the AI heard
`ingested`	`true` if the note was successfully stored
`message`	Human-readable confirmation or orchestrator response

If ingested is false, the transcript is still returned but something prevented the note from being stored — check with your Haiven administrator.

curl -s -X POST https://voice.haiven.site/voice/note \
  -F "file=@note.wav" | jq '.transcript'
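In a script, the `ingested` flag is worth checking explicitly, because a failed store still returns a transcript and an HTTP client will not treat it as an error by itself. A minimal sketch, assuming the response has already been parsed into a dict with the three fields shown above; the function name `describe_note` and its message wording are hypothetical.

```python
def describe_note(response: dict) -> str:
    """Turn a /voice/note JSON response into a one-line status message."""
    transcript = response.get("transcript", "")
    if response.get("ingested") is True:
        return f"Stored: {transcript}"
    # The transcript is still returned when storage fails, so surface it
    # to let the user keep the text some other way.
    reason = response.get("message", "no message")
    return f"NOT stored ({reason}): {transcript}"
```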

Mode 4: Confirm Flow

When the AI orchestrator prepares an action that requires your explicit approval — such as sending an email or creating a calendar event — it creates a pending task in work-hub and returns the task ID. The confirm flow lets you review that pending action and approve or cancel it.

This two-step process (preview, then act) ensures you always know exactly what will be sent or created before it happens.

Step 1: Get a Preview

Fetch a human-readable summary of the pending action. The gateway formats it specifically for TTS readback so you can hear it aloud.

GET https://voice.haiven.site/confirm/pending/{task_id}

curl -s https://voice.haiven.site/confirm/pending/abc-123 | jq .

{
  "task_id": "abc-123",
  "artifact_type": "email",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: Here is the summary you requested.",
  "status": "pending_review"
}

{
  "task_id": "abc-123",
  "artifact_type": "calendar",
  "tts_preview": "Calendar event: Team sync at 2026-03-02T14:00:00 with alice@example.com, bob@example.com.",
  "status": "pending_review"
}

The tts_preview field is what the voice interface will read aloud. For emails it includes the recipient, subject, and the first two sentences of the body. For calendar events it includes the title, time, and up to three attendees.
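To make the truncation rules concrete, here is a rough sketch of how a preview string of this shape could be assembled. It is not the gateway's implementation: the function names, the input parameters, and the sentence-splitting rule (split after `.`, `!`, or `?` followed by whitespace) are all assumptions chosen to reproduce the example strings above.

```python
import re


def first_sentences(text: str, limit: int = 2) -> str:
    """Return the first `limit` sentences, splitting after ., ! or ?."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:limit])


def email_preview(to: str, subject: str, body: str) -> str:
    """Recipient, subject, and the first two sentences of the body."""
    return f"Email to {to}. Subject: {subject}. Preview: {first_sentences(body)}"


def calendar_preview(title: str, start: str, attendees: list[str]) -> str:
    """Title, start time, and at most three attendees."""
    shown = ", ".join(attendees[:3])
    if not shown:
        return f"Calendar event: {title} at {start}."
    return f"Calendar event: {title} at {start} with {shown}."
```

Keeping the preview short matters more for speech than for text, since a listener cannot skim a long readback.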

Step 2: Confirm or Cancel

POST https://voice.haiven.site/confirm/action
Content-Type: application/json

{
  "task_id": "abc-123",
  "action": "confirm"
}

Set action to "confirm" to approve the task or "cancel" to discard it.

curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "confirm"}' | jq .

{
  "status": "executed",
  "action": "confirm",
  "detail": "Email sent. Message ID: msg_xyz",
  "tts_preview": "Email to alice@example.com. Subject: Q2 report. Preview: ..."
}

curl -s -X POST https://voice.haiven.site/confirm/action \
  -H "Content-Type: application/json" \
  -d '{"task_id": "abc-123", "action": "cancel"}' | jq .

{
  "status": "cancelled",
  "action": "cancel",
  "detail": "Email cancelled.",
  "tts_preview": ""
}

Pending type	Action taken
Email draft	Sends the email via work-hub
Calendar event	Creates the event via work-hub
Generic draft	Marks as approved (no external action)

Important: Each task can only be confirmed or cancelled once. If you try to act on a task that's already been executed or cancelled, you'll get a 409 error.

Checking Service Status

curl -s https://voice.haiven.site/health | jq .

{
  "status": "healthy",
  "stt": "up",
  "tts": "up",
  "orchestrator": "up"
}

If you see "status": "degraded", one of the upstream AI services (transcription, orchestrator, or TTS) is temporarily unavailable. The individual fields (stt, tts, orchestrator) identify which one. Try again in a minute, or contact your Haiven administrator.

Note: the health check does not include work-hub. If confirm flow endpoints return errors but /health shows healthy, work-hub may be the issue.
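A script that polls the health endpoint can reduce the response to the list of affected upstreams. This is a minimal sketch, assuming the response has the four fields shown above and that an unavailable service reports any value other than "up" (the guide only documents the "up" value). The function names are illustrative.

```python
UPSTREAMS = ("stt", "tts", "orchestrator")


def down_services(health: dict) -> list[str]:
    """List the upstream services the health response does not report as up."""
    return [name for name in UPSTREAMS if health.get(name) != "up"]


def summarize_health(health: dict) -> str:
    """Return a one-line summary suitable for logs or a status prompt."""
    down = down_services(health)
    if health.get("status") == "healthy" and not down:
        # work-hub is outside this check, so confirm flow errors can still occur.
        return "All upstream services are up (work-hub is not covered)."
    return "Degraded: " + (", ".join(down) or "no upstream identified")
```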

Audio Format Notes

The gateway accepts any audio format that haiven-transcribe can handle. Supported formats include WAV, MP3, FLAC, OGG, and M4A.

Recommended recording settings for best transcription accuracy:
- Sample rate: 16 kHz or higher
- Channels: Mono (stereo works but adds no accuracy benefit)
- Bit depth: 16-bit PCM (for WAV)

If transcription results are poor, try recording at a higher sample rate or with less background noise.
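Before uploading, a WAV recording can be checked against the recommended settings with Python's standard `wave` module. The sketch below is a local convenience check, not something the gateway requires; the function name `check_wav` and the warning wording are illustrative. Note that `wave` reads uncompressed PCM WAV files only, so other formats need a different tool.

```python
import wave


def check_wav(source) -> list[str]:
    """Return warnings for a WAV file given as a path or binary file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width_bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    warnings = []
    if rate < 16000:
        warnings.append(f"sample rate {rate} Hz is below 16 kHz")
    if channels != 1:
        warnings.append(f"{channels} channels; mono is sufficient")
    if width_bits != 16:
        warnings.append(f"{width_bits}-bit samples; 16-bit PCM is recommended")
    return warnings
```

An empty list means the file matches all three recommendations.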

Building a Simple Client

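import httpx

# These calls assume the Haiven internal CA is already trusted on this machine
# and that an authenticated session is available (see Getting Access).
BASE = "https://voice.haiven.site"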


def ask_voice(audio_path: str, session_id: str | None = None) -> bytes:
    """Send an audio file and return the WAV response bytes."""
    with open(audio_path, "rb") as f:
        data = {"session_id": session_id} if session_id else {}
        r = httpx.post(
            f"{BASE}/voice",
            files={"file": f},
            data=data,
            timeout=30.0,
        )
        r.raise_for_status()
        print("Intent:", r.headers.get("X-Intent"))
        print("Total latency:", r.headers.get("X-Total-Latency-Ms"), "ms")
        return r.content


def ask_text(text: str, session_id: str | None = None) -> bytes:
    """Send text and return the WAV response bytes."""
    payload = {"text": text}
    if session_id:
        payload["session_id"] = session_id
    r = httpx.post(f"{BASE}/voice/text", json=payload, timeout=30.0)
    r.raise_for_status()
    return r.content


def capture_note(audio_path: str) -> dict:
    """Capture a voice note and return the JSON confirmation."""
    with open(audio_path, "rb") as f:
        r = httpx.post(f"{BASE}/voice/note", files={"file": f}, timeout=30.0)
        r.raise_for_status()
        return r.json()


def confirm_preview(task_id: str) -> dict:
    """Fetch a TTS-friendly preview of a pending action."""
    r = httpx.get(f"{BASE}/confirm/pending/{task_id}", timeout=10.0)
    r.raise_for_status()
    return r.json()


def confirm_action(task_id: str, action: str = "confirm") -> dict:
    """Confirm or cancel a pending action. action must be 'confirm' or 'cancel'."""
    r = httpx.post(
        f"{BASE}/confirm/action",
        json={"task_id": task_id, "action": action},
        timeout=30.0,
    )
    r.raise_for_status()
    return r.json()


# Example usage (guarded so that importing this module sends no requests)
if __name__ == "__main__":
    wav_bytes = ask_text("What are my tasks today?")
    with open("reply.wav", "wb") as f:
        f.write(wav_bytes)

    note = capture_note("meeting_note.wav")
    print(note["transcript"])

    # Confirm flow: preview the pending task, then approve it
    preview = confirm_preview("abc-123")
    print("Preview:", preview["tts_preview"])
    result = confirm_action("abc-123", action="confirm")
    print("Result:", result["detail"])

Frequently Asked Questions

How fast is the response?

On a warm system (all services running with models loaded), expect around 2–3 seconds end-to-end for a typical short question. The speech recognition stage takes roughly 800 ms, the AI orchestrator around 1.5 seconds, and TTS synthesis around 120 ms. Longer inputs or complex queries may take more time.

Is my audio stored?

No. Audio bytes are cleared from memory immediately after transcription. The transcript (the text version of what you said) may be passed to the orchestrator and logged for tracing purposes, consistent with how all other Haiven AI interactions are handled.

Which languages are supported?

Language support depends on the haiven-transcribe backend. The default configuration uses Whisper Turbo, which supports a broad set of languages. Contact your administrator for specifics about your deployment.

The AI misunderstood my request. What should I do?

Check the X-Intent response header to see what the orchestrator classified your request as. If transcription was poor, try recording in a quieter environment at 16 kHz or higher. If the intent is wrong, rephrase more explicitly.

Can I hold a multi-turn conversation?

Yes. Generate a UUID for your session and pass it as session_id in each request. The orchestrator uses this to maintain context across turns. Sessions do not expire on a fixed timer, but the context window of the underlying LLM eventually limits how far back the AI can "remember."

What should I do if the health check reports "degraded"?

Check which upstream is down (stt, tts, or orchestrator) in the health response. If it does not recover within a minute or two, contact your Haiven administrator.

Why did the confirm flow return a 409 error?

There are two possible causes: the task status is not pending_review (it may have already been executed or cancelled), or a calendar event conflicts with an existing slot. The error detail field explains which.

What does it mean if a confirmed action fails to execute?

work-hub or the downstream service (email, calendar) returned an error, so the action was not executed. Check with your administrator. The task remains in execution_failed state and may be retriable.

Can I undo an action after confirming it?

No. Once an action reaches executed state it cannot be undone through the gateway. For email, contact your email admin. For calendar, delete the event through your calendar client.

haven-voice-gateway — User Guide