MCP Server User Guide

This guide explains how to use the MCP Server tools through Echo (LibreChat) or direct API access.

Getting Started

Access Methods

| Method | URL | Use Case |
|--------|-----|----------|
| Echo Chat | https://echo.haiven.site | Interactive chat with tool access |
| Direct API | https://mcp.haiven.site | Programmatic tool execution |
| External | https://mcp.haiven.site | Remote access (authenticated) |

Quick Start

  1. Open Echo at https://echo.haiven.site
  2. Select an agent with MCP tools enabled
  3. Ask natural language questions like:
    - "What containers are running?"
    - "Show me GPU usage"
    - "What models are available?"

Common Use Cases

1. Monitoring System Status

Check Docker Containers

Ask: "Show me all running containers"
Tool: status/docker

The response includes container name, status, health, uptime, and ports.

Check GPU Usage

Ask: "How much GPU memory is being used?"
Tool: status/gpu

Returns utilization percentage, memory used/total, and temperature for all GPUs:
- RTX PRO 6000 pro6000-charlie (96GB): Multimodal tasks (image gen, audio, ML)
- RTX PRO 6000 pro6000-alpha (96GB): LLM inference (primary), embedded Whisper
- RTX PRO 6000 pro6000-bravo (96GB): LLM inference (overflow)

Check System Resources

Ask: "What's the CPU and memory usage?"
Tool: status/system

Returns CPU%, memory GB, disk usage, and load averages.

2. Working with Models

List Available Models

Ask: "What LLM models can I use?"
Tool: status/models

This tool queries LiteLLM and lists all 33 configured models, including:
- Local GGUF models via llama-swap
- TTS models (Piper voices)
- STT models (Whisper)

Understanding Model Traces

All model calls are logged to Langfuse. To view traces:
1. Visit https://ai-ops.haiven.site
2. Navigate to Traces
3. Filter by service or model name

3. Reading Files and Logs

Read a Configuration File

Ask: "Show me the llama-swap config"
Tool: files/read
Path: /mnt/apps/docker/ai/llama-swap/config.yaml

Search for Files

Ask: "Find all docker-compose.yml files"
Tool: files/search
Pattern: docker-compose.yml

Query Loki Logs

Ask: "Show me errors from the last hour"
Tool: logs/query

4. Docker Management

View Container Logs

Ask: "Show me the last 50 lines of mcp-server logs"
Tool: docker/logs
Container: mcp-server
Lines: 50

Restart a Service

Ask: "Restart the echo container"
Tool: docker/restart
Container: echo

Note: Only allowed containers can be restarted (security feature).

5. Audio Transcription

Transcribe Audio

Ask: "Transcribe this audio file"
Tool: audio/transcribe

Or use the OpenAI-compatible API directly:

curl -X POST https://mcp.haiven.site/v1/audio/transcriptions \
  -F "file=@meeting.mp3" \
  -F "model=whisper-1"

Supported formats: mp3, wav, m4a, flac, ogg, webm
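
A client-side extension check can catch an unsupported file before it is uploaded. The sketch below is illustrative only: the helper name is_supported_audio is hypothetical, and the server may validate files differently (for example, by content rather than by extension).

```python
from pathlib import Path

# Formats listed in this guide as supported for transcription.
SUPPORTED_AUDIO_FORMATS = {"mp3", "wav", "m4a", "flac", "ogg", "webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    # Path.suffix includes the leading dot (".mp3"), so strip it and
    # lowercase the rest to make the check case-insensitive.
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_AUDIO_FORMATS
```

For example, is_supported_audio("meeting.mp3") passes the check, while a file with no extension or a .txt extension does not.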

6. Web Search and Scraping

Search the Web

Ask: "Search for Python async best practices"
Tool: search/web

Scrape a Webpage

Ask: "Get the content from https://example.com"
Tool: scrape/url

Crawl a Website

Ask: "Crawl the documentation site and get all pages"
Tool: scrape/site

v2 Search Capabilities (2026-05-01)

Use source_type instead of memorizing engine names. SearXNG engine CSVs are resolved automatically.

Tool: search/web

{
  "queries": ["attention mechanism transformers", "flash attention benchmarks"],
  "source_type": "academic",
  "max_results": 8
}

Resolves to engines: google scholar,arxiv,pubmed,crossref,semantic scholar. Prefer this over setting engines manually — if both are passed, the response includes a diagnostics.warning.

Other presets:
| source_type | Engines used |
|---------------|-------------|
| general | SearXNG default mix |
| academic | google scholar, arxiv, pubmed, crossref, semantic scholar |
| news | google news, bing news, reuters, yahoo news |
| code | github, gitlab, stackoverflow, sourcehut |
| social | reddit, mastodon, lemmy |
| primary | wikipedia, wikidata, wikiquote, arxiv |
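
When building search/web payloads in a script, the presets can be mirrored locally to reject an unknown source_type before the request is sent. This is a minimal sketch: the dictionary copies the table above, while the helper name build_search_payload and the client-side validation are assumptions, not part of the server.

```python
# Engine presets copied from the table in this guide. "general" uses the
# SearXNG default mix, so it has no explicit engine list here.
SOURCE_TYPE_ENGINES = {
    "general": None,
    "academic": ["google scholar", "arxiv", "pubmed", "crossref", "semantic scholar"],
    "news": ["google news", "bing news", "reuters", "yahoo news"],
    "code": ["github", "gitlab", "stackoverflow", "sourcehut"],
    "social": ["reddit", "mastodon", "lemmy"],
    "primary": ["wikipedia", "wikidata", "wikiquote", "arxiv"],
}


def build_search_payload(queries, source_type, max_results):
    """Build a search/web payload that relies on source_type only."""
    if source_type not in SOURCE_TYPE_ENGINES:
        raise ValueError(f"unknown source_type: {source_type!r}")
    # No "engines" key is set, which avoids the diagnostics.warning that
    # the guide says is returned when both fields are passed.
    return {
        "queries": list(queries),
        "source_type": source_type,
        "max_results": max_results,
    }
```

Leaving engines out of the payload keeps the preset as the single source of truth for engine selection.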


Snippet Quality Signal

When snippets might be too short to trust, pass snippet_min_chars. Results below the threshold get snippet_too_short: true — your signal to call scrape/url for the full body.

Tool: search/web

{
  "query": "vLLM speculative decoding throughput 2025",
  "snippet_min_chars": 120
}

Example result entry:

{
  "title": "Speculative Decoding in vLLM",
  "url": "https://blog.vllm.ai/2025/...",
  "snippet": "New draft model…",
  "snippet_length": 18,
  "snippet_too_short": true
}

snippet_too_short is absent when the snippet meets or exceeds the threshold. snippet_length is always present.
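
Because the flag is absent rather than false on good snippets, client code should read it with a default. The sketch below assumes the results have already been parsed into a list of dictionaries shaped like the example entry; the function name is hypothetical.

```python
def urls_needing_full_fetch(results):
    """Return the URLs whose snippets were flagged as too short."""
    # snippet_too_short is missing (not False) when the snippet meets the
    # threshold, so .get() with a False default covers both cases.
    return [r["url"] for r in results if r.get("snippet_too_short", False)]
```

Each returned URL is a candidate for a follow-up scrape/url call.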


Reading engine_status to Diagnose Failures

When a search returns few or no results, engine_status tells you whether engines actually responded.

Tool: search/web

{
  "query": "some very obscure technical term",
  "source_type": "academic"
}

Example response fields:

{
  "results": [],
  "engine_status": {
    "arxiv": {"status": "ok", "result_count": 0},
    "pubmed": {"status": "unresponsive", "reason": "timeout"}
  },
  "engines_failed": ["pubmed"]
}
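
The example response can be read programmatically to separate "nothing matched" from "some engines never answered". This sketch assumes the response has the fields shown above; the helper names are hypothetical.

```python
def summarize_engines(response):
    """Split engines into those that answered and those that failed."""
    status = response.get("engine_status", {})
    responded = sorted(
        name for name, info in status.items() if info.get("status") == "ok"
    )
    failed = sorted(response.get("engines_failed", []))
    return responded, failed


def empty_result_is_conclusive(response):
    """True only when there are no results and no engine failed."""
    responded, failed = summarize_engines(response)
    # An empty result list is only meaningful if at least one engine
    # responded and none of the requested engines dropped out.
    return not response.get("results") and bool(responded) and not failed
```

In the example above, pubmed timed out, so the empty result is not conclusive and the search is worth repeating.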

Full-Body Extraction with search/and_fetch

Pass body_max_chars=0 to disable per-article truncation and return the full markdown body up to 50,000 chars. Default is 2000.

Tool: search/and_fetch

{
  "query": "Claude 4 reasoning capabilities",
  "fetch_top_n": 3,
  "body_max_chars": 0
}

Why use it: when you need the complete article text rather than a 2000-char summary — full changelogs, long papers, or detailed technical docs.
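
The body_max_chars rules for search/and_fetch can be summarized as a small function. This is a sketch of the behavior described above, not the server's code; whether values above 50,000 are capped is not stated in this guide, so the sketch passes them through unchanged.

```python
DEFAULT_BODY_MAX_CHARS = 2000  # default per-article limit, per this guide
FULL_BODY_CAP = 50_000         # upper bound when truncation is disabled


def effective_body_limit(body_max_chars=DEFAULT_BODY_MAX_CHARS):
    """Return the per-article character limit implied by body_max_chars."""
    if body_max_chars == 0:
        # 0 disables truncation, but the full body is still bounded.
        return FULL_BODY_CAP
    return body_max_chars
```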


Batch Multi-Query with Diversity (suppressed_global)

When passing queries, results from all sub-queries are interleaved before the top-N pick to prevent any single domain from dominating.

Tool: search/and_fetch

{
  "queries": [
    "Qwen3 quantization techniques",
    "Qwen3 inference benchmark results",
    "Qwen3 deployment vLLM production"
  ],
  "fetch_top_n": 5,
  "max_per_domain": 2,
  "source_type": "academic"
}

suppressed_global in the response lists URLs that were dropped by the cross-query dedup/per-domain-cap pass — useful for debugging why a particular source didn't appear in the final results.
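
To build intuition for why a URL ends up suppressed, the sketch below shows one plausible way to interleave per-query results and apply a per-domain cap. It is an illustration under assumptions, not the server's algorithm: it merges round-robin by rank, skips exact duplicates silently, and records only per-domain-cap drops in the suppressed list.

```python
from itertools import zip_longest
from urllib.parse import urlparse


def interleave_and_cap(per_query_urls, top_n, max_per_domain):
    """Round-robin merge ranked URL lists, then cap URLs per domain.

    Returns (kept, suppressed).
    """
    kept, suppressed = [], []
    seen = set()
    per_domain = {}
    # zip_longest yields one "round" per rank: the first hit of every
    # sub-query, then the second hit of every sub-query, and so on.
    for rank_round in zip_longest(*per_query_urls):
        for url in rank_round:
            if url is None or url in seen:
                continue  # padding from a shorter list, or a duplicate
            seen.add(url)
            domain = urlparse(url).netloc
            if per_domain.get(domain, 0) >= max_per_domain:
                suppressed.append(url)
                continue
            per_domain[domain] = per_domain.get(domain, 0) + 1
            kept.append(url)
    return kept[:top_n], suppressed
```

With max_per_domain set to 2, a third URL from the same domain lands in the suppressed list even if it ranked well for its own sub-query.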


Scrape Batch with Full Page Bodies

Tool: scrape/batch

{
  "seed_url": "https://docs.vllm.ai/",
  "max_depth": 2,
  "max_pages": 10,
  "body_max_chars": 0
}

body_max_chars=0 removes the 2000-char-per-page truncation that was previously hardcoded. For large crawls, set a finite limit (e.g. 8000) to stay within the byte budget.
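
A rough upper bound on crawl output helps when choosing a limit. The sketch below multiplies pages by the per-page limit; it counts characters rather than bytes, and it returns None for body_max_chars=0 because this guide does not state a per-page bound for scrape/batch in that mode. The function name is hypothetical.

```python
def worst_case_chars(max_pages, body_max_chars):
    """Upper bound on total body characters for a scrape/batch call."""
    if body_max_chars == 0:
        # Truncation disabled: no per-page bound is documented here.
        return None
    return max_pages * body_max_chars
```

For the example request above with a limit of 8000, the bound is 10 pages times 8000 characters, or 80,000 characters.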


Understanding Scrape Error Categories

All scrape tools (scrape/url, scrape/batch, scrape/site, scrape/sitemap) now return a uniform error envelope. Check error_category to decide whether to retry:

| error_category | Meaning | Action |
|----------------|---------|--------|
| transient | Timeout or connection failure | Worth retrying |
| client | Bad URL or invalid params | Fix the request, don't retry |
| upstream | Crawl4AI or remote page returned empty/error | Retry once; if persistent, try different URL |
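
The retry guidance above maps onto a small decision function. This is a sketch: the guide gives no retry count for transient errors, so the transient_limit default of 3 is an assumption, and the function name is hypothetical.

```python
def should_retry(error_category, retries_so_far, transient_limit=3):
    """Decide whether a failed scrape call is worth another attempt."""
    if error_category == "transient":
        # Timeouts and connection failures: retry up to an assumed limit.
        return retries_so_far < transient_limit
    if error_category == "upstream":
        # Retry once; after that, try a different URL instead.
        return retries_so_far < 1
    # "client" errors (and anything unrecognized) need a fixed request.
    return False
```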

7. Code Execution

Run a Bash Command

Ask: "List files in /mnt/apps/docker"
Tool: sandbox/bash
Command: ls -la /mnt/apps/docker

Run Python Code

Ask: "Calculate the sum of 1 to 100"
Tool: sandbox/python
Code: print(sum(range(1, 101)))

Note: Code runs in isolated sandboxes with no network access.

Tutorials

Tutorial 1: Infrastructure Health Check

Goal: Get a complete overview of system health

  1. Check containers: "Show me all container statuses"
  2. Check GPUs: "What's the GPU memory usage?"
  3. Check resources: "Show detailed system metrics"
  4. Check alerts: "Are there any active alerts?"
  5. Check uptime: "What services are down in Uptime Kuma?"

Tutorial 2: Debugging a Service

Goal: Troubleshoot a misbehaving service

  1. Check status: "Is the echo container healthy?"
  2. View logs: "Show me the last 100 lines of echo logs"
  3. Check resources: "What's the memory usage?"
  4. Search for errors: "Search echo logs for 'error' in the last hour"
  5. Restart if needed: "Restart the echo container"

Tutorial 3: Model Exploration

Goal: Understand available models and their usage

  1. List models: "What models are available?"
  2. Check traces: Visit https://ai-ops.haiven.site
  3. View costs: Check the Langfuse dashboard for token usage
  4. Compare latency: Filter traces by model to compare response times

Tutorial 4: Audio Processing

Goal: Transcribe a voice recording

  1. Upload audio file to Echo or use API
  2. Transcribe: "Transcribe this audio file"
  3. Review output: Check the transcription text
  4. Translate if needed: Use the /v1/audio/translations endpoint for non-English audio

Tips and Best Practices

Natural Language Works Best

Instead of trying to specify exact tool parameters, describe what you want:

| Less Effective | More Effective |
|----------------|----------------|
| "Run status/docker with show_stopped=true" | "Show me all containers including stopped ones" |
| "Execute files/read path=/mnt/apps/docker/traefik/traefik.yml" | "Show me the Traefik configuration" |

Use Specific Questions for Better Results

| Vague | Specific |
|-------|----------|
| "Check the system" | "What's the CPU usage and available memory?" |
| "Look at logs" | "Show me errors in the llama-swap logs from the last hour" |

Combine Tools for Complex Tasks

The AI can chain multiple tools together:
- "Check if llama-swap is running, show its logs if there are errors, and restart it if needed"
- "Find all services using the PRO 6000 and show their memory usage"

Security Awareness

Observability with Langfuse

Viewing Your Traces

  1. Open https://ai-ops.haiven.site
  2. Log in with your credentials
  3. Navigate to Traces in the sidebar
  4. Filter by:
    - Time range
    - Model name
    - User ID
    - Tags

Understanding Trace Data

Each trace shows:
- Input: The prompt sent to the model
- Output: The model's response
- Tokens: Input/output token counts
- Latency: Response time in milliseconds
- Cost: Estimated cost (if configured)
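
If trace data is exported for offline analysis, latency can be compared per model with a few lines of code. The sketch below assumes each trace is a dictionary with hypothetical keys model and latency_ms; the actual Langfuse export format may use different field names.

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_model(traces):
    """Average latency in milliseconds, grouped by model name."""
    grouped = defaultdict(list)
    for trace in traces:
        # "model" and "latency_ms" are assumed field names for this sketch.
        grouped[trace["model"]].append(trace["latency_ms"])
    return {model: mean(values) for model, values in grouped.items()}
```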

Debugging with Traces

When something goes wrong:
1. Find the relevant trace by timestamp
2. Check the input for issues
3. Review the output for errors
4. Look at the latency for timeout issues
5. Check parent spans for context

Troubleshooting

Service Won't Start - NVIDIA GPU Driver Not Loaded

Symptoms:
- mcp-server container exits with code 128
- Docker error: failed to initialize NVML: Driver Not Loaded
- nvidia-smi fails on the host

Root Cause:
mcp-server requires GPU access for audio transcription and other AI tasks. If the NVIDIA kernel modules aren't loaded (e.g., after a kernel upgrade without matching module rebuild), the container cannot initialize GPU access.

How to Verify:

nvidia-smi  # Should show GPU info; failure means driver not loaded
uname -r    # Check current kernel version
docker logs mcp-server 2>&1 | head -20  # Check for NVML errors

Fix:

# Install NVIDIA modules for current kernel and reboot
sudo apt update
sudo apt install linux-modules-nvidia-580-open-$(uname -r)
sudo reboot

# After reboot, verify and restart
nvidia-smi
cd /mnt/apps/docker/ai/mcp-server && docker compose up -d

Prevention: Install nvidia-dkms-580-open for automatic module rebuilds on kernel upgrades.


"Tool not found"

"Permission denied"

"Connection refused"

"Timeout"

Audio transcription fails

Quick Reference

Tool Namespaces

| Namespace | Tools |
|-----------|-------|
| status | docker, gpu, models, system |
| docker | restart, logs |
| files | read, list, search |
| sandbox | bash, python |
| search | web, and_fetch |
| scrape | url, batch, site, sitemap |
| logs | query |
| metrics | query |
| alerts | list |
| audio | transcribe, speak |
| uptime | status |
| image | generate |

API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /health | GET | Health check |
| /tools | GET | List tools |
| /mcp | POST | Execute tool |
| /metrics | GET | Prometheus metrics |
| /v1/audio/transcriptions | POST | STT |
| /v1/audio/translations | POST | Translation |
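
For programmatic access, a request to /mcp can be prepared with the Python standard library. This guide does not specify the request body, so the sketch assumes the endpoint accepts a standard MCP JSON-RPC tools/call message; the helper name is hypothetical, authentication is not shown, and the real format should be confirmed against /tools or the API docs. The function only builds the request and does not send it.

```python
import json
import urllib.request

MCP_BASE_URL = "https://mcp.haiven.site"


def build_tool_request(tool, arguments, base_url=MCP_BASE_URL, request_id=1):
    """Build (but do not send) a POST request that calls one tool."""
    # ASSUMPTION: body follows the MCP JSON-RPC "tools/call" shape.
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return urllib.request.Request(
        url=f"{base_url}/mcp",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen once the body format and any required credentials have been confirmed.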

Environment Quick Reference

| What | Where |
|------|-------|
| MCP Server | https://mcp.haiven.site |
| Echo Chat | https://echo.haiven.site |
| Langfuse | https://ai-ops.haiven.site |
| LiteLLM | https://llm.haiven.site |
| Grafana | https://grafana.haiven.site |

Getting Help

  1. Check the README: /mnt/apps/docker/ai/mcp-server/README.md
  2. View API docs: https://docs.haiven.site
  3. Check service status: Use status/docker tool
  4. View logs: Use docker/logs tool for mcp-server