This guide explains how to use the MCP Server tools through Echo (LibreChat) or direct API access.
| Method | URL | Use Case |
|---|---|---|
| Echo Chat | https://echo.haiven.site | Interactive chat with tool access |
| Direct API | https://mcp.haiven.site | Programmatic tool execution |
| External | https://mcp.haiven.site | Remote access (authenticated) |
https://echo.haiven.site
Check Docker Containers
Ask: "Show me all running containers"
Tool: status/docker
The response includes container name, status, health, uptime, and ports.
Check GPU Usage
Ask: "How much GPU memory is being used?"
Tool: status/gpu
Returns utilization percentage, memory used/total, and temperature for all GPUs:
- RTX PRO 6000 pro6000-charlie (96GB): Multimodal tasks (image gen, audio, ML)
- RTX PRO 6000 pro6000-alpha (96GB): LLM inference (primary), embedded Whisper
- RTX PRO 6000 pro6000-bravo (96GB): LLM inference (overflow)
Check System Resources
Ask: "What's the CPU and memory usage?"
Tool: status/system
Returns CPU%, memory GB, disk usage, and load averages.
List Available Models
Ask: "What LLM models can I use?"
Tool: status/models
This tool now queries LiteLLM, which exposes all 33 configured models, including:
- Local GGUF models via llama-swap
- TTS models (Piper voices)
- STT models (Whisper)
Understanding Model Traces
All model calls are logged to Langfuse. To view traces:
1. Visit https://ai-ops.haiven.site
2. Navigate to Traces
3. Filter by service or model name
Read a Configuration File
Ask: "Show me the llama-swap config"
Tool: files/read
Path: /mnt/apps/docker/ai/llama-swap/config.yaml
Search for Files
Ask: "Find all docker-compose.yml files"
Tool: files/search
Pattern: docker-compose.yml
Query Loki Logs
Ask: "Show me errors from the last hour"
Tool: logs/query
View Container Logs
Ask: "Show me the last 50 lines of mcp-server logs"
Tool: docker/logs
Container: mcp-server
Lines: 50
Restart a Service
Ask: "Restart the echo container"
Tool: docker/restart
Container: echo
Note: Only allowed containers can be restarted (security feature).
Transcribe Audio
Ask: "Transcribe this audio file"
Tool: audio/transcribe
Or use the OpenAI-compatible API directly:
curl -X POST https://mcp.haiven.site/v1/audio/transcriptions \
-F "file=@meeting.mp3" \
-F "model=whisper-1"
Supported formats: mp3, wav, m4a, flac, ogg, webm
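Before uploading, a client can check the file extension against the supported formats listed above. The following is a minimal sketch: the helper name `is_supported_audio` is hypothetical, and it inspects only the extension, which assumes the extension matches the actual encoding.

```python
from pathlib import Path

# Formats listed as supported for transcription in this guide.
SUPPORTED_AUDIO_FORMATS = {"mp3", "wav", "m4a", "flac", "ogg", "webm"}


def is_supported_audio(filename):
    """Return True if the file extension is one of the supported formats.

    Hypothetical client-side helper: it looks at the extension only,
    not at the file contents.
    """
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_AUDIO_FORMATS
```

A check like this only avoids an obviously wrong upload; the server remains the authority on what it accepts.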
Search the Web
Ask: "Search for Python async best practices"
Tool: search/web
Scrape a Webpage
Ask: "Get the content from https://example.com"
Tool: scrape/url
Crawl a Website
Ask: "Crawl the documentation site and get all pages"
Tool: scrape/site
Use source_type instead of memorizing engine names. SearXNG engine CSVs are resolved automatically.
Tool: search/web
{
"queries": ["attention mechanism transformers", "flash attention benchmarks"],
"source_type": "academic",
"max_results": 8
}
Resolves to engines: google scholar,arxiv,pubmed,crossref,semantic scholar. Prefer this over setting engines manually — if both are passed, the response includes a diagnostics.warning.
Other presets:
| source_type | Engines used |
|---------------|-------------|
| general | SearXNG default mix |
| academic | google scholar, arxiv, pubmed, crossref, semantic scholar |
| news | google news, bing news, reuters, yahoo news |
| code | github, gitlab, stackoverflow, sourcehut |
| social | reddit, mastodon, lemmy |
| primary | wikipedia, wikidata, wikiquote, arxiv |
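To make the preset mapping concrete, the table can be mirrored as a small client-side lookup. This is an illustrative sketch only: the names `SOURCE_TYPE_ENGINES` and `engines_csv` are hypothetical, and the real resolution happens inside the tool, possibly differently.

```python
# Preset table copied from this guide. "general" uses the SearXNG default
# mix, so it has no explicit engine list here.
SOURCE_TYPE_ENGINES = {
    "academic": ["google scholar", "arxiv", "pubmed", "crossref", "semantic scholar"],
    "news": ["google news", "bing news", "reuters", "yahoo news"],
    "code": ["github", "gitlab", "stackoverflow", "sourcehut"],
    "social": ["reddit", "mastodon", "lemmy"],
    "primary": ["wikipedia", "wikidata", "wikiquote", "arxiv"],
}


def engines_csv(source_type):
    """Return the comma-separated engine list for a preset.

    Returns None for "general" (SearXNG default mix) and raises KeyError
    for a preset that is not in the table.
    """
    if source_type == "general":
        return None
    return ",".join(SOURCE_TYPE_ENGINES[source_type])
```

For example, `engines_csv("academic")` yields the same engine CSV that the guide shows for the academic preset.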
When snippets might be too short to trust, pass snippet_min_chars. Results below the threshold get snippet_too_short: true — your signal to call scrape/url for the full body.
Tool: search/web
{
"query": "vLLM speculative decoding throughput 2025",
"snippet_min_chars": 120
}
Example result entry:
{
"title": "Speculative Decoding in vLLM",
"url": "https://blog.vllm.ai/2025/...",
"snippet": "New draft model…",
"snippet_length": 18,
"snippet_too_short": true
}
snippet_too_short is absent when the snippet meets or exceeds the threshold. snippet_length is always present.
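A caller can use the flag to decide which results deserve a follow-up scrape/url call. The sketch below assumes results arrive as a list of dictionaries shaped like the example entry; the function name `urls_needing_full_fetch` is hypothetical.

```python
def urls_needing_full_fetch(results):
    """Return the URLs of results flagged with snippet_too_short.

    The flag is absent when the snippet meets the threshold, so a
    missing key is treated the same as False.
    """
    return [r["url"] for r in results if r.get("snippet_too_short", False)]
```

Each returned URL is a candidate for scrape/url; results without the flag can be used as they are.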
engine_status to Diagnose Failures
When a search returns few or no results, engine_status tells you whether engines actually responded.
Tool: search/web
{
"query": "some very obscure technical term",
"source_type": "academic"
}
Example response fields:
{
"results": [],
"engine_status": {
"arxiv": {"status": "ok", "result_count": 0},
"pubmed": {"status": "unresponsive", "reason": "timeout"}
},
"engines_failed": ["pubmed"]
}
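A client can turn a response like this into a decision. The sketch below treats an empty engines_failed list as a real zero-hit result and a populated one as a partial outage; the function name and the returned labels are hypothetical.

```python
def diagnose_search(response):
    """Classify a search/web response using results and engines_failed.

    Returns one of three illustrative labels:
      "has_results"    - at least one result came back
      "partial_outage" - no results and at least one engine failed
      "no_hits"        - no results and every engine responded
    """
    if response.get("results"):
        return "has_results"
    if response.get("engines_failed"):
        return "partial_outage"
    return "no_hits"
```

Applied to the example response, this returns "partial_outage" because pubmed is listed in engines_failed.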
- engines_failed is empty → real zero hits, the query returned nothing.
- engines_failed is populated → partial outage; retry or broaden query before concluding no results exist.
search/and_fetch
Pass body_max_chars=0 to disable per-article truncation and return the full markdown body up to 50,000 chars. Default is 2000.
Tool: search/and_fetch
{
"query": "Claude 4 reasoning capabilities",
"fetch_top_n": 3,
"body_max_chars": 0
}
Why use it: when you need the complete article text rather than a 2000-char summary — full changelogs, long papers, or detailed technical docs.
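The truncation rule can be pictured as follows. This is a sketch of the semantics described in this guide, not the server's code: `truncate_body` is a hypothetical name, and capping finite values at 50,000 is an assumption, since the guide only states that cap for body_max_chars=0.

```python
DEFAULT_BODY_MAX_CHARS = 2000  # default stated in this guide
FULL_BODY_CAP = 50_000         # upper bound stated for body_max_chars=0


def truncate_body(body, body_max_chars=DEFAULT_BODY_MAX_CHARS):
    """Apply the per-article limit to a markdown body.

    0 means "no per-article truncation", bounded by FULL_BODY_CAP.
    Capping other values at FULL_BODY_CAP is an assumption.
    """
    if body_max_chars < 0:
        raise ValueError("body_max_chars must be >= 0")
    limit = FULL_BODY_CAP if body_max_chars == 0 else min(body_max_chars, FULL_BODY_CAP)
    return body[:limit]
```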
suppressed_global
When passing queries, results from all sub-queries are interleaved before the top-N pick to prevent any single domain from dominating.
Tool: search/and_fetch
{
"queries": [
"Qwen3 quantization techniques",
"Qwen3 inference benchmark results",
"Qwen3 deployment vLLM production"
],
"fetch_top_n": 5,
"max_per_domain": 2,
"source_type": "academic"
}
suppressed_global in the response lists URLs that were dropped by the cross-query dedup/per-domain-cap pass — useful for debugging why a particular source didn't appear in the final results.
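One way to picture the interleave and per-domain cap is a round-robin merge. The sketch below is an illustration under assumptions, not the tool's actual algorithm: `interleave_and_cap` is a hypothetical name, duplicates across sub-queries are skipped silently, and only URLs dropped by the domain cap are reported as suppressed.

```python
from itertools import zip_longest
from urllib.parse import urlparse


def interleave_and_cap(per_query_urls, top_n, max_per_domain):
    """Round-robin merge ranked URL lists, then cap per domain.

    per_query_urls: one ranked list of URLs per sub-query.
    Returns (kept, suppressed): kept holds at most top_n URLs,
    suppressed holds URLs dropped by the per-domain cap.
    """
    kept, suppressed = [], []
    seen = set()
    per_domain = {}
    # Take rank 1 from every sub-query, then rank 2, and so on.
    for same_rank in zip_longest(*per_query_urls):
        for url in same_rank:
            if url is None or url in seen:
                continue  # exhausted list or cross-query duplicate
            seen.add(url)
            domain = urlparse(url).netloc
            if per_domain.get(domain, 0) >= max_per_domain:
                suppressed.append(url)
                continue
            per_domain[domain] = per_domain.get(domain, 0) + 1
            kept.append(url)
    return kept[:top_n], suppressed
```

With three sub-queries where one domain supplies most hits, the cap pushes that domain's extra URLs into the suppressed list while the other domains keep their slots.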
Tool: scrape/batch
{
"seed_url": "https://docs.vllm.ai/",
"max_depth": 2,
"max_pages": 10,
"body_max_chars": 0
}
body_max_chars=0 removes the 2000-char-per-page truncation that was previously hardcoded. For large crawls, set a finite limit (e.g. 8000) to stay within the byte budget.
All scrape tools (scrape/url, scrape/batch, scrape/site, scrape/sitemap) now return a uniform error envelope. Check error_category to decide whether to retry:
| error_category | Meaning | Action |
|---|---|---|
| transient | Timeout or connection failure | Worth retrying |
| client | Bad URL or invalid params | Fix the request, don't retry |
| upstream | Crawl4AI or remote page returned empty/error | Retry once; if persistent, try different URL |
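The retry guidance in the table can be encoded as a small decision function. This is a sketch: the function name, the returned labels, and the `max_transient_attempts` limit are assumptions added for illustration, since the guide does not give a retry limit for transient errors.

```python
def next_action(error_category, attempts_so_far, max_transient_attempts=3):
    """Map an error_category to a follow-up action.

    attempts_so_far counts attempts already made, including the one
    that just failed. max_transient_attempts is a hypothetical limit.
    """
    if error_category == "transient":
        return "retry" if attempts_so_far < max_transient_attempts else "give_up"
    if error_category == "client":
        return "fix_request"  # retrying the same request will not help
    if error_category == "upstream":
        # Retry once; if the error persists, try a different URL.
        return "retry" if attempts_so_far < 2 else "try_different_url"
    return "unknown_category"
```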
Run a Bash Command
Ask: "List files in /mnt/apps/docker"
Tool: sandbox/bash
Command: ls -la /mnt/apps/docker
Run Python Code
Ask: "Calculate the sum of 1 to 100"
Tool: sandbox/python
Code: print(sum(range(1, 101)))
Note: Code runs in isolated sandboxes with no network access.
Goal: Get a complete overview of system health
Goal: Troubleshoot a misbehaving service
Goal: Understand available models and their usage
https://ai-ops.haiven.site
Goal: Transcribe a voice recording
Instead of trying to specify exact tool parameters, describe what you want:
| Less Effective | More Effective |
|---|---|
| "Run status/docker with show_stopped=true" | "Show me all containers including stopped ones" |
| "Execute files/read path=/mnt/apps/docker/traefik/traefik.yml" | "Show me the Traefik configuration" |
| Vague | Specific |
|---|---|
| "Check the system" | "What's the CPU usage and available memory?" |
| "Look at logs" | "Show me errors in the llama-swap logs from the last hour" |
The AI can chain multiple tools together:
- "Check if llama-swap is running, show its logs if there are errors, and restart it if needed"
- "Find all services using the PRO 6000 and show their memory usage"
https://ai-ops.haiven.site
Each trace shows:
- Input: The prompt sent to the model
- Output: The model's response
- Tokens: Input/output token counts
- Latency: Response time in milliseconds
- Cost: Estimated cost (if configured)
When something goes wrong:
1. Find the relevant trace by timestamp
2. Check the input for issues
3. Review the output for errors
4. Look at the latency for timeout issues
5. Check parent spans for context
Symptoms:
- mcp-server container exits with code 128
- Docker error: failed to initialize NVML: Driver Not Loaded
- nvidia-smi fails on the host
Root Cause:
mcp-server requires GPU access for audio transcription and other AI tasks. If the NVIDIA kernel modules aren't loaded (e.g., after a kernel upgrade without matching module rebuild), the container cannot initialize GPU access.
How to Verify:
nvidia-smi # Should show GPU info; failure means driver not loaded
uname -r # Check current kernel version
docker logs mcp-server 2>&1 | head -20 # Check for NVML errors
Fix:
# Install NVIDIA modules for current kernel and reboot
sudo apt update
sudo apt install linux-modules-nvidia-580-open-$(uname -r)
sudo reboot
# After reboot, verify and restart
nvidia-smi
cd /mnt/apps/docker/ai/mcp-server && docker compose up -d
Prevention: Install nvidia-dkms-580-open for automatic module rebuilds on kernel upgrades.
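The log check in the verification step can be automated by scanning the captured output of docker logs for the NVML message. A minimal sketch, assuming the log text has already been captured as a string; `has_nvml_error` is a hypothetical helper name.

```python
# Error text quoted in the symptoms above.
NVML_ERROR_MARKER = "failed to initialize NVML"


def has_nvml_error(log_text):
    """Return True if the log text contains the NVML initialization error.

    The match is case-insensitive so small formatting differences in
    the log output do not hide the error.
    """
    return NVML_ERROR_MARKER.lower() in log_text.lower()
```

A True result points at the driver problem described here; a False result means the container is failing for some other reason.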
| Namespace | Tools |
|---|---|
| status | docker, gpu, models, system |
| docker | restart, logs |
| files | read, list, search |
| sandbox | bash, python |
| search | web |
| scrape | url, batch, site, sitemap |
| logs | query |
| metrics | query |
| alerts | list |
| audio | transcribe, speak |
| uptime | status |
| image | generate |
| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Health check |
| /tools | GET | List tools |
| /mcp | POST | Execute tool |
| /metrics | GET | Prometheus metrics |
| /v1/audio/transcriptions | POST | STT |
| /v1/audio/translations | POST | Translation |
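For programmatic use, a tool call is a POST to /mcp. The sketch below only builds the request object and does not send it. The JSON payload shape (`tool` and `arguments` keys) is an assumption made for illustration, as this guide does not document the request body, and authentication headers are omitted; check /tools or the server README for the real schema.

```python
import json
import urllib.request

MCP_BASE_URL = "https://mcp.haiven.site"


def build_tool_request(tool, arguments, base_url=MCP_BASE_URL):
    """Build (but do not send) a POST request for the /mcp endpoint.

    ASSUMPTION: the body shape {"tool": ..., "arguments": ...} is
    hypothetical; the real schema may differ.
    """
    payload = json.dumps({"tool": tool, "arguments": arguments}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/mcp",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here because it needs network access and valid credentials.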
| What | Where |
|---|---|
| MCP Server | https://mcp.haiven.site |
| Echo Chat | https://echo.haiven.site |
| Langfuse | https://ai-ops.haiven.site |
| LiteLLM | https://llm.haiven.site |
| Grafana | https://grafana.haiven.site |
- /mnt/apps/docker/ai/mcp-server/README.md
- https://docs.haiven.site
- status/docker tool
- docker/logs tool for mcp-server