litellm-mcp

Internal MCP server that lets Claude Code offload bounded work to the local LiteLLM stack instead of using Claude quota.

What It Exposes

The MCP server currently exposes three tools:

| Tool | Purpose | Default Model |
|------|---------|---------------|
| llm/complete | Free-form text generation for bounded tasks | qwen3.5-35b-a3b |
| llm/json | Structured JSON output with JSON mode | gemma4-26b |
| llm/models | List MCP-surfaced chat models | |

MCP-Surfaced Models

These are the models the current src/litellm-mcp source advertises through tool schemas:

| Model | Context | Best For |
|-------|---------|----------|
| qwen3.5-35b-a3b | 262K | Strong general-purpose completion |
| gemma4-26b | 256K | Structured output, tool calling, multimodal reasoning |
| hermes-4.3-36b | 199K | Creative writing and low-refusal prose |
| glm-4-7-flash | 200K | Fast mechanical work, rewriting, code-adjacent tasks |

Runtime Inventory Note

LiteLLM's live always-on inventory on 2026-04-18 is broader than the MCP tool surface:

| Runtime Model | Port | GPU | Notes |
|---------------|------|-----|-------|
| glm-4-7-flash | 6000 | Alpha | 200K context |
| qwen3.6-35b-a3b | 6003 | Bravo | General-purpose always-on model |
| hermes-4.3-36b | 6004 | Charlie | Creative writing |
| gemma4-26b | 6001 | Delta | Structured output, tool calling, vision |
| gemma4-e4b | 6006 | Delta | Audio + image + video input |
| medgemma-4b | 6005 | Delta | Medical vision |
| qwen3-embedding-4b | 6002 | Alpha | Embeddings only, 8K runtime context |

Two important quirks:

Endpoints

| Endpoint | Description |
|----------|-------------|
| GET /health | Liveness check |
| GET /metrics | Prometheus metrics |
| POST /mcp | MCP JSON-RPC tool execution |

The service is internal-only on port 8769 and is not exposed through Traefik.
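For manual testing from the host, a tool invocation against POST /mcp can be sketched as a JSON-RPC 2.0 `tools/call` request, which is the method the MCP specification defines for running a tool. This is a minimal sketch under assumptions, not the server's documented contract: it assumes the server accepts a single stateless POST without a prior `initialize` handshake, and the helper names `mcp_call_payload` and `mcp_post` are hypothetical. The argument object passed to each tool is also an assumption; check the tool schemas for the real field names.

```shell
#!/usr/bin/env bash
# Sketch only: assumes plain JSON-RPC 2.0 over POST /mcp on the internal port.
MCP_URL="${MCP_URL:-http://localhost:8769/mcp}"

# Build a JSON-RPC 2.0 "tools/call" envelope (hypothetical helper).
#   $1 = tool name, e.g. llm/models
#   $2 = tool arguments as an already-valid JSON object
#   $3 = request id (optional, defaults to 1)
# Note: no escaping is done here, so $1 and $2 must be safe to embed as-is.
mcp_call_payload() {
  printf '{"jsonrpc":"2.0","id":%s,"method":"tools/call","params":{"name":"%s","arguments":%s}}\n' \
    "${3:-1}" "$1" "$2"
}

# POST a payload to the MCP endpoint and print the raw response (hypothetical helper).
# The Accept header lists both types an MCP HTTP server may answer with.
mcp_post() {
  curl -sf -X POST "$MCP_URL" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d "$1"
}
```

Usage, assuming `llm/models` takes an empty argument object: `mcp_post "$(mcp_call_payload llm/models '{}')"`. If the server requires session initialization first, this call will be rejected and an `initialize` request has to precede it.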

Default Behavior

Operations

# Build
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml build

# Start
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml up -d

# Health
curl -sf http://localhost:8769/health

# Logs
docker logs -f litellm-mcp

# Restart
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml restart
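
After a start or restart, the container may need a moment before the health check passes. The loop below is one way to wait for it, reusing the same `curl -sf` health command listed above. The function names `probe` and `wait_for_health`, the default of 30 attempts, and the 2-second delay are illustrative assumptions, not values taken from the service.

```shell
#!/usr/bin/env bash
# Sketch: poll the liveness endpoint until it answers or attempts run out.
HEALTH_URL="${HEALTH_URL:-http://localhost:8769/health}"

# Single liveness probe; succeeds only on a 2xx response.
probe() {
  curl -sf "$HEALTH_URL" >/dev/null
}

# wait_for_health [attempts] [delay_seconds]
# Returns 0 as soon as one probe succeeds, 1 if every attempt fails.
wait_for_health() {
  local attempts="${1:-30}"
  local delay="${2:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if probe; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempt(s)" >&2
  return 1
}
```

A typical use is to source the snippet and chain it after the start command, for example `docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml up -d && wait_for_health`, so that a script fails fast when the service never becomes healthy.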

Source Of Truth