Internal MCP server that lets Claude Code offload bounded work to the local LiteLLM stack instead of using Claude quota.
The MCP server currently exposes three tools:

| Tool | Purpose | Default |
|---|---|---|
| `llm/complete` | Free-form text generation for bounded tasks | `qwen3.5-35b-a3b` |
| `llm/json` | Structured JSON output with JSON mode | `gemma4-26b` |
| `llm/models` | List MCP-surfaced chat models | n/a |

These are the models the current src/litellm-mcp source advertises through tool schemas:

| Model | Context | Best For |
|---|---|---|
| `qwen3.5-35b-a3b` | 262K | Strong general-purpose completion |
| `gemma4-26b` | 256K | Structured output, tool calling, multimodal reasoning |
| `hermes-4.3-36b` | 199K | Creative writing and low-refusal prose |
| `glm-4-7-flash` | 200K | Fast mechanical work, rewriting, code-adjacent tasks |

LiteLLM's live always-on inventory on 2026-04-18 is broader than the MCP tool surface:

| Runtime Model | Port | GPU | Notes |
|---|---|---|---|
| `glm-4-7-flash` | 6000 | Alpha | 200K context |
| `qwen3.6-35b-a3b` | 6003 | Bravo | General-purpose always-on model |
| `hermes-4.3-36b` | 6004 | Charlie | Creative writing |
| `gemma4-26b` | 6001 | Delta | Structured output, tool calling, vision |
| `gemma4-e4b` | 6006 | Delta | Audio + image + video input |
| `medgemma-4b` | 6005 | Delta | Medical vision |
| `qwen3-embedding-4b` | 6002 | Alpha | Embeddings only, 8K runtime context |

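
Comparing the two tables by name shows where the MCP surface and the live inventory diverge. The following is a minimal sketch of that comparison, using the model names listed above as hard-coded input; the helper name `missing_from` is hypothetical and is not part of the MCP server.

```bash
# Print every name from the first list that is absent from the second list.
# Both arguments are newline-separated model names; matching is exact and literal.
missing_from() {
  printf '%s\n' "$1" | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '%s\n' "$2" | grep -Fxq -- "$name" || printf '%s\n' "$name"
  done
}

# Names advertised through the MCP tool schemas (copied from the table above).
mcp_models='qwen3.5-35b-a3b
gemma4-26b
hermes-4.3-36b
glm-4-7-flash'

# Names in the live LiteLLM inventory (copied from the table above).
live_models='glm-4-7-flash
qwen3.6-35b-a3b
hermes-4.3-36b
gemma4-26b
gemma4-e4b
medgemma-4b
qwen3-embedding-4b'

# Lists MCP-advertised names with no exact match in the live inventory.
missing_from "$mcp_models" "$live_models"
```

In practice the two lists would come from the running services rather than from hard-coded strings; how to fetch them is left out here because it depends on the deployment.
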
Two important quirks:
- The MCP source advertises `qwen3.5-35b-a3b`, but the live LiteLLM inventory advertises that backend as `qwen3.6-35b-a3b`.
- `qwen3-embedding-4b` runs with an 8K context: the live vLLM backend and compose config are both set to 8192.

| Endpoint | Description |
|---|---|
| `GET /health` | Liveness check |
| `GET /metrics` | Prometheus metrics |
| `POST /mcp` | MCP JSON-RPC tool execution |

The service is internal-only on port 8769 and is not exposed through Traefik.
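
A tool call against `POST /mcp` can be sketched as a JSON-RPC request. This is an illustration under assumptions, not the server's documented contract: it assumes the server accepts a bare `tools/call` request (the method name defined by the MCP specification) without a prior `initialize` handshake, and the `prompt` and `model` argument names are hypothetical. The helper names `mcp_payload` and `mcp_call` are also made up for this example.

```bash
# Endpoint URL; host and port follow the health check used elsewhere in this README.
MCP_URL="${MCP_URL:-http://localhost:8769/mcp}"

# Build a JSON-RPC 2.0 "tools/call" body.
# $1 = tool name (plain identifier, not escaped), $2 = arguments as valid JSON.
mcp_payload() {
  printf '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"%s","arguments":%s}}' "$1" "$2"
}

# Send the request; curl exits non-zero on HTTP errors because of -f.
mcp_call() {
  curl -sf -X POST "$MCP_URL" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d "$(mcp_payload "$1" "$2")"
}

# Example invocations (not executed here; argument names are assumptions):
#   mcp_call llm/models '{}'
#   mcp_call llm/complete '{"prompt":"Summarize this log line","model":"glm-4-7-flash"}'
```

Keeping payload construction separate from the HTTP call makes the request body easy to inspect before anything is sent.
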
- `llm/complete` defaults to `qwen3.5-35b-a3b` in the current source.
- `llm/json` defaults to `gemma4-26b`.
- […] `llm/complete`.
- […] `llm/json`.
- `llm/models` reports only the models surfaced by the MCP server, not the full LiteLLM inventory.

```bash
# Build
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml build

# Start
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml up -d

# Health
curl -sf http://localhost:8769/health

# Logs
docker logs -f litellm-mcp

# Restart
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml restart
```
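
After a start or restart, the health endpoint may not answer immediately. The sketch below retries the health check a bounded number of times; the helper names `probe_health` and `wait_until_ok`, and the retry counts in the usage comment, are illustrative assumptions rather than part of the service.

```bash
# Probe the health endpoint once; succeeds only on an HTTP success response.
probe_health() {
  curl -sf "${HEALTH_URL:-http://localhost:8769/health}" >/dev/null
}

# Retry a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = delay in seconds, remaining args = command to run.
wait_until_ok() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Pause only when another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (not executed here): up to 30 probes, 2 seconds apart.
#   wait_until_ok 30 2 probe_health && echo "litellm-mcp is healthy"
```

Passing the probe as a command keeps the retry loop reusable for other checks, such as the `/metrics` endpoint.
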

- `/mnt/apps/docker/ai/litellm-observability/litellm/config.yaml`
- `/mnt/apps/src/litellm-mcp/`
- `/mnt/apps/docker/ai/litellm-mcp/openapi.yaml`