litellm-mcp

Internal MCP server that lets Claude Code offload bounded work to the local LiteLLM stack instead of using Claude quota.

What It Exposes

The MCP server currently exposes three tools:

Tool	Purpose	Default
`llm/complete`	Free-form text generation for bounded tasks	`qwen3.5-35b-a3b`
`llm/json`	Structured JSON output with JSON mode	`gemma4-26b`
`llm/models`	List MCP-surfaced chat models	—

MCP-Surfaced Models

These are the models that the current source in `src/litellm-mcp` advertises through its tool schemas:

Model	Context	Best For
`qwen3.5-35b-a3b`	262K	Strong general-purpose completion
`gemma4-26b`	256K	Structured output, tool calling, multimodal reasoning
`hermes-4.3-36b`	199K	Creative writing and low-refusal prose
`glm-4-7-flash`	200K	Fast mechanical work, rewriting, code-adjacent tasks

Runtime Inventory Note

As of 2026-04-18, LiteLLM's live always-on inventory is broader than the MCP tool surface:

Runtime Model	Port	GPU	Notes
`glm-4-7-flash`	6000	Alpha	200K context
`qwen3.6-35b-a3b`	6003	Bravo	General-purpose always-on model
`hermes-4.3-36b`	6004	Charlie	Creative writing
`gemma4-26b`	6001	Delta	Structured output, tool calling, vision
`gemma4-e4b`	6006	Delta	Audio + image + video input
`medgemma-4b`	6005	Delta	Medical vision
`qwen3-embedding-4b`	6002	Alpha	Embeddings only, 8K runtime context

Two important quirks:

The MCP source still names the Bravo chat model `qwen3.5-35b-a3b`, but the live LiteLLM inventory advertises that backend as `qwen3.6-35b-a3b`.
LiteLLM metadata still reports a larger context window for `qwen3-embedding-4b`, but the live vLLM backend and compose config are both set to 8192 tokens (the 8K runtime context listed above).
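
The naming drift in the first quirk can be spotted mechanically by comparing the two tables above. The following is a minimal sketch that hard-codes the names from this README; the function name `name_drift` is made up for illustration and is not part of the MCP source.

```python
# Model names copied from the "MCP-Surfaced Models" table.
MCP_SURFACED = {
    "qwen3.5-35b-a3b",
    "gemma4-26b",
    "hermes-4.3-36b",
    "glm-4-7-flash",
}

# Model names copied from the runtime inventory table (2026-04-18).
RUNTIME = {
    "glm-4-7-flash",
    "qwen3.6-35b-a3b",
    "hermes-4.3-36b",
    "gemma4-26b",
    "gemma4-e4b",
    "medgemma-4b",
    "qwen3-embedding-4b",
}


def name_drift(surfaced, runtime):
    """Compare two sets of model names and report what differs."""
    return {
        # Names the MCP server advertises that the runtime does not know.
        "missing_from_runtime": sorted(surfaced - runtime),
        # Runtime models that the MCP server does not surface.
        "not_surfaced": sorted(runtime - surfaced),
    }


drift = name_drift(MCP_SURFACED, RUNTIME)
print(drift["missing_from_runtime"])
print(drift["not_surfaced"])
```

With the tables as written, the only surfaced name missing from the runtime inventory is `qwen3.5-35b-a3b`, which is the rename described above.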

Endpoints

Endpoint	Description
`GET /health`	Liveness check
`GET /metrics`	Prometheus metrics
`POST /mcp`	MCP JSON-RPC tool execution

The service is internal-only on port 8769 and is not exposed through Traefik.
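
As a rough illustration of what a request to `POST /mcp` might look like, the sketch below builds a JSON-RPC 2.0 `tools/call` message in the shape the MCP specification defines. Several details are assumptions rather than facts about this server: the `localhost` host, the `prompt` argument name (check the tool schema in the MCP source), and whether the server expects an `initialize` handshake or session header before accepting tool calls. The response may also arrive as an event stream rather than plain JSON, so the helper returns the raw body.

```python
import json
import urllib.request

# Port 8769 comes from this README; the host is an assumption.
MCP_URL = "http://localhost:8769/mcp"


def build_tool_call(tool, arguments, request_id=1):
    """Return a JSON-RPC 2.0 `tools/call` request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def post_mcp(payload, url=MCP_URL, timeout=60):
    """POST one JSON-RPC message and return the raw response body as text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # MCP HTTP servers may answer with JSON or an event stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# `prompt` is a hypothetical argument name used only for illustration.
example_call = build_tool_call("llm/complete", {"prompt": "Summarize this log."})
print(json.dumps(example_call, indent=2))
```

Passing `example_call` to `post_mcp` from a host that can reach the service would send the request; the snippet itself only prints the payload.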

Default Behavior

`llm/complete` defaults to `qwen3.5-35b-a3b` in the current source.
`llm/json` defaults to `gemma4-26b`.
Thinking is off by default for `llm/complete`.
Thinking is always off for `llm/json`.
`llm/models` reports only the models surfaced by the MCP server, not the full LiteLLM inventory.
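
The defaults above can be read as a small resolution step applied to each call. This is a sketch of that reading, not the server's actual code: the function `resolve_call` and the parameter names `model` and `thinking` are assumptions introduced for illustration.

```python
# Default model per tool, as listed in this README.
DEFAULT_MODELS = {
    "llm/complete": "qwen3.5-35b-a3b",
    "llm/json": "gemma4-26b",
}


def resolve_call(tool, model=None, thinking=None):
    """Return the effective model and thinking flag for a tool call."""
    if tool not in DEFAULT_MODELS:
        raise ValueError(f"{tool!r} does not take a model")

    # Fall back to the tool's default when the caller names no model.
    effective_model = model or DEFAULT_MODELS[tool]

    if tool == "llm/json":
        # Thinking is always off for structured output, whatever was requested.
        effective_thinking = False
    else:
        # Thinking is off unless the caller explicitly asks for it.
        effective_thinking = bool(thinking)

    return {"model": effective_model, "thinking": effective_thinking}


print(resolve_call("llm/complete"))
print(resolve_call("llm/json", thinking=True))
```

The second call shows the asymmetry: a thinking request on `llm/json` is ignored, while the same request on `llm/complete` would be honored.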

Operations

# Build
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml build

# Start
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml up -d

# Health
curl -sf http://localhost:8769/health

# Logs
docker logs -f litellm-mcp

# Restart
docker compose -f /mnt/apps/docker/ai/litellm-mcp/docker-compose.yml restart
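
After a start or restart, the container may need a moment before `/health` answers. The sketch below polls the same URL as the health command above until it responds. The helper names, retry count, and delay are illustrative choices, not settings taken from the service.

```python
import time
import urllib.request

# Same URL as the curl health check above.
HEALTH_URL = "http://localhost:8769/health"


def probe_health(url=HEALTH_URL, timeout=2.0):
    """Return True if GET /health answers with a 2xx status, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(probe=probe_health, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll `probe` until it succeeds.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling `wait_until_healthy()` right after the start command would block for at most about 30 seconds with these defaults. The `probe` and `sleep` parameters exist so the loop can be exercised without a running service.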

Source Of Truth

Runtime model inventory: `/mnt/apps/docker/ai/litellm-observability/litellm/config.yaml`
MCP source: `/mnt/apps/src/litellm-mcp/`
Service OpenAPI: `/mnt/apps/docker/ai/litellm-mcp/openapi.yaml`
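
To pull the runtime model names out of the LiteLLM config without extra dependencies, a line scan for `model_name:` entries is usually enough. This is a rough sketch that assumes the standard LiteLLM `model_list` layout. The sample text below is made up for illustration and is not a copy of the real `config.yaml`. A proper YAML parser would be more robust.

```python
import re

# Matches lines such as "  - model_name: glm-4-7-flash" (optionally quoted).
MODEL_NAME_LINE = re.compile(r"^\s*(?:-\s*)?model_name:\s*['\"]?([^'\"#\s]+)")


def runtime_model_names(config_text):
    """Return model names in file order, without duplicates."""
    names = []
    for line in config_text.splitlines():
        match = MODEL_NAME_LINE.match(line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


# Illustrative sample only; the backend values are placeholders.
SAMPLE_CONFIG = """\
model_list:
  - model_name: glm-4-7-flash
    litellm_params:
      model: openai/glm-4-7-flash
  - model_name: "gemma4-26b"
    litellm_params:
      model: openai/gemma4-26b
"""

print(runtime_model_names(SAMPLE_CONFIG))
```

To run it against the real inventory, read the config file listed above and pass its text to `runtime_model_names`.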