RAGAS evaluation service for the haiven-knowledge RAG pipeline. Measures retrieval quality using Context Precision and Faithfulness metrics against a curated golden dataset of infrastructure questions. Uses GLM-4.7-Flash as the judge LLM via LiteLLM.
| Property | Value |
|---|---|
| Status | Live |
| Version | 1.1.0 |
| Port | 8470 (internal only — no Traefik routing) |
| Domain | None |
| GPU | None (CPU-only) |
| Network | web (Docker internal) |
| Compose file | /mnt/apps/docker/ai/haiven-ragas/docker-compose.yml |
| Source code | /mnt/apps/src/haiven-ragas/ |
| Results storage | /mnt/apps/data/ragas/ (host) → /data/results (container) |
haiven-ragas wraps the RAGAS evaluation framework (v0.4.3) in a FastAPI service. On each evaluation run it:
POST /evaluate
|
+--> haiven-knowledge:8022/v1/search (retrieve context chunks, top_k=10)
|
+--> LiteLLM:4000 --> GLM-4.7-Flash (generate RAG answer per question)
|
+--> RAGAS evaluate() (ContextPrecision + Faithfulness)
| judge_llm: GLM-4.7-Flash via LiteLLM
| run_config: max_workers=2, timeout=300
|
+--> /data/results/<timestamp>[-label].json (persist result)
|
+--> EvaluateResponse (scores, quality_gate pass/fail)
haiven-ragas has no public domain and is only reachable from containers on the web Docker network.
Context Precision measures whether the retrieved context chunks are relevant to the ground-truth answer. The judge LLM scores each chunk as relevant or irrelevant; position-weighted averaging produces the final score.
Context Precision does not measure answer quality — it measures whether the right context was retrieved.
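As an illustration of the position-weighted averaging described above, here is a minimal sketch that follows the formula given in the RAGAS documentation (mean of precision@k over the ranks that hold a relevant chunk). The function name and the 0/1 verdict list are assumptions made for this example; the library's own implementation may differ in details.

```python
def context_precision(verdicts: list[int]) -> float:
    """Position-weighted precision over ranked chunks.

    verdicts[i] is 1 if the judge marked the chunk at rank i+1 as relevant,
    0 otherwise. Returns 0.0 when no chunk is relevant.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    hits_so_far = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        hits_so_far += verdict
        if verdict:
            # precision@rank, counted only at ranks that hold a relevant chunk
            weighted_sum += hits_so_far / rank
    return weighted_sum / relevant_total
```

With the same two relevant chunks, `[1, 1, 0, 0]` scores higher than `[0, 0, 1, 1]`, because relevant chunks ranked early contribute a larger precision@k.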
Faithfulness measures whether the generated answer is factually grounded in the retrieved contexts. The judge LLM decomposes the answer into claims and checks each claim against the context.
A low Faithfulness score indicates the RAG answer is hallucinating — generating content not present in the retrieved chunks.
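The claim-checking step reduces to a simple ratio. The sketch below assumes the judge has already produced one supported/unsupported verdict per claim; the function name is hypothetical, and returning NaN for an answer with no claims is an assumption for this example, not confirmed service behavior.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        # Assumption: an answer with no extractable claims is unscorable.
        return float("nan")
    return sum(claim_supported) / len(claim_supported)
```

An answer with four claims, one of which is not found in the context, would score 0.75.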
The golden dataset consists of 78 question/ground-truth pairs organized across five evaluation domains:
| Domain | Questions | Focus |
|---|---|---|
| business | 15 | Infrastructure ops: GPU layout, service ports, storage, networking |
| learning | 39 | LLM config, vLLM, embedding, STT/TTS, knowledge pipeline, RAGAS |
| meta | 12 | Operator workflow, MCP tools, ingestion patterns, API usage |
| personal | 6 | Meeting notes, calendar, task integration |
| creative | 6 | TTS, STT audio handling, image generation |
Selected examples:
- "What GPU runs GLM-4.7-Flash?" → "Alpha GPU (GPU-eef2a28f, GPU index 3)"
- "What vLLM max-num-seqs is set for GLM?" → "4"
- "Why does GLM-4.7-Flash require a CodeFenceStrippingChatOpenAI wrapper in RAGAS?" → full explanation
The full dataset is defined in /mnt/apps/src/haiven-ragas/src/golden_dataset.py.
Note: The service was initially deployed against 25 questions (the business and part of learning domains). The dataset has since grown to 78 questions. The quality gate baseline of 0.672 was established at 25 questions.
GLM-4.7-Flash is used for two roles in each evaluation run:
- RAG answer generation: called with enable_thinking: False to get answers in content without reasoning overhead
- RAGAS judging: called through the CodeFenceStrippingChatOpenAI wrapper for ContextPrecision and Faithfulness metric scoring

GLM-4.7-Flash consistently wraps its JSON output in markdown code fences (```json ... ```). RAGAS expects raw JSON. The CodeFenceStrippingChatOpenAI class in evaluator.py subclasses langchain_openai.ChatOpenAI and strips these fences in both the _generate and _agenerate paths.
If the judge model is changed to a different LLM, verify the new model returns raw JSON or add a similar wrapper.
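For anyone writing a similar wrapper for another model, the core of the fix is a small text transform. The sketch below is not the code from evaluator.py; the regex and the function name are assumptions, and it only handles a response that consists of a single fenced block.

```python
import re

# Matches a response that is one fenced block, with or without a "json" tag.
_FENCE_RE = re.compile(r"^\s*```(?:json)?\s*(.*?)\s*```\s*$", re.DOTALL | re.IGNORECASE)


def strip_code_fences(text: str) -> str:
    """Return the payload inside a single markdown code fence.

    Text that is not fenced is returned unchanged apart from trimming
    surrounding whitespace, so raw JSON passes through safely.
    """
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()
```

A wrapper would apply this to the message content before RAGAS parses it, on both the sync and async generation paths.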
The vLLM instance serving GLM is configured with max-num-seqs=4. RAGAS defaults to more concurrent workers than this handles cleanly.
Fix applied: RunConfig(max_workers=2, timeout=300, max_retries=10, max_wait=60) — keeps concurrent LLM calls at 2, well within GLM's capacity.
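The relationship between the two limits can be written down as a small helper. This sketch applies the "at most half of max-num-seqs" rule from the operational notes in this document; the function names are hypothetical, and the returned dictionary is meant to be passed to RunConfig as keyword arguments rather than being the service's actual code.

```python
def max_workers_for(max_num_seqs: int) -> int:
    """Cap RAGAS concurrency at half of vLLM's max-num-seqs, never below 1."""
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be a positive integer")
    return max(1, max_num_seqs // 2)


def run_config_kwargs(max_num_seqs: int) -> dict:
    """Keyword arguments mirroring the RunConfig values documented above."""
    return {
        "max_workers": max_workers_for(max_num_seqs),
        "timeout": 300,
        "max_retries": 10,
        "max_wait": 60,
    }
```

With the current max-num-seqs=4 this yields max_workers=2, matching the applied fix.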
GLM is called with extra_body: {"enable_thinking": False} so answers land in content. If content is None (rare GLM edge case), the code falls back to reasoning_content. This differs from Seed-36B, which uses chat_template_kwargs: {"thinking_budget": 0} — do not confuse the two parameter names.
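The fallback described above can be sketched as a pure function over the response message. The dictionary shape (content and reasoning_content keys) follows the description in this section; the function name and the empty-string result when both fields are missing are assumptions for illustration.

```python
from typing import Any

# Passed as extra_body on GLM requests so the answer lands in "content".
GLM_EXTRA_BODY = {"enable_thinking": False}


def extract_answer(message: dict[str, Any]) -> str:
    """Prefer content; fall back to reasoning_content when content is None."""
    content = message.get("content")
    if content is not None:
        return content
    # Rare GLM edge case: the answer only appears in reasoning_content.
    return message.get("reasoning_content") or ""
```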
RAGAS 0.4.3 returns result["context_precision"] and result["faithfulness"] as List[float] (one score per question row), not a scalar mean. Earlier RAGAS versions returned a scalar directly — this is a breaking change.
Required handling (from evaluator.py):
import math

cp_scores_list = result["context_precision"]  # List[float], one score per question, not a float
cp_valid = [s for s in cp_scores_list if not math.isnan(s)]  # drop rows the judge failed to score
context_precision = sum(cp_valid) / len(cp_valid) if cp_valid else 0.0  # mean of valid scores
NaN scores occur when the judge LLM fails to return parseable JSON. The service:
- Logs a warning for each NaN question
- Averages only valid scores
- Reports scored_count vs question_count in the result JSON
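The three behaviors listed above can be combined in one helper. This is a sketch only: the function name, logger name, and returned dictionary shape are assumptions, and the real service may compute scored_count across both metrics rather than per metric.

```python
import logging
import math

logger = logging.getLogger("ragas_nan_handling")  # hypothetical logger name


def summarize_metric(name: str, scores: list[float], questions: list[str]) -> dict:
    """Average the valid scores and report how many rows were scored."""
    valid = []
    for question, score in zip(questions, scores):
        if math.isnan(score):
            logger.warning("%s is NaN for question: %s", name, question)
        else:
            valid.append(score)
    return {
        "score": sum(valid) / len(valid) if valid else 0.0,
        "scored_count": len(valid),
        "question_count": len(scores),
    }
```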
All settings use the RAGAS_ prefix (loaded from environment variables or .env):
| Variable | Default | Description |
|---|---|---|
| RAGAS_KNOWLEDGE_URL | http://haiven-knowledge:8022 | haiven-knowledge search endpoint |
| RAGAS_LITELLM_URL | http://litellm:4000 | LiteLLM proxy URL |
| RAGAS_LITELLM_API_KEY | "" | LiteLLM master key |
| RAGAS_JUDGE_MODEL | glm-4-7-flash | Model for RAG answer generation and RAGAS judging |
| RAGAS_EMBED_MODEL | qwen3-embedding-4b | Embedding model (metadata, not used by current metrics) |
| RAGAS_RESULTS_PATH | /data/results | Directory for result JSON files |
| RAGAS_LANGFUSE_HOST | http://langfuse-web:3000 | Langfuse tracing endpoint |
| RAGAS_LANGFUSE_PUBLIC_KEY | "" | Langfuse public key |
| RAGAS_LANGFUSE_SECRET_KEY | "" | Langfuse secret key |
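To show how the RAGAS_ prefix maps variables to settings, here is a dependency-free approximation. The service itself uses Pydantic settings in config.py; this sketch swaps in a plain dataclass and a hypothetical load_settings helper, and it does not read .env files.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class RagasSettings:
    # Defaults copied from the table above.
    knowledge_url: str = "http://haiven-knowledge:8022"
    litellm_url: str = "http://litellm:4000"
    litellm_api_key: str = ""
    judge_model: str = "glm-4-7-flash"
    embed_model: str = "qwen3-embedding-4b"
    results_path: str = "/data/results"
    langfuse_host: str = "http://langfuse-web:3000"
    langfuse_public_key: str = ""
    langfuse_secret_key: str = ""


def load_settings(env: Optional[Mapping[str, str]] = None, prefix: str = "RAGAS_") -> RagasSettings:
    """Override defaults with any PREFIX + FIELD_NAME variables that are set."""
    source = os.environ if env is None else env
    overrides = {}
    for field in fields(RagasSettings):
        key = prefix + field.name.upper()
        if key in source:
            overrides[field.name] = source[key]
    return RagasSettings(**overrides)
```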
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Liveness probe; returns {"status":"ok"} |
| POST | /evaluate | Run RAGAS evaluation (synchronous, 10-20 min) |
| GET | /results/latest | Return the most recent result JSON |
See openapi.yaml for full request/response schemas.
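A minimal client sketch for the three endpoints, using only the standard library. The base URL assumes the container is reachable by the name haiven-ragas on the web Docker network, and the /evaluate request body is left to the caller because its schema lives in openapi.yaml; both are assumptions for this example.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://haiven-ragas:8470"  # assumption: container hostname on the web network
EVALUATE_TIMEOUT_S = 25 * 60  # evaluation is synchronous and can take 10-20 minutes


def build_request(path: str, payload: Optional[dict] = None, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request, or a JSON POST when a payload is given."""
    if payload is None:
        return urllib.request.Request(base_url + path, method="GET")
    data = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(base_url + path, data=data, headers=headers, method="POST")


def call(path: str, payload: Optional[dict] = None, timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Typical usage from another container would be `call("/health")`, then `call("/evaluate", payload, timeout=EVALUATE_TIMEOUT_S)` with a payload that matches openapi.yaml, then `call("/results/latest")`.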
| Metric | Hard Gate | Aspirational Target |
|---|---|---|
| Context Precision | >= 0.65 | >= 0.72 |
| Faithfulness | >= 0.50 | >= 0.85 |
Both metrics must pass the hard gate for quality_gate.passed to be true. The aspirational targets are informational only — reported in quality gate runs but not enforced.
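The gate logic can be sketched as follows. The thresholds come from the table above; the function name and the shape of the returned dictionary are hypothetical and not taken from quality_gate.py.

```python
HARD_GATES = {"context_precision": 0.65, "faithfulness": 0.50}
ASPIRATIONAL_TARGETS = {"context_precision": 0.72, "faithfulness": 0.85}


def evaluate_gate(scores: dict) -> dict:
    """Pass only if every metric meets its hard gate; targets are informational."""
    metrics = {}
    for name, threshold in HARD_GATES.items():
        score = scores[name]
        metrics[name] = {
            "score": score,
            "threshold": threshold,
            "passed": score >= threshold,
            "meets_target": score >= ASPIRATIONAL_TARGETS[name],  # reported, not enforced
        }
    return {"passed": all(m["passed"] for m in metrics.values()), "metrics": metrics}
```

Feeding in scores of 0.4496 and 0.9470 fails the gate on Context Precision even though Faithfulness clears both its gate and its target.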
The quality_gate module (src/quality_gate.py) and regression module (src/regression.py) provide CLI tools for scripted quality gate enforcement and baseline comparison. See USER_GUIDE.md for usage.
| Date | Context Precision | Faithfulness | Questions | Result |
|---|---|---|---|---|
| 2026-02-27 | 0.4496 | 0.9470 | 68/78 | FAILED (CP below 0.65 — dataset expanded to 78q) |
| 2026-02-24 | 0.672 | — | 25/25 | PASSED (original 25q baseline) |
The 0.672 Context Precision baseline was achieved on the original 25-question dataset after ingesting MEMORY.md into haiven-knowledge. When the dataset was expanded to 78 questions across 5 domains, Context Precision dropped to 0.4496 — indicating knowledge gaps in the expanded question set. Faithfulness measured 0.9470, well above the 0.50 gate, showing the RAG pipeline answers faithfully from whatever context it retrieves.
| Service | Role |
|---|---|
| haiven-knowledge (8022) | Provides retrieval contexts via POST /v1/search |
| LiteLLM (4000) | Routes judge LLM calls to GLM-4.7-Flash |
| GLM-4.7-Flash (vllm-glm-flash, port 6000, Alpha GPU) | Judge model and RAG answer generator |
| Langfuse (optional) | Trace logging |
Results are saved to /mnt/apps/data/ragas/ (host) → /data/results (container). Filenames use ISO timestamps:
2026-02-24T18-30-00.000.json
2026-02-24T18-30-00.000-post-memory-md-ingest.json # when label is set
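The naming pattern in the two examples above can be reproduced with a short helper. This is inferred from the example filenames only; the function name is hypothetical, and any label sanitization the service may perform is not modeled.

```python
from datetime import datetime
from typing import Optional


def result_filename(timestamp: datetime, label: Optional[str] = None) -> str:
    """Format <ISO timestamp with hyphens and milliseconds>[-label].json."""
    stem = timestamp.strftime("%Y-%m-%dT%H-%M-%S")
    millis = timestamp.microsecond // 1000
    name = f"{stem}.{millis:03d}"
    if label:
        name += f"-{label}"
    return name + ".json"
```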
Each result JSON contains:
- timestamp, label, judge_model, embed_model, knowledge_url
- question_count, scored_count
- scores: {context_precision: float, faithfulness: float}
- quality_gate: per-metric threshold, score, and pass/fail
- per_question: list of per-question detail including contexts preview, generated answer, and individual metric scores
Quality gate runs are saved separately as quality-gate-<YYYY-MM-DD-HHMM>.json in the same directory.
Regression reports are saved to /data/results/regression/ as <timestamp>-regression-report.md and .json.
# Start
cd /mnt/apps/docker/ai/haiven-ragas
docker compose up -d
# Rebuild after source code changes
docker compose up -d --build
# View logs
docker logs -f haiven-ragas
# Health check
curl http://localhost:8470/health
GLM wraps JSON in code fences — handled automatically by CodeFenceStrippingChatOpenAI. If you switch judge models, verify the new model returns raw JSON.
RAGAS 0.4.3 returns List[float], not scalar — never use float(result["context_precision"]) directly. Always filter NaN and average manually.
RunConfig max_workers must stay at 2 — increasing it overwhelms GLM's max-num-seqs=4 and causes timeouts. If GLM's max-num-seqs is increased, max_workers can be increased proportionally (to at most half of max-num-seqs).
Evaluation is synchronous and slow — 10-20 minutes for the full golden dataset (answer generation + two RAGAS metric passes). Do not set HTTP client timeouts below 25 minutes.
No context = score 0 — when haiven-knowledge returns no results for a question, a [no context retrieved] placeholder forces a score of 0. This surfaces retrieval gaps in the metric rather than silently skipping rows.
No healthcheck in compose — the container has no Docker healthcheck defined. docker ps shows Up immediately after start. The service is ready within a few seconds of startup (no heavy model loading).
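The no-context rule from the notes above amounts to a one-line substitution before scoring. The placeholder string is the one quoted in this document; the function name is an assumption for illustration.

```python
NO_CONTEXT_PLACEHOLDER = "[no context retrieved]"


def contexts_for_scoring(retrieved: list[str]) -> list[str]:
    """Keep the row in the evaluation even when retrieval returned nothing."""
    # The placeholder matches no ground truth, so the judge scores the row 0
    # and the retrieval gap shows up in the metric instead of being skipped.
    return list(retrieved) or [NO_CONTEXT_PLACEHOLDER]
```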
/mnt/apps/src/haiven-ragas/
├── Dockerfile # python:3.12-slim, port 8470
├── requirements.txt # ragas, fastapi, langchain-openai, datasets
└── src/
├── main.py # FastAPI app, request/response models, endpoints
├── evaluator.py # RAGAS evaluation logic, CodeFenceStrippingChatOpenAI
├── config.py # Pydantic settings (RAGAS_ prefix)
├── golden_dataset.py # 78 question/ground_truth pairs across 5 domains
├── quality_gate.py # CLI: enforce hard gates, save report to disk
└── regression.py # CLI: baseline comparison, per-question delta report
| Service | Relationship |
|---|---|
| haiven-knowledge (port 8022) | Subject of evaluation — its retrieval quality is what's being measured |
| haiven-reranker (port 8460) | Used by haiven-knowledge — reranker quality affects Context Precision |
| LiteLLM (port 4000) | Routes all LLM calls for both answer generation and RAGAS judging |
| GLM-4.7-Flash (port 6000) | Judge model and RAG answer generator |