haiven-ragas

RAGAS evaluation service for the haiven-knowledge RAG pipeline. Measures retrieval quality using Context Precision and Faithfulness metrics against a curated golden dataset of infrastructure questions. Uses GLM-4.7-Flash as the judge LLM via LiteLLM.

Overview

Property	Value
Status	Live
Version	1.1.0
Port	8470 (internal only — no Traefik routing)
Domain	None
GPU	None (CPU-only)
Network	`web` (Docker internal)
Compose file	`/mnt/apps/docker/ai/haiven-ragas/docker-compose.yml`
Source code	`/mnt/apps/src/haiven-ragas/`
Results storage	`/mnt/apps/data/ragas/` (host) → `/data/results` (container)

Architecture

haiven-ragas wraps the RAGAS evaluation framework (v0.4.3) in a FastAPI service. On each evaluation run it:

1. Queries haiven-knowledge for context chunks for every golden question
2. Generates a RAG answer using GLM-4.7-Flash (thinking disabled for speed)
3. Passes the question/contexts/ground_truth/answer tuple through the RAGAS ContextPrecision and Faithfulness metrics
4. Persists a timestamped result JSON and returns aggregated scores with a quality gate verdict

POST /evaluate
    |
    +--> haiven-knowledge:8022/v1/search      (retrieve context chunks, top_k=10)
    |
    +--> LiteLLM:4000 --> GLM-4.7-Flash       (generate RAG answer per question)
    |
    +--> RAGAS evaluate()                      (ContextPrecision + Faithfulness)
    |        judge_llm: GLM-4.7-Flash via LiteLLM
    |        run_config: max_workers=2, timeout=300
    |
    +--> /data/results/<timestamp>[-<label>].json  (persist result; label suffix is optional)
    |
    +--> EvaluateResponse (scores, quality_gate pass/fail)

haiven-ragas has no public domain and is only reachable from containers on the web Docker network.
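
The first two stages of a run (retrieve contexts, then generate an answer) can be sketched as a row-building loop. This is a minimal illustration, not the code in evaluator.py: `build_rows`, `search`, and `generate` are hypothetical names, and the two callables stand in for the haiven-knowledge and LiteLLM calls. The placeholder string is the one described under Known Issues.

```python
from typing import Callable, Dict, List

# Placeholder used when retrieval returns nothing (see Known Issues).
NO_CONTEXT = "[no context retrieved]"


def build_rows(
    golden: List[Dict[str, str]],
    search: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> List[Dict[str, object]]:
    """Assemble one question/contexts/ground_truth/answer row per golden question."""
    rows: List[Dict[str, object]] = []
    for item in golden:
        # Hypothetical stand-in for POST /v1/search on haiven-knowledge.
        contexts = search(item["question"]) or [NO_CONTEXT]
        rows.append(
            {
                "question": item["question"],
                "contexts": contexts,
                "ground_truth": item["ground_truth"],
                # Hypothetical stand-in for the GLM-4.7-Flash call via LiteLLM.
                "answer": generate(item["question"], contexts),
            }
        )
    return rows
```

Injecting the two callables keeps the loop testable without a running haiven-knowledge or LiteLLM instance.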

RAGAS Metrics

Context Precision

Measures whether retrieved context chunks are relevant to the ground-truth answer. The judge LLM scores each chunk as relevant or irrelevant; position-weighted averaging produces the final score.

Score range: 0.0 (all retrieved context irrelevant) to 1.0 (all retrieved context perfectly relevant)
Quality gate (hard): >= 0.65
Aspirational target: >= 0.72
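Current baseline: 0.672 (PASSED; established 2026-02-24 on the original 25-question dataset, 25/25 questions scored). The most recent run on the 78-question dataset scored 0.4496; see Quality Gate History.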

Context Precision does not measure answer quality — it measures whether the right context was retrieved.
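
To make the position weighting concrete, here is a small sketch of the averaging step, following the formula in the RAGAS documentation (precision at each rank, counted only at ranks judged relevant, divided by the number of relevant chunks). The function name and the 0/1 verdict list are illustrative; in the service the verdicts come from the judge LLM.

```python
from typing import Sequence


def context_precision(verdicts: Sequence[int]) -> float:
    """verdicts[i] is 1 if the chunk at rank i+1 was judged relevant, else 0."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0  # nothing relevant was retrieved
    hits = 0
    weighted = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        hits += verdict
        # Precision@rank only contributes at ranks holding a relevant chunk.
        weighted += (hits / rank) * verdict
    return weighted / relevant_total
```

With this weighting, a single relevant chunk at rank 1 (`[1, 0]`) scores 1.0, while the same chunk at rank 2 (`[0, 1]`) scores 0.5, which is why reranker quality shows up directly in this metric.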

Faithfulness

Measures whether the generated answer is factually grounded in the retrieved contexts. The judge LLM decomposes the answer into claims and checks each claim against the context.

Score range: 0.0 (answer contradicts or ignores context) to 1.0 (every claim supported by context)
Quality gate (hard): >= 0.50
Aspirational target: >= 0.85

A low Faithfulness score indicates the RAG answer is hallucinating — generating content not present in the retrieved chunks.
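
The score itself reduces to a ratio of supported claims to total claims. The sketch below assumes the per-claim verdicts are already available as booleans; the function name is hypothetical, and returning NaN when no claims were extracted is an assumption of this sketch.

```python
from typing import Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        # Assumption: treat "no claims extracted" as unscorable rather than 0.
        return float("nan")
    return sum(claim_supported) / len(claim_supported)
```

For example, an answer with four claims of which three are supported scores 0.75.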

Golden Dataset

The golden dataset consists of 78 question/ground-truth pairs organized across five evaluation domains:

Domain	Questions	Focus
`business`	15	Infrastructure ops: GPU layout, service ports, storage, networking
`learning`	39	LLM config, vLLM, embedding, STT/TTS, knowledge pipeline, RAGAS
`meta`	12	Operator workflow, MCP tools, ingestion patterns, API usage
`personal`	6	Meeting notes, calendar, task integration
`creative`	6	TTS, STT audio handling, image generation

Selected examples:
- "What GPU runs GLM-4.7-Flash?" → "Alpha GPU (GPU-eef2a28f, GPU index 3)"
- "What vLLM max-num-seqs is set for GLM?" → "4"
- "Why does GLM-4.7-Flash require a CodeFenceStrippingChatOpenAI wrapper in RAGAS?" → full explanation

The full dataset is defined in /mnt/apps/src/haiven-ragas/src/golden_dataset.py.

Note: The service was initially deployed against 25 questions (the business domain and part of the learning domain). The dataset has since grown to 78 questions. The quality gate baseline of 0.672 was established on the 25-question set, so it is not directly comparable to scores from the 78-question set.

Judge Model: GLM-4.7-Flash

GLM-4.7-Flash is used for two roles in each evaluation run:

- RAG answer generation: called directly via LiteLLM with `enable_thinking: False` to get answers in `content` without reasoning overhead
- RAGAS judge: called via the CodeFenceStrippingChatOpenAI wrapper for ContextPrecision and Faithfulness metric scoring

GLM Behavior: Code Fence Wrapping

GLM-4.7-Flash consistently wraps its JSON output in markdown code fences (```json ... ```). RAGAS expects raw JSON. The CodeFenceStrippingChatOpenAI class in evaluator.py subclasses langchain_openai.ChatOpenAI and strips these fences in both _generate and _agenerate paths.

If the judge model is changed to a different LLM, verify the new model returns raw JSON or add a similar wrapper.
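
The stripping step can be sketched as a standalone function. This is not the code from evaluator.py: `strip_code_fences` and the regular expression are illustrative, and the real class applies its stripping inside the `_generate` and `_agenerate` overrides of the ChatOpenAI subclass.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Optional language tag after the opening fence, then the payload, then the closing fence.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fences(text: str) -> str:
    """Return the payload inside a single wrapping code fence, or the text unchanged."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()
```

Output that is already raw JSON passes through untouched, so the same helper would be safe to keep in place while testing a replacement judge model.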

GLM Behavior: max-num-seqs Limit

The vLLM instance serving GLM is configured with max-num-seqs=4. The default RAGAS RunConfig allows more concurrent workers than this limit handles cleanly, which leads to timeouts.

Fix applied: RunConfig(max_workers=2, timeout=300, max_retries=10, max_wait=60) — keeps concurrent LLM calls at 2, well within GLM's capacity.
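
The relationship between the two settings can be written down as a small helper, using the rule of thumb given under Known Issues (max_workers at most half of max-num-seqs). The function name is hypothetical; the service itself hard-codes the value 2.

```python
def max_workers_for(max_num_seqs: int) -> int:
    """Largest RAGAS max_workers that stays within half of vLLM's max-num-seqs."""
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be at least 1")
    # Never drop below one worker, even for very small serving limits.
    return max(1, max_num_seqs // 2)
```

For the current deployment, `max_workers_for(4)` gives 2, matching the applied RunConfig.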

GLM Behavior: RAG Answer Generation

GLM is called with extra_body: {"enable_thinking": False} so answers land in content. If content is None (rare GLM edge case), the code falls back to reasoning_content. This differs from Seed-36B, which uses chat_template_kwargs: {"thinking_budget": 0} — do not confuse the two parameter names.
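
The fallback can be sketched as follows. This assumes the response message has been converted to a plain dict with `content` and `reasoning_content` keys; the function name and the empty-string result when both are missing are assumptions of this sketch, not documented behavior.

```python
from typing import Dict, Optional


def extract_answer(message: Dict[str, Optional[str]]) -> str:
    """Prefer the normal answer field; fall back to reasoning text if it is None."""
    content = message.get("content")
    if content is not None:
        return content
    # Rare GLM edge case: the answer landed in reasoning_content instead.
    return message.get("reasoning_content") or ""
```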

RAGAS 0.4.3: Score Handling

RAGAS 0.4.3 returns result["context_precision"] and result["faithfulness"] as List[float] (one score per question row), not a scalar mean. Earlier RAGAS versions returned a scalar directly — this is a breaking change.

Required handling (from evaluator.py):

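import math

cp_scores_list = result["context_precision"]           # List[float], not float (one score per question)
cp_valid = [s for s in cp_scores_list if not math.isnan(s)]  # drop rows the judge could not score
context_precision = sum(cp_valid) / len(cp_valid) if cp_valid else 0.0  # mean of valid scores, 0.0 if none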

NaN scores occur when the judge LLM fails to return parseable JSON. The service:
- Logs a warning for each NaN question
- Averages only valid scores
- Reports scored_count vs question_count in the result JSON
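
A self-contained version of this handling, returning both the mean and the number of rows that were actually scored, might look like the sketch below. The function name and logger are illustrative, and how the service combines counts across the two metrics is not specified here, so this sketch works on one metric at a time.

```python
import logging
import math
from typing import Sequence, Tuple

logger = logging.getLogger("ragas-scores")  # hypothetical logger name


def aggregate_scores(metric: str, scores: Sequence[float]) -> Tuple[float, int]:
    """Return (mean of non-NaN scores, count of non-NaN scores) for one metric."""
    valid = []
    for index, score in enumerate(scores):
        if math.isnan(score):
            # The judge LLM did not return parseable JSON for this question.
            logger.warning("%s: NaN score for question %d", metric, index)
        else:
            valid.append(score)
    mean = sum(valid) / len(valid) if valid else 0.0
    return mean, len(valid)
```

Comparing the returned count with the number of questions gives the scored_count versus question_count figures reported in the result JSON.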

Configuration

All settings use the RAGAS_ prefix (loaded from environment variables or .env):

Variable	Default	Description
`RAGAS_KNOWLEDGE_URL`	`http://haiven-knowledge:8022`	haiven-knowledge search endpoint
`RAGAS_LITELLM_URL`	`http://litellm:4000`	LiteLLM proxy URL
`RAGAS_LITELLM_API_KEY`	`""`	LiteLLM master key
`RAGAS_JUDGE_MODEL`	`glm-4-7-flash`	Model for RAG answer generation and RAGAS judging
`RAGAS_EMBED_MODEL`	`qwen3-embedding-4b`	Embedding model (metadata, not used by current metrics)
`RAGAS_RESULTS_PATH`	`/data/results`	Directory for result JSON files
`RAGAS_LANGFUSE_HOST`	`http://langfuse-web:3000`	Langfuse tracing endpoint
`RAGAS_LANGFUSE_PUBLIC_KEY`	`""`	Langfuse public key
`RAGAS_LANGFUSE_SECRET_KEY`	`""`	Langfuse secret key
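
The prefix convention can be illustrated with a standard-library sketch. The real config.py uses Pydantic settings; the `RagasSettings` dataclass and `load_settings` function below are hypothetical stand-ins that mirror the defaults in the table, with the Langfuse variables omitted for brevity.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class RagasSettings:
    knowledge_url: str = "http://haiven-knowledge:8022"
    litellm_url: str = "http://litellm:4000"
    litellm_api_key: str = ""
    judge_model: str = "glm-4-7-flash"
    embed_model: str = "qwen3-embedding-4b"
    results_path: str = "/data/results"


def load_settings(env: Mapping[str, str] = os.environ) -> RagasSettings:
    """Override each default with RAGAS_<FIELD_NAME> when that variable is set."""
    overrides = {}
    for field in fields(RagasSettings):
        key = f"RAGAS_{field.name.upper()}"
        if key in env:
            overrides[field.name] = env[key]
    return RagasSettings(**overrides)
```

For example, setting `RAGAS_JUDGE_MODEL` changes only `judge_model` and leaves every other value at its default.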

API Endpoints

Method	Endpoint	Description
GET	`/health`	Liveness probe — returns `{"status":"ok"}`
POST	`/evaluate`	Run RAGAS evaluation (synchronous, 10-20 min)
GET	`/results/latest`	Return the most recent result JSON

See openapi.yaml for full request/response schemas.
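
Because `/evaluate` blocks for the whole run, a client needs a long timeout. The sketch below builds the request with the standard library. The `label` request field is an assumption inferred from the labelled result filenames (openapi.yaml is the authority on the request schema), and the base URL is the one used by the health check under Deployment.

```python
import json
import urllib.request
from typing import Optional

# Known Issues advises against client timeouts below 25 minutes.
EVALUATE_TIMEOUT_S = 25 * 60


def build_evaluate_request(
    base_url: str, label: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST /evaluate request; the `label` body field is an assumed schema."""
    body = {"label": label} if label else {}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/evaluate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (blocks until the evaluation finishes):
# request = build_evaluate_request("http://localhost:8470", "nightly")
# with urllib.request.urlopen(request, timeout=EVALUATE_TIMEOUT_S) as response:
#     result = json.load(response)
```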

Quality Gate Thresholds

Metric	Hard Gate	Aspirational Target
Context Precision	>= 0.65	>= 0.72
Faithfulness	>= 0.50	>= 0.85

Both metrics must pass the hard gate for quality_gate.passed to be true. The aspirational targets are informational only — reported in quality gate runs but not enforced.
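
The gate logic amounts to a per-metric threshold check combined with a logical AND. The sketch below uses the thresholds from the table; the function name and the exact shape of the returned dict are assumptions, not the structure produced by src/quality_gate.py.

```python
from typing import Dict

HARD_GATES = {"context_precision": 0.65, "faithfulness": 0.50}
TARGETS = {"context_precision": 0.72, "faithfulness": 0.85}


def evaluate_quality_gate(scores: Dict[str, float]) -> Dict[str, object]:
    """Check each metric against its hard gate; targets are reported, not enforced."""
    metrics = {
        name: {
            "score": scores[name],
            "threshold": gate,
            "passed": scores[name] >= gate,
            "meets_target": scores[name] >= TARGETS[name],  # informational only
        }
        for name, gate in HARD_GATES.items()
    }
    # The overall verdict requires every hard gate to pass.
    passed = all(m["passed"] for m in metrics.values())
    return {"passed": passed, "metrics": metrics}
```

Applied to the 2026-02-27 run (Context Precision 0.4496, Faithfulness 0.9470), the overall verdict is a failure even though Faithfulness clears both its gate and its target.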

The quality_gate module (src/quality_gate.py) and regression module (src/regression.py) provide CLI tools for scripted quality gate enforcement and baseline comparison. See USER_GUIDE.md for usage.

Quality Gate History

Date	Context Precision	Faithfulness	Questions	Result
2026-02-27	0.4496	0.9470	68/78	FAILED (CP below 0.65 — dataset expanded to 78q)
2026-02-24	0.672	—	25/25	PASSED (original 25q baseline)

The 0.672 Context Precision baseline was achieved on the original 25-question dataset after ingesting MEMORY.md into haiven-knowledge. When the dataset was expanded to 78 questions across 5 domains, Context Precision dropped to 0.4496 — indicating knowledge gaps in the expanded question set. Faithfulness measured 0.9470, well above the 0.50 gate, showing the RAG pipeline answers faithfully from whatever context it retrieves.

Dependencies

Service	Role
haiven-knowledge (8022)	Provides retrieval contexts via `POST /v1/search`
LiteLLM (4000)	Routes judge LLM calls to GLM-4.7-Flash
GLM-4.7-Flash (vllm-glm-flash, port 6000, Alpha GPU)	Judge model and RAG answer generator
Langfuse (optional)	Trace logging

Results Storage

Results are saved to /mnt/apps/data/ragas/ (host) → /data/results (container). Filenames use ISO 8601-style timestamps with hyphens in place of colons, followed by an optional label suffix:

2026-02-24T18-30-00.000.json
2026-02-24T18-30-00.000-post-memory-md-ingest.json   # when label is set
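
A filename in this format can be produced as in the sketch below. The function name is hypothetical, and whether the service uses UTC or local time, or sanitizes the label, is not stated here.

```python
from datetime import datetime
from typing import Optional


def result_filename(ts: datetime, label: Optional[str] = None) -> str:
    """Format a result filename: timestamp with millisecond precision, optional label."""
    # Hyphens replace the colons of an ISO timestamp so the name is filesystem-safe.
    stamp = ts.strftime("%Y-%m-%dT%H-%M-%S") + f".{ts.microsecond // 1000:03d}"
    suffix = f"-{label}" if label else ""
    return f"{stamp}{suffix}.json"
```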

Each result JSON contains:
- timestamp, label, judge_model, embed_model, knowledge_url
- question_count, scored_count
- scores: {context_precision: float, faithfulness: float}
- quality_gate: per-metric threshold, score, and pass/fail
- per_question: list of per-question detail including contexts preview, generated answer, and individual metric scores

Quality gate runs are saved separately as quality-gate-<YYYY-MM-DD-HHMM>.json in the same directory.

Regression reports are saved to /data/results/regression/ as <timestamp>-regression-report.md and .json.

Deployment

# Start
cd /mnt/apps/docker/ai/haiven-ragas
docker compose up -d

# Rebuild after source code changes
docker compose up -d --build

# View logs
docker logs -f haiven-ragas

# Health check
curl http://localhost:8470/health

Known Issues and Gotchas

- GLM wraps JSON in code fences: handled automatically by CodeFenceStrippingChatOpenAI. If you switch judge models, verify the new model returns raw JSON.
- RAGAS 0.4.3 returns List[float], not scalar: never use float(result["context_precision"]) directly. Always filter NaN and average manually.
- RunConfig max_workers must stay at 2: increasing it overwhelms GLM's max-num-seqs=4 and causes timeouts. If GLM's max-num-seqs is increased, max_workers can be increased proportionally (to at most half of max-num-seqs).
- Evaluation is synchronous and slow: 10-20 minutes for the full golden dataset (answer generation + two RAGAS metric passes). Do not set HTTP client timeouts below 25 minutes.
- No context = score 0: when haiven-knowledge returns no results for a question, a [no context retrieved] placeholder forces a score of 0. This surfaces retrieval gaps in the metric rather than silently skipping rows.
- No healthcheck in compose: the container has no Docker healthcheck defined. docker ps shows Up immediately after start. The service is ready within a few seconds of startup (no heavy model loading).

Source Code Structure

/mnt/apps/src/haiven-ragas/
├── Dockerfile                 # python:3.12-slim, port 8470
├── requirements.txt           # ragas, fastapi, langchain-openai, datasets
└── src/
    ├── main.py                # FastAPI app, request/response models, endpoints
    ├── evaluator.py           # RAGAS evaluation logic, CodeFenceStrippingChatOpenAI
    ├── config.py              # Pydantic settings (RAGAS_ prefix)
    ├── golden_dataset.py      # 78 question/ground_truth pairs across 5 domains
    ├── quality_gate.py        # CLI: enforce hard gates, save report to disk
    └── regression.py          # CLI: baseline comparison, per-question delta report

Related Services

Service	Relationship
haiven-knowledge (port 8022)	Subject of evaluation — its retrieval quality is what's being measured
haiven-reranker (port 8460)	Used by haiven-knowledge — reranker quality affects Context Precision
LiteLLM (port 4000)	Routes all LLM calls for both answer generation and RAGAS judging
GLM-4.7-Flash (port 6000)	Judge model and RAG answer generator

Documentation

USER_GUIDE.md — Running evaluations, reading scores, iterating on quality
openapi.yaml — OpenAPI 3.0 specification