haiven-ragas

RAGAS evaluation service for the haiven-knowledge RAG pipeline. Measures retrieval quality using Context Precision and Faithfulness metrics against a curated golden dataset of infrastructure questions. Uses GLM-4.7-Flash as the judge LLM via LiteLLM.

Overview

Property         Value
Status           Live
Version          1.1.0
Port             8470 (internal only; no Traefik routing)
Domain           None
GPU              None (CPU-only)
Network          web (Docker internal)
Compose file     /mnt/apps/docker/ai/haiven-ragas/docker-compose.yml
Source code      /mnt/apps/src/haiven-ragas/
Results storage  /mnt/apps/data/ragas/ (host) → /data/results (container)

Architecture

haiven-ragas wraps the RAGAS evaluation framework (v0.4.3) in a FastAPI service. On each evaluation run it:

  1. Queries haiven-knowledge for context chunks for every golden question
  2. Generates a RAG answer using GLM-4.7-Flash (thinking disabled for speed)
  3. Passes the question/contexts/ground_truth/answer tuple through RAGAS ContextPrecision and Faithfulness metrics
  4. Persists a timestamped result JSON and returns aggregated scores with a quality gate verdict

POST /evaluate
    |
    +--> haiven-knowledge:8022/v1/search      (retrieve context chunks, top_k=10)
    |
    +--> LiteLLM:4000 --> GLM-4.7-Flash       (generate RAG answer per question)
    |
    +--> RAGAS evaluate()                      (ContextPrecision + Faithfulness)
    |        judge_llm: GLM-4.7-Flash via LiteLLM
    |        run_config: max_workers=2, timeout=300
    |
    +--> /data/results/<timestamp>[label].json  (persist result)
    |
    +--> EvaluateResponse (scores, quality_gate pass/fail)
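
The first two steps of the flow above can be sketched as a small row builder. This is a minimal illustration, not the code in evaluator.py: `build_rows`, `retrieve`, and `generate` are hypothetical names, and the two callables stand in for the haiven-knowledge search call and the GLM answer call.

```python
from typing import Callable


def build_rows(
    golden: list[dict],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> list[dict]:
    """Assemble one evaluation row per golden question (steps 1 and 2)."""
    rows = []
    for item in golden:
        question = item["question"]
        contexts = retrieve(question)          # step 1: fetch context chunks
        answer = generate(question, contexts)  # step 2: generate the RAG answer
        rows.append(
            {
                "question": question,
                "contexts": contexts,
                "ground_truth": item["ground_truth"],
                "answer": answer,
            }
        )
    return rows
```

Injecting the two callables keeps the sketch testable without network access; the resulting rows carry the question/contexts/ground_truth/answer fields that step 3 hands to the metrics.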

haiven-ragas has no public domain and is only reachable from containers on the web Docker network.

RAGAS Metrics

Context Precision

Measures whether retrieved context chunks are relevant to the ground-truth answer. The judge LLM scores each chunk as relevant or irrelevant; position-weighted averaging produces the final score.

Context Precision does not measure answer quality — it measures whether the right context was retrieved.
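
To make the position weighting concrete, here is a minimal sketch that assumes the judge's per-chunk verdicts are already available as a ranked list of 0/1 flags. The function name is hypothetical and this is not the RAGAS implementation; it only follows the usual definition, where precision@k is averaged over the ranks that hold a relevant chunk.

```python
def context_precision(relevance: list[int]) -> float:
    """Position-weighted precision over ranked chunks (1 = relevant, 0 = not)."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0
```

Under this sketch, `context_precision([1, 0, 1])` is about 0.83 while `context_precision([0, 0, 1])` is about 0.33: the same number of retrieved chunks scores lower when the relevant ones are ranked late.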

Faithfulness

Measures whether the generated answer is factually grounded in the retrieved contexts. The judge LLM decomposes the answer into claims and checks each claim against the context.

A low Faithfulness score indicates the RAG answer is hallucinating — generating content not present in the retrieved chunks.
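
As a rough illustration, the score reduces to the fraction of claims the judge marks as supported. The sketch below assumes the claim verdicts are already available as booleans; the function name is hypothetical, and returning NaN for an answer with no claims is a choice made for this example only.

```python
import math


def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved contexts support."""
    if not claim_supported:
        return math.nan  # assumption: an answer with no claims is unscorable
    return sum(claim_supported) / len(claim_supported)
```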

Golden Dataset

The golden dataset consists of 78 question/ground-truth pairs organized across five evaluation domains:

Domain    Questions  Focus
business  15         Infrastructure ops: GPU layout, service ports, storage, networking
learning  39         LLM config, vLLM, embedding, STT/TTS, knowledge pipeline, RAGAS
meta      12         Operator workflow, MCP tools, ingestion patterns, API usage
personal  6          Meeting notes, calendar, task integration
creative  6          TTS, STT audio handling, image generation

Selected examples:
- "What GPU runs GLM-4.7-Flash?" → "Alpha GPU (GPU-eef2a28f, GPU index 3)"
- "What vLLM max-num-seqs is set for GLM?" → "4"
- "Why does GLM-4.7-Flash require a CodeFenceStrippingChatOpenAI wrapper in RAGAS?" → (full explanation)

The full dataset is defined in /mnt/apps/src/haiven-ragas/src/golden_dataset.py.

Note: The service was initially deployed against 25 questions (the business and part of learning domains). The dataset has since grown to 78 questions. The quality gate baseline of 0.672 was established at 25 questions.

Judge Model: GLM-4.7-Flash

GLM-4.7-Flash is used for two roles in each evaluation run:

  1. RAG answer generation — called directly via LiteLLM with enable_thinking: False to get answers in content without reasoning overhead
  2. RAGAS judge — called via CodeFenceStrippingChatOpenAI wrapper for ContextPrecision and Faithfulness metric scoring

GLM Behavior: Code Fence Wrapping

GLM-4.7-Flash consistently wraps its JSON output in markdown code fences (```json ... ```). RAGAS expects raw JSON. The CodeFenceStrippingChatOpenAI class in evaluator.py subclasses langchain_openai.ChatOpenAI and strips these fences in both _generate and _agenerate paths.

If the judge model is changed to a different LLM, verify the new model returns raw JSON or add a similar wrapper.
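
The stripping step itself can be sketched as a plain string function. This is an illustrative stand-in, not the code in evaluator.py: `strip_code_fences` is a hypothetical helper, and a wrapper for a different judge model would need to apply something similar to the model's text output before RAGAS parses it.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly so the example nests cleanly in docs

# Optional language tag after the opening fence, then the payload, then the closing fence.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fences(text: str) -> str:
    """Return the payload inside a single wrapping code fence, or the text unchanged."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()
```

Output that is already raw JSON passes through untouched apart from whitespace trimming, so the helper is safe to apply unconditionally.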

GLM Behavior: max-num-seqs Limit

The vLLM instance serving GLM is configured with max-num-seqs=4. RAGAS defaults to more concurrent workers than this handles cleanly.

Fix applied: RunConfig(max_workers=2, timeout=300, max_retries=10, max_wait=60) — keeps concurrent LLM calls at 2, well within GLM's capacity.
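
If max-num-seqs changes, the worker count can be derived rather than hard-coded. The sketch below encodes the "at most half of max-num-seqs" rule given under Known Issues; `safe_max_workers` is a hypothetical helper, not part of the service.

```python
def safe_max_workers(max_num_seqs: int) -> int:
    """Cap RAGAS concurrency at half of the vLLM max-num-seqs, never below 1."""
    return max(1, max_num_seqs // 2)


# With the current setting this reproduces the value used in the fix above:
#   RunConfig(max_workers=safe_max_workers(4), timeout=300, max_retries=10, max_wait=60)
```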

GLM Behavior: RAG Answer Generation

GLM is called with extra_body: {"enable_thinking": False} so answers land in content. If content is None (rare GLM edge case), the code falls back to reasoning_content. This differs from Seed-36B, which uses chat_template_kwargs: {"thinking_budget": 0} — do not confuse the two parameter names.
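
The fallback can be sketched as follows, assuming the response message is available as a dict with OpenAI-style `content` and `reasoning_content` keys. `extract_answer` is a hypothetical name; the real logic lives in evaluator.py and may differ in detail.

```python
def extract_answer(message: dict) -> str:
    """Prefer `content`; fall back to `reasoning_content` when content is None."""
    content = message.get("content")
    if content is not None:
        return content
    # Rare GLM edge case: the answer landed in the reasoning field instead.
    return message.get("reasoning_content") or ""
```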

RAGAS 0.4.3: Score Handling

RAGAS 0.4.3 returns result["context_precision"] and result["faithfulness"] as List[float] (one score per question row), not a scalar mean. Earlier RAGAS versions returned a scalar directly — this is a breaking change.

Required handling (from evaluator.py):

import math

cp_scores_list = result["context_precision"]           # List[float], not float
cp_valid = [s for s in cp_scores_list if not math.isnan(s)]   # drop rows the judge could not score
context_precision = sum(cp_valid) / len(cp_valid) if cp_valid else 0.0   # mean of valid rows only

NaN scores occur when the judge LLM fails to return parseable JSON. The service:
- Logs a warning for each NaN question
- Averages only valid scores
- Reports scored_count vs question_count in the result JSON
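
The three behaviours above can be combined in one helper. This is a sketch under stated assumptions: `aggregate_scores` and the `score` key are hypothetical, and the service may compute scored_count across both metrics rather than per metric.

```python
import math


def aggregate_scores(scores: list[float]) -> dict:
    """Average the non-NaN scores and report how many rows were actually scored."""
    valid = [s for s in scores if not math.isnan(s)]
    return {
        "score": sum(valid) / len(valid) if valid else 0.0,
        "scored_count": len(valid),
        "question_count": len(scores),
    }
```

A gap between scored_count and question_count, such as the 68/78 in the quality gate history, means the average was taken over fewer rows than the dataset contains.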

Configuration

All settings use the RAGAS_ prefix (loaded from environment variables or .env):

Variable                   Default                       Description
RAGAS_KNOWLEDGE_URL        http://haiven-knowledge:8022  haiven-knowledge search endpoint
RAGAS_LITELLM_URL          http://litellm:4000           LiteLLM proxy URL
RAGAS_LITELLM_API_KEY      ""                            LiteLLM master key
RAGAS_JUDGE_MODEL          glm-4-7-flash                 Model for RAG answer generation and RAGAS judging
RAGAS_EMBED_MODEL          qwen3-embedding-4b            Embedding model (metadata, not used by current metrics)
RAGAS_RESULTS_PATH         /data/results                 Directory for result JSON files
RAGAS_LANGFUSE_HOST        http://langfuse-web:3000      Langfuse tracing endpoint
RAGAS_LANGFUSE_PUBLIC_KEY  ""                            Langfuse public key
RAGAS_LANGFUSE_SECRET_KEY  ""                            Langfuse secret key
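
To show how the prefix maps onto settings, here is a standard-library stand-in for the Pydantic settings in config.py. It is only an illustration: the lower-case field names and the `load_settings` helper are assumptions, and only a subset of the variables is included.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the table above (subset).
DEFAULTS = {
    "knowledge_url": "http://haiven-knowledge:8022",
    "litellm_url": "http://litellm:4000",
    "judge_model": "glm-4-7-flash",
    "results_path": "/data/results",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve each setting from RAGAS_<NAME>, falling back to its default."""
    source = os.environ if env is None else env
    return {
        name: source.get("RAGAS_" + name.upper(), default)
        for name, default in DEFAULTS.items()
    }
```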

API Endpoints

Method  Endpoint         Description
GET     /health          Liveness probe; returns {"status":"ok"}
POST    /evaluate        Run RAGAS evaluation (synchronous, 10-20 min)
GET     /results/latest  Return the most recent result JSON

See openapi.yaml for full request/response schemas.
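
A client call to POST /evaluate might look like the sketch below. Several details are assumptions: the `label` request field is inferred from the labelled result filenames, the `haiven-ragas` hostname in the usage comment is a guess based on the container name, and openapi.yaml remains the authoritative schema. The 25-minute timeout follows the guidance under Known Issues.

```python
import json
import urllib.request
from typing import Optional

EVALUATE_TIMEOUT_S = 25 * 60  # evaluation is synchronous and can take 10-20 minutes


def build_evaluate_request(base_url: str, label: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /evaluate request."""
    body = {"label": label} if label else {}  # "label" field name is an assumption
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/evaluate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, from a container on the web network (hostname is an assumption):
# req = build_evaluate_request("http://haiven-ragas:8470", label="nightly")
# with urllib.request.urlopen(req, timeout=EVALUATE_TIMEOUT_S) as resp:
#     result = json.load(resp)
```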

Quality Gate Thresholds

Metric             Hard Gate  Aspirational Target
Context Precision  >= 0.65    >= 0.72
Faithfulness       >= 0.50    >= 0.85

Both metrics must pass the hard gate for quality_gate.passed to be true. The aspirational targets are informational only — reported in quality gate runs but not enforced.

The quality_gate module (src/quality_gate.py) and regression module (src/regression.py) provide CLI tools for scripted quality gate enforcement and baseline comparison. See USER_GUIDE.md for usage.
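
The hard-gate rule can be sketched in a few lines. This is not the code in src/quality_gate.py; `check_quality_gate` and the shape of its return value are hypothetical, loosely modelled on the per-metric threshold, score, and pass/fail fields that the result JSON is described as containing.

```python
HARD_GATES = {"context_precision": 0.65, "faithfulness": 0.50}


def check_quality_gate(scores: dict) -> dict:
    """Pass only if every metric meets its hard gate; aspirational targets are ignored."""
    metrics = {
        name: {
            "threshold": threshold,
            "score": scores[name],
            "passed": scores[name] >= threshold,
        }
        for name, threshold in HARD_GATES.items()
    }
    return {"passed": all(m["passed"] for m in metrics.values()), "metrics": metrics}
```

With scores of 0.4496 and 0.9470, the sketch reports Faithfulness as passed and the overall gate as failed, because Context Precision is below 0.65.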

Quality Gate History

Date        Context Precision  Faithfulness  Questions  Result
2026-02-27  0.4496             0.9470        68/78      FAILED (CP below 0.65; dataset expanded to 78q)
2026-02-24  0.672              -             25/25      PASSED (original 25q baseline)

The 0.672 Context Precision baseline was achieved on the original 25-question dataset after ingesting MEMORY.md into haiven-knowledge. When the dataset was expanded to 78 questions across 5 domains, Context Precision dropped to 0.4496 — indicating knowledge gaps in the expanded question set. Faithfulness measured 0.9470, well above the 0.50 gate, showing the RAG pipeline answers faithfully from whatever context it retrieves.

Dependencies

Service                                               Role
haiven-knowledge (8022)                               Provides retrieval contexts via POST /v1/search
LiteLLM (4000)                                        Routes judge LLM calls to GLM-4.7-Flash
GLM-4.7-Flash (vllm-glm-flash, port 6000, Alpha GPU)  Judge model and RAG answer generator
Langfuse (optional)                                   Trace logging

Results Storage

Results are saved to /mnt/apps/data/ragas/ (host) → /data/results (container). Filenames use ISO timestamps:

2026-02-24T18-30-00.000.json
2026-02-24T18-30-00.000-post-memory-md-ingest.json   # when label is set
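
The naming pattern in these two examples can be reproduced with a short helper. This is a sketch inferred from the example filenames only; `result_filename` is hypothetical and the service may build its names differently.

```python
from datetime import datetime
from typing import Optional


def result_filename(ts: datetime, label: Optional[str] = None) -> str:
    """ISO-like timestamp with ':' replaced by '-', millisecond precision, optional label."""
    stamp = ts.strftime("%Y-%m-%dT%H-%M-%S") + f".{ts.microsecond // 1000:03d}"
    suffix = f"-{label}" if label else ""
    return f"{stamp}{suffix}.json"
```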

Each result JSON contains:
- timestamp, label, judge_model, embed_model, knowledge_url
- question_count, scored_count
- scores: {context_precision: float, faithfulness: float}
- quality_gate: per-metric threshold, score, and pass/fail
- per_question: list of per-question detail including contexts preview, generated answer, and individual metric scores

Quality gate runs are saved separately as quality-gate-<YYYY-MM-DD-HHMM>.json in the same directory.

Regression reports are saved to /data/results/regression/ as <timestamp>-regression-report.md and .json.

Deployment

# Start
cd /mnt/apps/docker/ai/haiven-ragas
docker compose up -d

# Rebuild after source code changes
docker compose up -d --build

# View logs
docker logs -f haiven-ragas

# Health check
curl http://localhost:8470/health

Known Issues and Gotchas

  1. GLM wraps JSON in code fences — handled automatically by CodeFenceStrippingChatOpenAI. If you switch judge models, verify the new model returns raw JSON.

  2. RAGAS 0.4.3 returns List[float], not scalar — never use float(result["context_precision"]) directly. Always filter NaN and average manually.

  3. RunConfig max_workers must stay at 2 — increasing it overwhelms GLM's max-num-seqs=4 and causes timeouts. If GLM's max-num-seqs is increased, max_workers can be increased proportionally (to at most half of max-num-seqs).

  4. Evaluation is synchronous and slow — 10-20 minutes for the full golden dataset (answer generation + two RAGAS metric passes). Do not set HTTP client timeouts below 25 minutes.

  5. No context = score 0 — when haiven-knowledge returns no results for a question, a [no context retrieved] placeholder forces a score of 0. This surfaces retrieval gaps in the metric rather than silently skipping rows.

  6. No healthcheck in compose — the container has no Docker healthcheck defined. docker ps shows Up immediately after start. The service is ready within a few seconds of startup (no heavy model loading).
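
The placeholder behaviour from item 5 can be sketched as follows. The helper name is hypothetical; only the placeholder string comes from the description above.

```python
NO_CONTEXT_PLACEHOLDER = "[no context retrieved]"


def contexts_or_placeholder(chunks: list[str]) -> list[str]:
    """Keep the row in the evaluation even when retrieval returned nothing."""
    return chunks if chunks else [NO_CONTEXT_PLACEHOLDER]
```

Keeping the row with a placeholder context, instead of dropping it, is what lets a retrieval gap pull the average down rather than disappear from it.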

Source Code Structure

/mnt/apps/src/haiven-ragas/
├── Dockerfile                 # python:3.12-slim, port 8470
├── requirements.txt           # ragas, fastapi, langchain-openai, datasets
└── src/
    ├── main.py                # FastAPI app, request/response models, endpoints
    ├── evaluator.py           # RAGAS evaluation logic, CodeFenceStrippingChatOpenAI
    ├── config.py              # Pydantic settings (RAGAS_ prefix)
    ├── golden_dataset.py      # 78 question/ground_truth pairs across 5 domains
    ├── quality_gate.py        # CLI: enforce hard gates, save report to disk
    └── regression.py          # CLI: baseline comparison, per-question delta report

Service                       Relationship
haiven-knowledge (port 8022)  Subject of evaluation; its retrieval quality is what is being measured
haiven-reranker (port 8460)   Used by haiven-knowledge; reranker quality affects Context Precision
LiteLLM (port 4000)           Routes all LLM calls for both answer generation and RAGAS judging
GLM-4.7-Flash (port 6000)     Judge model and RAG answer generator

Documentation