haiven-ragas User Guide

Practical guide for running RAGAS evaluations, reading scores, improving haiven-knowledge quality, and using the quality gate and regression tools.

Quick Start

# Check the service is up
curl http://localhost:8470/health
# {"status":"ok"}

# Run a full evaluation (blocks for 10-20 minutes)
curl -s -X POST http://localhost:8470/evaluate \
  -H "Content-Type: application/json" \
  -d '{"dataset": "golden"}' | jq '{scores, quality_gate}'

# Get the most recent result
curl -s http://localhost:8470/results/latest | jq '{scores, quality_gate}'

When to Run an Evaluation

Run an evaluation when you:
- Ingest new documents into haiven-knowledge
- Change the embedding model or Qdrant collection
- Tune search parameters (score threshold, top-k limit, reranker settings)
- Want to verify the pipeline still meets the quality gate after infrastructure changes
- Are debugging poor retrieval quality on specific topics

A single evaluation produces two scores you can directly compare against the established baselines.

Running an Evaluation

Basic run

curl -s -X POST http://localhost:8470/evaluate \
  -H "Content-Type: application/json" \
  -d '{"dataset": "golden"}' | jq .

Labeled run

Attach a label to correlate the result with a specific change. The label appears in the result filename and the saved JSON:

curl -s -X POST http://localhost:8470/evaluate \
  -H "Content-Type: application/json" \
  -d '{"dataset": "golden", "label": "after-qdrant-reindex"}' | jq .

Result filename: 2026-02-24T18-30-00.000-after-qdrant-reindex.json

Background run

The endpoint blocks for the full evaluation duration (10-20 minutes). To run in the background:

curl -s -X POST http://localhost:8470/evaluate \
  -H "Content-Type: application/json" \
  -d '{"dataset": "golden", "label": "my-change"}' \
  > /tmp/ragas-result.json &
echo "Running in background — PID $!"

Then check progress via logs:

docker logs -f haiven-ragas

Only the golden dataset is supported

Requesting any other dataset name returns HTTP 400:

{"detail": "Unknown dataset 'custom'. Only 'golden' is supported."}
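The same request can be scripted from Python. The following is a minimal sketch using only the standard library; the function names and the 30-minute client timeout are illustrative assumptions, not part of the service.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8470"  # service address used throughout this guide


def build_evaluate_request(label=None, dataset="golden", base_url=BASE_URL):
    """Build the POST /evaluate request without sending it."""
    payload = {"dataset": dataset}
    if label is not None:
        payload["label"] = label
    return urllib.request.Request(
        f"{base_url}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_evaluation(label=None, dataset="golden", timeout_s=1800):
    """Send the request and return the parsed JSON body.

    timeout_s is an assumed client-side limit (30 minutes), chosen to sit
    above the 10-20 minute run time; it is not a service setting.
    """
    request = build_evaluate_request(label=label, dataset=dataset)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return json.loads(response.read())
    except urllib.error.HTTPError as err:
        # An unsupported dataset name comes back as HTTP 400 with a "detail" field.
        detail = json.loads(err.read()).get("detail", "")
        raise RuntimeError(f"HTTP {err.code}: {detail}") from err
```

Calling run_evaluation(label="my-change") blocks for the full run, so a generous timeout matters more here than in a typical HTTP client.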

Reading the Response

A successful evaluation returns:

{
  "status": "ok",
  "scores": {
    "context_precision": 0.672,
    "faithfulness": 0.81
  },
  "quality_gate": {
    "passed": true,
    "metrics": {
      "context_precision": {
        "score": 0.672,
        "threshold": 0.65,
        "passed": true
      },
      "faithfulness": {
        "score": 0.81,
        "threshold": 0.50,
        "passed": true
      }
    }
  },
  "question_count": 78,
  "timestamp": "2026-02-28T10:00:00.000000+00:00",
  "result_file": "/data/results/2026-02-28T10-00-00.000.json",
  "label": null
}

quality_gate.passed is true only when both metrics pass their respective hard gate thresholds.
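As an illustration of that rule, the gate can be reproduced from the scores and thresholds shown in the response. This is a sketch of the logic, not the service's actual code; treating a missing score as a failure is an assumption.

```python
HARD_GATES = {"context_precision": 0.65, "faithfulness": 0.50}


def evaluate_quality_gate(scores, thresholds=HARD_GATES):
    """Return a quality_gate-shaped dict for the given aggregate scores."""
    metrics = {}
    for name, threshold in thresholds.items():
        score = scores.get(name)
        # Assumption: a missing or null score cannot pass the gate.
        passed = score is not None and score >= threshold
        metrics[name] = {"score": score, "threshold": threshold, "passed": passed}
    # The overall gate passes only if every individual metric passes.
    overall = all(entry["passed"] for entry in metrics.values())
    return {"passed": overall, "metrics": metrics}
```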

Understanding the Scores

Context Precision

Measures whether the chunks retrieved for each question are relevant to the ground-truth answer. The judge LLM scores each retrieved chunk as relevant or irrelevant; results are position-weighted (earlier chunks matter more).
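To make the position weighting concrete, here is a sketch of the commonly documented RAGAS formulation: the mean of precision@k taken over the ranks where a relevant chunk appears. This guide does not confirm which exact variant the service runs, so treat it as an approximation.

```python
def context_precision(relevance):
    """Position-weighted precision for one question.

    relevance: list of 0/1 judge verdicts, in retrieval rank order.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    weighted_sum = 0.0
    hits_so_far = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits_so_far += 1
            # precision@rank, counted only at ranks holding a relevant chunk
            weighted_sum += hits_so_far / rank
    return weighted_sum / total_relevant
```

Under this formulation a single relevant chunk scores 1.0 at rank 1 but only 1/3 at rank 3, which is why ordering matters as much as recall.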

Score          Interpretation
>= 0.80        Excellent: most retrieved chunks are directly relevant
0.65 – 0.79    Good: quality gate passes, retrieval is working well
0.50 – 0.64    Degraded: below gate threshold, investigate low-scoring questions
< 0.50         Poor: significant knowledge base or search configuration issue

Hard gate: >= 0.65 | Aspirational: >= 0.72 | Current baseline: 0.672 (2026-02-24)

Faithfulness

Measures whether the generated answer is factually grounded in the retrieved contexts. The judge LLM decomposes the answer into individual claims and verifies each claim against the context.
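In the usual RAGAS formulation, the per-question score is the fraction of claims the judge finds supported by the context. A minimal sketch follows; returning None when no claims were extracted is an assumption made here for illustration.

```python
def faithfulness(claim_verdicts):
    """Fraction of answer claims supported by the retrieved context.

    claim_verdicts: list of booleans, True when the judge found the
    claim supported.
    """
    if not claim_verdicts:
        # Assumption: no extracted claims means there is no score to report.
        return None
    return sum(claim_verdicts) / len(claim_verdicts)
```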

Score          Interpretation
>= 0.85        Excellent: generated answers closely follow retrieved context
0.70 – 0.84    Good: minor hallucination, acceptable
0.50 – 0.69    Marginal: quality gate passes but answers sometimes extend beyond context
< 0.50         Failing: answers regularly contradict or ignore the retrieved context

Hard gate: >= 0.50 | Aspirational: >= 0.85

A low Faithfulness score with high Context Precision means the right chunks are being retrieved but the answer generator is hallucinating. A low Context Precision score with high Faithfulness means the model is faithfully answering from the (wrong) chunks it did retrieve.
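That reading of the two scores can be written as a small triage helper. The sketch below uses the hard gate thresholds as the cut-off between "low" and "high", which is an assumption; the function name and return labels are hypothetical.

```python
def diagnose(context_precision, faithfulness, cp_gate=0.65, faith_gate=0.50):
    """Point at the pipeline stage most likely responsible for a low score."""
    retrieval_ok = context_precision >= cp_gate
    generation_ok = faithfulness >= faith_gate
    if retrieval_ok and generation_ok:
        return "healthy"
    if retrieval_ok:
        # Right chunks retrieved, but the answer drifts away from them.
        return "generation"
    if generation_ok:
        # The answer follows its context, but the context itself is off-target.
        return "retrieval"
    return "both"
```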

Viewing Results

Latest result

curl -s http://localhost:8470/results/latest | jq .

List result files on disk (newest first)

ls -lt /mnt/apps/data/ragas/*.json

Read a specific result file

cat /mnt/apps/data/ragas/2026-02-24T18-30-00.000.json | jq .
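For scripted use, the newest result file can be located and loaded from Python. This sketch mirrors `ls -lt` by ordering on modification time; the function names are illustrative, and it assumes the results directory is readable from where the script runs.

```python
import json
from pathlib import Path

RESULTS_DIR = Path("/mnt/apps/data/ragas")  # host path used in this guide


def latest_result_path(results_dir=RESULTS_DIR):
    """Return the most recently modified .json file, or None if there is none."""
    files = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
    return files[0] if files else None


def load_latest_result(results_dir=RESULTS_DIR):
    """Parse and return the newest result file, or None if none exist."""
    path = latest_result_path(results_dir)
    if path is None:
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

If quality gate reports live in the same directory, the glob will match them too, so a filename filter may be needed.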

View per-question scores from the latest result

curl -s http://localhost:8470/results/latest | \
  jq '.per_question[] | {question: .question[:60], context_precision, faithfulness}' | \
  head -60

Interpreting Per-Question Scores

The saved result JSON includes per_question detail for every golden question:

{
  "per_question": [
    {
      "question": "What GPU runs GLM-4.7-Flash?",
      "ground_truth": "Alpha GPU (GPU-eef2a28f, GPU index 3)",
      "context_count": 8,
      "contexts_preview": [
        "GPU-eef2a28f is the Alpha GPU running GLM-4.7-Flash...",
        "Alpha runs GLM-4.7-Flash and the embedding model..."
      ],
      "generated_answer": "GLM-4.7-Flash runs on the Alpha GPU (GPU-eef2a28f).",
      "context_precision": 0.92,
      "faithfulness": 1.0
    },
    {
      "question": "What database does work-hub use?",
      "ground_truth": "PostgreSQL",
      "context_count": 3,
      "contexts_preview": ["work-hub connects to...", "PostgreSQL handles..."],
      "generated_answer": "work-hub uses PostgreSQL.",
      "context_precision": 0.33,
      "faithfulness": 1.0
    }
  ]
}

A null score means the judge LLM failed to produce a parseable response for that row. The question is excluded from the average.
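The effect of excluding null rows can be reproduced from the per_question list. A minimal sketch, assuming null arrives as None after JSON parsing and also skipping NaN defensively:

```python
import math


def mean_score(per_question, metric):
    """Average one metric across questions, skipping null and NaN rows."""
    values = []
    for row in per_question:
        value = row.get(metric)
        if value is None:
            continue  # judge produced no parseable score for this row
        if isinstance(value, float) and math.isnan(value):
            continue  # defensive: treat NaN like a missing score
        values.append(value)
    if not values:
        return None  # nothing valid to average
    return sum(values) / len(values)
```

Because skipped rows shrink the denominator, a run with many null scores can report an average that rests on far fewer questions than question_count suggests.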

Finding low-scoring questions

curl -s http://localhost:8470/results/latest | \
  jq '.per_question | sort_by(.context_precision) | .[:5] | .[] | {question: .question[:70], context_precision, context_count}'

Common causes of low Context Precision:

Common causes of low Faithfulness:

Managing the Golden Dataset

The golden dataset is defined in source code at /mnt/apps/src/haiven-ragas/src/golden_dataset.py. It is embedded at container build time — changing it requires a rebuild.

Dataset structure

GOLDEN_QUESTIONS = [
    {
        "question": "What GPU runs GLM-4.7-Flash?",
        "ground_truth": "Alpha GPU (GPU-eef2a28f, GPU index 3)",
        "domain": "business",  # business | learning | meta | personal | creative
    },
    ...
]

Adding questions

  1. Add entries to GOLDEN_QUESTIONS in golden_dataset.py
  2. Rebuild the container: docker compose -f /mnt/apps/docker/ai/haiven-ragas/docker-compose.yml up -d --build
  3. Run an evaluation to establish a new baseline score for the expanded dataset
  4. Update the quality gate history in README.md

Guidance for good golden questions:
- Questions should have a single, factual answer that exists in the knowledge base
- Avoid questions with subjective answers or answers that change frequently
- Cover all major knowledge domains so the score reflects overall pipeline health
- A domain with no questions is invisible to the quality gate
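The coverage guideline can be checked mechanically before a rebuild. This sketch counts questions per domain using the five domain names from the dataset structure comment; the helper names are hypothetical.

```python
from collections import Counter

# Domain names taken from the comment in the dataset structure.
KNOWN_DOMAINS = ("business", "learning", "meta", "personal", "creative")


def domain_coverage(questions, domains=KNOWN_DOMAINS):
    """Count golden questions per domain, including domains with zero."""
    counts = Counter(question["domain"] for question in questions)
    return {domain: counts.get(domain, 0) for domain in domains}


def uncovered_domains(questions, domains=KNOWN_DOMAINS):
    """List the domains that no golden question exercises."""
    coverage = domain_coverage(questions, domains)
    return [domain for domain, count in coverage.items() if count == 0]
```

Running uncovered_domains(GOLDEN_QUESTIONS) after editing golden_dataset.py shows which domains the gate cannot see.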

Quality Gate CLI

The quality_gate.py module provides a CLI for scripted quality gate enforcement (e.g., in CI pipelines or pre-deployment checks):

# Run inside the container via docker exec (default thresholds)
docker exec haiven-ragas python -m src.quality_gate

# With custom thresholds
docker exec haiven-ragas python -m src.quality_gate \
  --context-precision-min 0.70 \
  --faithfulness-min 0.55

# Against a different service URL
docker exec haiven-ragas python -m src.quality_gate \
  --ragas-url http://localhost:8470

Exit codes:
- 0 — all metrics meet or exceed thresholds
- 1 — one or more metrics below threshold
- 2 — evaluation failed (service unreachable, timeout, etc.)

Reports are saved to /data/results/quality-gate-<YYYY-MM-DD-HHMM>.json.
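In a CI script, the exit codes above can be mapped to named outcomes. The wrapper below is a sketch: the outcome labels and function names are illustrative, and only the documented codes 0, 1, and 2 are interpreted.

```python
import subprocess
import sys

GATE_COMMAND = ["docker", "exec", "haiven-ragas", "python", "-m", "src.quality_gate"]

# Meanings follow the exit codes documented above.
EXIT_MEANINGS = {0: "pass", 1: "below-threshold", 2: "evaluation-failed"}


def run_gate(command=GATE_COMMAND):
    """Run the gate command and translate its exit code into a label."""
    completed = subprocess.run(command, check=False)
    return EXIT_MEANINGS.get(completed.returncode, "unknown")


def main():
    """Fail the calling pipeline unless the gate passed."""
    outcome = run_gate()
    print(f"quality gate: {outcome}")
    sys.exit(0 if outcome == "pass" else 1)
```

Keeping "below-threshold" and "evaluation-failed" distinct lets a pipeline retry on infrastructure failures without retrying on genuine quality drops.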

Regression CLI

The regression.py module compares a current evaluation run against a saved baseline and flags per-question regressions:

# First run: creates a baseline, exits 0
docker exec haiven-ragas python -m src.regression

# Second run: compares against baseline, produces regression report
docker exec haiven-ragas python -m src.regression

# Save current run as new baseline after a deliberate improvement
docker exec haiven-ragas python -m src.regression --save-baseline

# Compare against a specific baseline file
docker exec haiven-ragas python -m src.regression \
  --baseline /data/results/baselines/2026-02-24-ragas-baseline.json

# Tighten the regression threshold (default 0.05)
docker exec haiven-ragas python -m src.regression --threshold 0.03

Reports are saved to /data/results/regression/ as both .md and .json. A question is flagged as REGRESSED if its Context Precision or Faithfulness drops by more than the threshold compared to baseline.

Exit codes:
- 0 — no regressions (or baseline created)
- 1 — at least one question regressed beyond threshold
- 2 — evaluation failed
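The regression rule can be sketched as a comparison of two per_question lists. This is not the regression.py implementation; matching rows by question text and skipping rows with null scores are assumptions.

```python
METRICS = ("context_precision", "faithfulness")


def find_regressions(baseline_rows, current_rows, threshold=0.05):
    """Return one entry per (question, metric) that dropped by more than threshold."""
    baseline_by_question = {row["question"]: row for row in baseline_rows}
    regressions = []
    for row in current_rows:
        base = baseline_by_question.get(row["question"])
        if base is None:
            continue  # question is new, nothing to compare against
        for metric in METRICS:
            before, after = base.get(metric), row.get(metric)
            if before is None or after is None:
                continue  # assumption: null scores are not compared
            drop = before - after
            if drop > threshold:
                regressions.append({
                    "question": row["question"],
                    "metric": metric,
                    "baseline": before,
                    "current": after,
                    "drop": round(drop, 4),
                })
    return regressions
```

An empty list corresponds to exit code 0, and any entry corresponds to exit code 1.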

Iterating on Quality

The standard improvement cycle:

  1. Run evaluation, note overall scores and any per-question low scorers.
  2. Identify the 3-5 lowest-scoring questions by Context Precision.
  3. Debug retrieval for those questions using the knowledge search endpoint.
  4. If knowledge is missing: ingest the relevant document or add a direct text entry.
  5. Re-run evaluation with a descriptive label:
    curl -s -X POST http://localhost:8470/evaluate \
      -H "Content-Type: application/json" \
      -d '{"dataset": "golden", "label": "added-gpu-docs"}' | jq .scores
  6. Compare context_precision and faithfulness to the previous run.
  7. If scores improved and you want to track regression from this new state: save a baseline.

What moved the score to 0.672 (historical example):

Before 2026-02-24, questions about GPU assignments ("What GPU runs GLM-4.7-Flash?", "What GPU runs Seed-36B?") were scoring near 0 because haiven-knowledge lacked the specific GPU UUID-to-service mappings. Ingesting MEMORY.md (which contains 11 chunks of dense infrastructure facts) provided the exact context these questions needed. Context Precision jumped from the low-0.5 range to 0.672.

Service Management

# Start
cd /mnt/apps/docker/ai/haiven-ragas && docker compose up -d

# Stop
docker compose down

# Rebuild after source code changes (golden dataset, evaluator logic)
docker compose up -d --build

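# Restart without rebuild (does not reload environment variables)
docker restart haiven-ragas

# Apply .env or docker-compose.yml changes without rebuild (recreates the container)
docker compose up -d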

# View logs (live)
docker logs -f haiven-ragas

# View recent logs
docker logs --tail 100 haiven-ragas

Troubleshooting

Evaluation returns 500 — "RAGAS evaluate() failed"

Check that LiteLLM and GLM are reachable:

docker logs haiven-ragas --tail 50
curl -s http://localhost:4000/health
curl -s http://localhost:6000/health

All scores are NaN

The judge LLM returned unparseable output for every question. Most common causes:
- LiteLLM is routing to a different model than expected
- GLM is returning a response format the code fence stripper can't handle (check raw GLM output)
- Network issue between haiven-ragas and LiteLLM

Check raw GLM output:

curl -s -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep '^RAGAS_LITELLM_API_KEY=' /mnt/apps/docker/ai/haiven-ragas/.env | cut -d= -f2-)" \
  -d '{"model":"glm-4-7-flash","messages":[{"role":"user","content":"Return JSON: {\"answer\":\"test\"}"}],"max_tokens":50,"extra_body":{"enable_thinking":false}}' \
  | jq '.choices[0].message'

If content is null and the answer is in reasoning_content, thinking mode may be enabled unexpectedly. Verify enable_thinking: false is being passed through LiteLLM.
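The check described above can be applied to the message object returned by the curl command. A small sketch, with hypothetical outcome labels, using only the content and reasoning_content fields named in this section:

```python
def check_judge_message(message):
    """Classify a chat completion message from the judge model."""
    content = message.get("content")
    reasoning = message.get("reasoning_content")
    if content:
        return "ok"  # the answer is where the parser expects it
    if reasoning:
        # Answer ended up in the reasoning field: thinking mode is likely on.
        return "thinking-leak"
    return "empty"  # neither field carries text
```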

Scores drop significantly after a change

If Context Precision drops more than 0.1 after a change to haiven-knowledge:

  1. Look at per-question scores to identify which questions regressed.
  2. Run the regression CLI to get a structured diff against the previous baseline.
  3. Check whether a Qdrant collection change removed previously indexed chunks.
  4. Verify the score threshold default (0.3) hasn't changed — a higher threshold means fewer chunks returned.
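To see why step 4 matters, consider how a similarity cut-off thins out the retrieved set. The sketch below is illustrative only: the "score" field name and the default limit of 8 are assumptions, not the haiven-knowledge search implementation.

```python
def filter_hits(hits, score_threshold=0.3, limit=8):
    """Keep hits at or above the threshold, best first, capped at limit."""
    kept = [hit for hit in hits if hit["score"] >= score_threshold]
    kept.sort(key=lambda hit: hit["score"], reverse=True)
    return kept[:limit]
```

With hits scoring 0.9, 0.45, 0.31, and 0.2, the default threshold of 0.3 keeps three chunks, while raising it to 0.5 keeps only one.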

Evaluation hangs indefinitely

A full run with 78 questions takes 10-20 minutes. If it exceeds 30 minutes, check:

# Is GLM responsive?
curl -s -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $RAGAS_LITELLM_API_KEY" \
  -d '{"model":"glm-4-7-flash","messages":[{"role":"user","content":"hi"}],"max_tokens":10,"extra_body":{"enable_thinking":false}}' \
  | jq '.choices[0].message'

# Is haiven-knowledge responsive?
curl -s -X POST http://localhost:8022/v1/search \
  -H "Content-Type: application/json" \
  -d '{"query":"test","limit":3}' | jq .

If GLM is slow (GPU under load from other requests), the evaluation will slow down proportionally. RunConfig(max_workers=2) limits concurrency, but each individual LLM call can still take 60-120 seconds under load.

404 on /results/latest

The results directory /mnt/apps/data/ragas/ may not exist yet or may contain no .json files. Run a first evaluation to create it.

ls /mnt/apps/data/ragas/

API Reference

Endpoint          Method   Description
/health           GET      Liveness probe, returns {"status":"ok"}
/evaluate         POST     Run RAGAS evaluation (synchronous, 10-20 min)
/results/latest   GET      Return the most recent result JSON

See openapi.yaml for full request/response schemas and example values.