Practical guide for running RAGAS evaluations, reading scores, improving haiven-knowledge quality, and using the quality gate and regression tools.
# Check the service is up
curl http://localhost:8470/health
# {"status":"ok"}
# Run a full evaluation (blocks 10-20 minutes)
curl -s -X POST http://localhost:8470/evaluate \
-H "Content-Type: application/json" \
-d '{"dataset": "golden"}' | jq '{scores, quality_gate}'
# Get the most recent result
curl -s http://localhost:8470/results/latest | jq '{scores, quality_gate}'
Run an evaluation when you:
- Ingest new documents into haiven-knowledge
- Change the embedding model or Qdrant collection
- Tune search parameters (score threshold, top-k limit, reranker settings)
- Want to verify the pipeline still meets the quality gate after infrastructure changes
- Are debugging poor retrieval quality on specific topics
A single evaluation produces two scores you can directly compare against the established baselines.
curl -s -X POST http://localhost:8470/evaluate \
-H "Content-Type: application/json" \
-d '{"dataset": "golden"}' | jq .
Attach a label to correlate the result with a specific change. The label appears in the result filename and the saved JSON:
curl -s -X POST http://localhost:8470/evaluate \
-H "Content-Type: application/json" \
-d '{"dataset": "golden", "label": "after-qdrant-reindex"}' | jq .
Result filename: 2026-02-24T18-30-00.000-after-qdrant-reindex.json
The endpoint blocks for the full evaluation duration (10-20 minutes). To run in the background:
curl -s -X POST http://localhost:8470/evaluate \
-H "Content-Type: application/json" \
-d '{"dataset": "golden", "label": "my-change"}' \
> /tmp/ragas-result.json &
echo "Running in background — PID $!"
Then check progress via logs:
docker logs -f haiven-ragas
Only the golden dataset is supported. Requesting any other dataset name returns HTTP 400:
{"detail": "Unknown dataset 'custom'. Only 'golden' is supported."}
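Because the endpoint blocks for the whole run, a scripted client needs a timeout well above the documented 10-20 minute range. The following is a minimal standard-library sketch of such a client. The helper names `build_evaluate_request` and `run_evaluation` and the 30-minute timeout are illustrative assumptions, not part of haiven-ragas.

```python
import json
import urllib.error
import urllib.request

RAGAS_URL = "http://localhost:8470"  # service address used throughout this guide


def build_evaluate_request(label=None, dataset="golden"):
    """Build the POST /evaluate request; the label is optional."""
    payload = {"dataset": dataset}
    if label is not None:
        payload["label"] = label
    return urllib.request.Request(
        f"{RAGAS_URL}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_evaluation(label=None, timeout_s=1800):
    """Send the request and return the parsed result (hypothetical helper).

    timeout_s is an assumed value chosen to sit above the 10-20 minute range.
    """
    request = build_evaluate_request(label)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        # An unsupported dataset name returns HTTP 400 with a "detail" message.
        detail = json.loads(err.read().decode("utf-8")).get("detail", "")
        raise RuntimeError(f"evaluate failed ({err.code}): {detail}") from err
```

Calling `run_evaluation(label="my-change")` would block until the service responds, so it is best run from a job runner or a background process rather than an interactive shell.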
A successful evaluation returns:
{
"status": "ok",
"scores": {
"context_precision": 0.672,
"faithfulness": 0.81
},
"quality_gate": {
"passed": true,
"metrics": {
"context_precision": {
"score": 0.672,
"threshold": 0.65,
"passed": true
},
"faithfulness": {
"score": 0.81,
"threshold": 0.50,
"passed": true
}
}
},
"question_count": 78,
"timestamp": "2026-02-28T10:00:00.000000+00:00",
"result_file": "/data/results/2026-02-28T10-00-00.000.json",
"label": null
}
quality_gate.passed is true only when both metrics pass their respective hard gate thresholds.
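The gate logic can be reproduced offline from any saved `scores` object. This is a sketch of the rule as described here, not the service's own code. The function name `evaluate_gate` is hypothetical, and treating a missing score as a failure is an assumption.

```python
# Hard gate thresholds quoted in this guide.
THRESHOLDS = {"context_precision": 0.65, "faithfulness": 0.50}


def evaluate_gate(scores, thresholds=THRESHOLDS):
    """Return a quality_gate-shaped dict for the given scores (illustrative)."""
    metrics = {}
    for name, threshold in thresholds.items():
        score = scores.get(name)
        metrics[name] = {
            "score": score,
            "threshold": threshold,
            # Assumption: a missing score counts as a failed metric.
            "passed": score is not None and score >= threshold,
        }
    # The overall gate passes only when every metric passes.
    return {"passed": all(m["passed"] for m in metrics.values()), "metrics": metrics}
```

With the example scores above (0.672 and 0.81) the sketch reports `passed: True`; lowering context_precision to 0.64 flips the overall result to `False` even though faithfulness still passes.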
Measures whether the chunks retrieved for each question are relevant to the ground-truth answer. The judge LLM scores each retrieved chunk as relevant or irrelevant; results are position-weighted (earlier chunks matter more).
| Score | Interpretation |
|---|---|
| >= 0.80 | Excellent — most retrieved chunks are directly relevant |
| 0.65 – 0.79 | Good — quality gate passes, retrieval is working well |
| 0.50 – 0.64 | Degraded — below gate threshold, investigate low-scoring questions |
| < 0.50 | Poor — significant knowledge base or search configuration issue |
Hard gate: >= 0.65 | Aspirational: >= 0.72 | Current baseline: 0.672 (2026-02-24)
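To make the position weighting concrete, the sketch below averages precision@k over the ranks that hold a relevant chunk, which is the formula commonly given for RAGAS Context Precision. Whether haiven-ragas uses exactly this variant is an assumption, and the function name is illustrative.

```python
def context_precision(relevance):
    """Position-weighted precision for one question.

    relevance: list of 0/1 judge verdicts for the retrieved chunks, in rank order.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, counted only at ranks that hold a relevant chunk
            weighted_sum += hits / rank
    # With no relevant chunk at all, the score is defined here as 0.0.
    return weighted_sum / hits if hits else 0.0
```

Under this formula a single relevant chunk scores 1.0 when it is ranked first (`[1, 0, 0]`) but only about 0.33 when it is ranked third (`[0, 0, 1]`), which is why reranker and top-k changes can move the metric without any new documents being ingested.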
Measures whether the generated answer is factually grounded in the retrieved contexts. The judge LLM decomposes the answer into individual claims and verifies each claim against the context.
| Score | Interpretation |
|---|---|
| >= 0.85 | Excellent — generated answers closely follow retrieved context |
| 0.70 – 0.84 | Good — minor hallucination, acceptable |
| 0.50 – 0.69 | Marginal — quality gate passes but answers sometimes extend beyond context |
| < 0.50 | Failing — answers are regularly contradicting or ignoring the retrieved context |
Hard gate: >= 0.50 | Aspirational: >= 0.85
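The claim-level check described for Faithfulness reduces to a simple ratio once the judge has returned one verdict per claim. A minimal sketch, assuming boolean verdicts; the function name and the choice to return `None` for an answer with no claims are illustrative.

```python
def faithfulness(claim_verdicts):
    """Fraction of answer claims that the retrieved context supports.

    claim_verdicts: one boolean per extracted claim, True if supported.
    """
    if not claim_verdicts:
        return None  # assumption: no claims means there is nothing to score
    return sum(claim_verdicts) / len(claim_verdicts)
```

An answer with four claims, one of them unsupported, scores 0.75 under this sketch.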
A low Faithfulness score with high Context Precision means the right chunks are being retrieved but the answer generator is hallucinating. A low Context Precision score with high Faithfulness means the model is faithfully answering from the (wrong) chunks it did retrieve.
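The two-metric reading above can be turned into a small triage helper. This is only a sketch: using the hard gate thresholds as the cutoff between "low" and "high" is an assumption, and the labels it returns are made up for illustration.

```python
def diagnose(context_precision, faithfulness, cp_gate=0.65, f_gate=0.50):
    """Point at the pipeline stage most likely responsible for a bad run."""
    retrieval_ok = context_precision >= cp_gate
    generation_ok = faithfulness >= f_gate
    if retrieval_ok and generation_ok:
        return "healthy"
    if retrieval_ok:
        # Right chunks retrieved, but the answer strays from them.
        return "generation"
    if generation_ok:
        # Answers follow the context, but the context itself is off-target.
        return "retrieval"
    return "both"
```

For example, `diagnose(0.72, 0.40)` returns `"generation"`, while `diagnose(0.55, 0.90)` returns `"retrieval"`.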
curl -s http://localhost:8470/results/latest | jq .
ls -lt /mnt/apps/data/ragas/*.json
cat /mnt/apps/data/ragas/2026-02-24T18-30-00.000.json | jq .
curl -s http://localhost:8470/results/latest | \
jq '.per_question[] | {question: .question[:60], context_precision, faithfulness}' | \
head -60
The saved result JSON includes per_question detail for every golden question:
{
"per_question": [
{
"question": "What GPU runs GLM-4.7-Flash?",
"ground_truth": "Alpha GPU (GPU-eef2a28f, GPU index 3)",
"context_count": 8,
"contexts_preview": [
"GPU-eef2a28f is the Alpha GPU running GLM-4.7-Flash...",
"Alpha runs GLM-4.7-Flash and the embedding model..."
],
"generated_answer": "GLM-4.7-Flash runs on the Alpha GPU (GPU-eef2a28f).",
"context_precision": 0.92,
"faithfulness": 1.0
},
{
"question": "What database does work-hub use?",
"ground_truth": "PostgreSQL",
"context_count": 3,
"contexts_preview": ["work-hub connects to...", "PostgreSQL handles..."],
"generated_answer": "work-hub uses PostgreSQL.",
"context_precision": 0.33,
"faithfulness": 1.0
}
]
}
A null score means the judge LLM failed to produce a parseable response for that row. The question is excluded from the average.
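The null handling can be checked against a saved result by recomputing the averages from `per_question`. A minimal sketch, assuming the result JSON has already been loaded into a Python list of rows; the function names are illustrative.

```python
def mean_score(per_question, metric):
    """Average a metric over rows, skipping rows where the score is null."""
    values = [row[metric] for row in per_question if row.get(metric) is not None]
    return sum(values) / len(values) if values else None


def null_count(per_question, metric):
    """Number of rows where the judge produced no parseable score."""
    return sum(1 for row in per_question if row.get(metric) is None)
```

A high `null_count` is worth investigating on its own, since the reported average then rests on fewer questions than `question_count` suggests.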
curl -s http://localhost:8470/results/latest | \
jq '.per_question | sort_by(.context_precision) | .[:5] | .[] | {question: .question[:70], context_precision, context_count}'
Common causes of low Context Precision:
POST /v1/ingest/text or the markdown ingestion script.
curl -s -X POST http://localhost:8022/v1/search \
-H "Content-Type: application/json" \
-d '{"query": "What GPU runs GLM-4.7-Flash?", "limit": 10, "score_threshold": 0.15}' \
| jq '.results[] | {score, content: .content[:100]}'
Common causes of low Faithfulness:
Check contexts_preview for the low-scoring question: if only 1-2 sparse chunks were retrieved, the answer generator had little to work with.
The golden dataset is defined in source code at /mnt/apps/src/haiven-ragas/src/golden_dataset.py. It is embedded at container build time, so changing it requires a rebuild.
GOLDEN_QUESTIONS = [
{
"question": "What GPU runs GLM-4.7-Flash?",
"ground_truth": "Alpha GPU (GPU-eef2a28f, GPU index 3)",
"domain": "business", # business | learning | meta | personal | creative
},
...
]
Edit GOLDEN_QUESTIONS in golden_dataset.py, then rebuild the container:
docker compose -f /mnt/apps/docker/ai/haiven-ragas/docker-compose.yml up -d --build
Guidance for good golden questions:
- Questions should have a single, factual answer that exists in the knowledge base
- Avoid questions with subjective answers or answers that change frequently
- Cover all major knowledge domains so the score reflects overall pipeline health
- A domain with no questions is invisible to the quality gate
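Domain coverage is easy to check before rebuilding. The sketch below counts questions per domain using the `domain` key and the five domain names shown in the `GOLDEN_QUESTIONS` example; the helper names are hypothetical and not part of golden_dataset.py.

```python
from collections import Counter

# Domain names taken from the comment in the GOLDEN_QUESTIONS example.
DOMAINS = ("business", "learning", "meta", "personal", "creative")


def domain_coverage(questions, domains=DOMAINS):
    """Map each known domain to the number of golden questions it has."""
    counts = Counter(q["domain"] for q in questions)
    return {domain: counts.get(domain, 0) for domain in domains}


def uncovered_domains(questions, domains=DOMAINS):
    """Domains with no questions, which the quality gate cannot see."""
    return [d for d, n in domain_coverage(questions, domains).items() if n == 0]
```

Running `uncovered_domains(GOLDEN_QUESTIONS)` after an edit gives a quick list of domains that still need at least one question.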
The quality_gate.py module provides a CLI for scripted quality gate enforcement (e.g., in CI pipelines or pre-deployment checks):
# Run inside the container via docker exec
docker exec haiven-ragas python -m src.quality_gate
# With custom thresholds
docker exec haiven-ragas python -m src.quality_gate \
--context-precision-min 0.70 \
--faithfulness-min 0.55
# Against a different service URL
docker exec haiven-ragas python -m src.quality_gate \
--ragas-url http://localhost:8470
Exit codes:
- 0 — all metrics meet or exceed thresholds
- 1 — one or more metrics below threshold
- 2 — evaluation failed (service unreachable, timeout, etc.)
Reports are saved to /data/results/quality-gate-<YYYY-MM-DD-HHMM>.json.
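In a CI job it helps to distinguish a failed gate (exit code 1) from a failed evaluation (exit code 2), since only the first says anything about quality. The wrapper below is a sketch built on the `docker exec` command shown above; `describe_exit` and `run_gate` are hypothetical helper names.

```python
import subprocess

# Exit codes as documented for the quality gate CLI.
EXIT_MEANINGS = {
    0: "pass",
    1: "below threshold",
    2: "evaluation failed",
}


def describe_exit(code):
    """Translate a quality gate exit code into a short status string."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(extra_args=()):
    """Run the gate in the container and return (exit code, status string)."""
    cmd = ["docker", "exec", "haiven-ragas", "python", "-m", "src.quality_gate", *extra_args]
    completed = subprocess.run(cmd)
    return completed.returncode, describe_exit(completed.returncode)
```

A pipeline might call `run_gate(["--context-precision-min", "0.70"])`, retry on `"evaluation failed"`, and block the deployment only on `"below threshold"`.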
The regression.py module compares a current evaluation run against a saved baseline and flags per-question regressions:
# First run: creates a baseline, exits 0
docker exec haiven-ragas python -m src.regression
# Second run: compares against baseline, produces regression report
docker exec haiven-ragas python -m src.regression
# Save current run as new baseline after a deliberate improvement
docker exec haiven-ragas python -m src.regression --save-baseline
# Compare against a specific baseline file
docker exec haiven-ragas python -m src.regression \
--baseline /data/results/baselines/2026-02-24-ragas-baseline.json
# Tighten the regression threshold (default 0.05)
docker exec haiven-ragas python -m src.regression --threshold 0.03
Reports are saved to /data/results/regression/ as both .md and .json. A question is flagged as REGRESSED if its Context Precision or Faithfulness drops by more than the threshold compared to baseline.
Exit codes:
- 0 — no regressions (or baseline created)
- 1 — at least one question regressed beyond threshold
- 2 — evaluation failed
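The per-question comparison can be sketched as follows. This illustrates the rule stated above (a drop larger than the threshold in either metric) and is not the regression.py implementation. Matching rows by question text and skipping null scores are assumptions.

```python
def find_regressions(baseline, current, threshold=0.05,
                     metrics=("context_precision", "faithfulness")):
    """Return one entry per (question, metric) that dropped by more than threshold."""
    baseline_by_question = {row["question"]: row for row in baseline}
    regressed = []
    for row in current:
        base = baseline_by_question.get(row["question"])
        if base is None:
            continue  # new question: nothing to compare against
        for metric in metrics:
            before, after = base.get(metric), row.get(metric)
            if before is None or after is None:
                continue  # assumption: null scores are not compared
            drop = before - after
            if drop > threshold:
                regressed.append({
                    "question": row["question"],
                    "metric": metric,
                    "baseline": before,
                    "current": after,
                    "drop": round(drop, 4),
                })
    return regressed
```

Both arguments are `per_question` lists from saved result files. Improvements and drops within the threshold are ignored, so an empty list corresponds to exit code 0.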
The standard improvement cycle:
curl -s -X POST http://localhost:8470/evaluate \
-H "Content-Type: application/json" \
-d '{"dataset": "golden", "label": "added-gpu-docs"}' | jq .scores
Compare context_precision and faithfulness to the previous run.
What moved the score to 0.672 (historical example):
Before 2026-02-24, questions about GPU assignments ("What GPU runs GLM-4.7-Flash?", "What GPU runs Seed-36B?") were scoring near 0 because haiven-knowledge lacked the specific GPU UUID-to-service mappings. Ingesting MEMORY.md (which contains 11 chunks of dense infrastructure facts) provided the exact context these questions needed. Context Precision jumped from the low-0.5 range to 0.672.
# Start
cd /mnt/apps/docker/ai/haiven-ragas && docker compose up -d
# Stop
docker compose down
# Rebuild after source code changes (golden dataset, evaluator logic)
docker compose up -d --build
# Restart without rebuild (config/env changes)
docker restart haiven-ragas
# View logs (live)
docker logs -f haiven-ragas
# View recent logs
docker logs --tail 100 haiven-ragas
Check that LiteLLM and GLM are reachable:
docker logs haiven-ragas --tail 50
curl -s http://localhost:4000/health
curl -s http://localhost:6000/health
The judge LLM returned unparseable output for every question. Most common causes:
- LiteLLM is routing to a different model than expected
- GLM is returning a response format the code fence stripper can't handle (check raw GLM output)
- Network issue between haiven-ragas and LiteLLM
Check raw GLM output:
curl -s -X POST http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(grep RAGAS_LITELLM_API_KEY /mnt/apps/docker/ai/haiven-ragas/.env | cut -d= -f2)" \
-d '{"model":"glm-4-7-flash","messages":[{"role":"user","content":"Return JSON: {\"answer\":\"test\"}"}],"max_tokens":50,"extra_body":{"enable_thinking":false}}' \
| jq '.choices[0].message'
If content is null and the answer is in reasoning_content, thinking mode may be enabled unexpectedly. Verify enable_thinking: false is being passed through LiteLLM.
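The same check can be applied to a captured response in a script. A minimal sketch, assuming the message is the `choices[0].message` object from the chat completion response; the function name and the returned labels are illustrative.

```python
def classify_message(message):
    """Classify a chat completion message by where the answer text ended up."""
    content = message.get("content")
    reasoning = message.get("reasoning_content")
    if content:
        return "ok"
    if reasoning:
        # The answer landed in reasoning_content: thinking mode is probably on.
        return "thinking_enabled"
    return "empty"
```

A result of `"thinking_enabled"` points at the `enable_thinking` setting rather than at the judge prompt or the parser.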
If Context Precision drops more than 0.1 after a change to haiven-knowledge:
A full run with 78 questions takes 10-20 minutes. If it exceeds 30 minutes, check:
# Is GLM responsive?
curl -s -X POST http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $RAGAS_LITELLM_API_KEY" \
-d '{"model":"glm-4-7-flash","messages":[{"role":"user","content":"hi"}],"max_tokens":10,"extra_body":{"enable_thinking":false}}' \
| jq '.choices[0].message'
# Is haiven-knowledge responsive?
curl -s -X POST http://localhost:8022/v1/search \
-H "Content-Type: application/json" \
-d '{"query":"test","limit":3}' | jq .
If GLM is slow (GPU under load from other requests), the evaluation will slow down proportionally. RunConfig(max_workers=2) limits concurrency, but each individual LLM call can still take 60-120 seconds under load.
The results directory /mnt/apps/data/ragas/ may not exist yet or may contain no .json files. Run a first evaluation to create it.
ls /mnt/apps/data/ragas/
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Liveness probe; returns {"status":"ok"} |
| /evaluate | POST | Run RAGAS evaluation (synchronous, 10-20 min) |
| /results/latest | GET | Return the most recent result JSON |
See openapi.yaml for full request/response schemas and example values.