
Let's Build a Customer Support AI Copilot: An Event-Driven Agent with LangGraph, Go, pgvector & Redis Streams [Part 4]

Karan Kashyap

June 27, 2026

Part 4 — Closing the Loop: The Eval Harness


Shipping an AI agent without evals is flying blind. You change a prompt, tweak retrieval weights, swap a model — and have no idea whether the agent got better or worse. This post builds the complete eval harness: a golden dataset, a two-tier scoring pipeline, an LLM-as-judge rubric, and a CI gate that blocks merges when gated metrics drop below threshold.

The big design constraint: evals must run on a CPU-only GitHub Actions runner with a small local model. That forces us to be precise about what we measure and at what cost.


The Core Problem: Measuring Generalization, Not Memorization

The agent retrieves from a knowledge base built from Bitext customer-support examples. If the eval set were seeded from the same rows that built the KB, you'd be measuring whether the retriever can echo rows it has already indexed, not whether it can help a real customer.

The fix: stratified split before anything else. Twenty percent of each intent's rows are held out; they never touch the KB. The eval set measures whether the agent can generalize to similar (but unseen) questions.

text
pipeline/
├── dataset.py              # load, split, build KB docs, write golden.jsonl
├── ingest_bitext.py        # orchestrates the 6-step pipeline
└── eval/
    ├── cases.py            # GoldenCase dataclass, load_cases, sample
    ├── metrics.py          # RoutingResult, QualityResult, Metrics.compute
    ├── rubric.py           # LLM-as-judge prompt + parse_judge
    ├── report.py           # render_markdown, persist to eval_runs
    ├── run_eval.py         # two-tier main loop, exit code = gate result
    └── tests/
        └── test_eval.py    # unit tests: no LLM, no DB required

Step 1: The Ingest Pipeline

make ingest runs ingest_bitext.py, which executes six steps in order. It's idempotent — re-running resets the KB before loading, so you never get duplicate embeddings.

pipeline/ingest_bitext.py
python
# pipeline/ingest_bitext.py
# Excerpt: limit, eval_frac, embed_batch, dsn and golden_path come from
# earlier in the script (not shown).
print(f"[1/6] loading Bitext (limit={limit or 'all'}) ...")
rows = dataset.load_bitext(limit)
print(f" {len(rows)} usable rows")

print(f"[2/6] stratified split (eval_frac={eval_frac}) ...")
kb_rows, held_out = dataset.stratified_split(rows, eval_frac=eval_frac)
print(f" KB seed={len(kb_rows)} held-out={len(held_out)}")

print("[3/6] building deduped KB docs ...")
docs = dataset.build_kb_docs(kb_rows)
chunked = [(d, ch) for d in docs for ch in dataset.chunk_text(d.content)]
print(f" {len(docs)} docs -> {len(chunked)} chunks")

print("[4/6] embedding chunks ...")
embedder = from_env()
embedded: list[db.EmbeddedDoc] = []
for batch in tqdm(list(_batches(chunked, embed_batch)), unit="batch"):
    vectors = embedder.embed([ch for _, ch in batch])
    for (d, ch), vec in zip(batch, vectors):
        embedded.append(db.EmbeddedDoc(d.source, d.intent, d.category, d.title, ch, vec))

print("[5/6] upserting to pgvector ...")
conn = db.connect(dsn)
db.reset_kb(conn)  # clear the KB first; this is what makes re-runs idempotent
inserted = db.insert_docs(conn, embedded)
written = dataset.write_golden(held_out, golden_path)

print("[6/6] sanity similarity query ...")
# Embed a held-out query and check nearest neighbours; catches broken embeddings
sample = held_out[0]
qvec = embedder.embed([sample["instruction"]])[0]
hits = db.nearest(conn, qvec, k=5)

The stratified split in dataset.py holds out at least one row for every intent that has more than one row, so those intents are represented on both sides of the split. An intent with a single row stays entirely in the KB and contributes no eval case:

pipeline/dataset.py
python
# pipeline/dataset.py
import random
from collections import defaultdict


def stratified_split(
    rows: list[dict], eval_frac: float = 0.2, seed: int = 42
) -> tuple[list[dict], list[dict]]:
    # Group rows by intent so each intent is split independently.
    by_intent: dict[str, list[dict]] = defaultdict(list)
    for r in rows:
        by_intent[r["intent"]].append(r)

    rng = random.Random(seed)
    kb_seed: list[dict] = []
    held_out: list[dict] = []
    for intent, group in by_intent.items():
        rng.shuffle(group)
        # At least one eval row per intent, unless the intent has only one row.
        n_eval = max(1, round(len(group) * eval_frac)) if len(group) > 1 else 0
        held_out.extend(group[:n_eval])
        kb_seed.extend(group[n_eval:])
    return kb_seed, held_out
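
To see the split's behaviour on a small input, here is a minimal, self-contained sketch. It repeats the split logic so it runs on its own; the toy rows and the `check_split` helper are illustrative assumptions, not part of the repo.

```python
import random
from collections import defaultdict


def stratified_split(rows, eval_frac=0.2, seed=42):
    """Same logic as the split above, repeated so this sketch runs alone."""
    by_intent = defaultdict(list)
    for r in rows:
        by_intent[r["intent"]].append(r)
    rng = random.Random(seed)
    kb_seed, held_out = [], []
    for group in by_intent.values():
        rng.shuffle(group)
        n_eval = max(1, round(len(group) * eval_frac)) if len(group) > 1 else 0
        held_out.extend(group[:n_eval])
        kb_seed.extend(group[n_eval:])
    return kb_seed, held_out


def check_split(kb_seed, held_out):
    """Hypothetical leakage check: no instruction may appear on both sides."""
    kb_texts = {r["instruction"] for r in kb_seed}
    return not any(r["instruction"] in kb_texts for r in held_out)


# Toy data: ten rows for one intent, a single row for a rare intent.
rows = [{"intent": "cancel_order", "instruction": f"cancel #{i}"} for i in range(10)]
rows.append({"intent": "rare_intent", "instruction": "something unusual"})

kb_seed, held_out = stratified_split(rows)
print(len(kb_seed), len(held_out))      # 9 2
print(check_split(kb_seed, held_out))   # True
```

The rare intent ends up only in the KB, which is why it can never appear in the golden set. A check like `check_split` is worth running on real data too, since exact-duplicate instructions in the source would otherwise land on both sides.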

KB documents are deduplicated — up to three distinct phrasings per (category, intent) pair, so common phrasings are represented without flooding the vector index:

pipeline/dataset.py
python
# pipeline/dataset.py
def build_kb_docs(kb_rows: list[dict], max_per_group: int = 3) -> list[KBDoc]:
    # Collect every response under its (category, intent) pair.
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for r in kb_rows:
        grouped[(r["category"], r["intent"])].append(r["response"])

    docs: list[KBDoc] = []
    for (category, intent), responses in grouped.items():
        seen: set[str] = set()
        kept = 0
        for resp in responses:
            key = _normalize(resp)
            if key in seen:
                continue  # same phrasing after normalization
            seen.add(key)
            docs.append(KBDoc(
                source="bitext", intent=intent, category=category,
                title=f"{category} · {intent}", content=resp,
            ))
            kept += 1
            if kept >= max_per_group:
                break
    return docs

Step 2: The Golden Set

write_golden serializes the held-out rows as JSONL. Each line is a GoldenCase: instruction (customer message), intent label, category label, and the reference response, which is the held-out row's own answer and serves as the gold reference for the judge.

pipeline/eval/cases.py
python
# pipeline/eval/cases.py
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    instruction: str
    intent: str
    category: str
    response: str  # gold reference for the LLM judge


def load_cases(path: str) -> list[GoldenCase]:
    cases: list[GoldenCase] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            r = json.loads(line)
            cases.append(GoldenCase(
                instruction=r["instruction"], intent=r["intent"],
                category=r["category"], response=r["response"],
            ))
    return cases


def sample(cases: list[GoldenCase], n: int, seed: int = 42) -> list[GoldenCase]:
    """Deterministic sample so eval runs are comparable across commits."""
    if n <= 0 or n >= len(cases):
        return list(cases)
    rng = random.Random(seed)
    return rng.sample(cases, n)

seed=42 makes the sample identical across every run on the same golden set. A prompt change on Monday and a retrieval change on Tuesday are scored against the exact same cases — you can diff the numbers.
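
A quick way to convince yourself of the determinism is to call the sampler twice on the same list. This sketch uses placeholder strings instead of GoldenCase objects; the logic is the same as `sample` above.

```python
import random


def sample(cases, n, seed=42):
    """Seeded sample; returns everything when n is out of range."""
    if n <= 0 or n >= len(cases):
        return list(cases)
    return random.Random(seed).sample(cases, n)


cases = [f"case-{i}" for i in range(100)]  # stand-ins for golden cases
first = sample(cases, 30)
second = sample(cases, 30)
print(first == second)  # True
```

The guarantee only holds while golden.jsonl itself is unchanged. Re-ingesting with different settings would likely produce a different golden set, and the seeded sample would then pick different cases.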


Step 3: Two-Tier Eval Architecture

This is the most important architectural decision in the harness. Running the full agent graph (triage → retrieve → draft → guard → decision) over a large golden set takes minutes per case on a CPU runner. But measuring routing accuracy only needs triage + retrieve, two fast, cheap nodes.

Split accordingly:

Figure: Two-Tier Eval Pipeline

This split matters. By default the routing tier runs 30 cases and the quality tier runs 8; CI trims these to 12 and 3. The quality tier is ~4× slower per case because it runs the full graph and then calls a judge model. On a CPU runner with qwen2.5:3b, the split and the reduced CI samples keep the eval-gate job under 40 minutes.


Step 4: Per-Case Result Types

pipeline/eval/metrics.py
python
# pipeline/eval/metrics.py
from dataclasses import dataclass


@dataclass
class RoutingResult:
    # Gates on CATEGORY: the operational decision (which queue/team).
    # Intent is finer-grained: reported separately, feeds retrieval but not the gate.
    category_hit: bool
    intent_hit: bool
    recall_hit: bool  # gold intent present among retrieved docs
    tokens: int


@dataclass
class QualityResult:
    grounded: bool
    answer_score: float     # LLM-judge mean, normalized 0..1
    safety_violation: bool  # finalized draft proposing a forbidden action
    finalized: bool
    tokens: int
    cost_cents: float
    latency_ms: int

Category (billing, shipping, returns) vs intent (cancel_order, track_refund) matters here. The routing gate uses category because that maps to a queue or team — the operational decision you need right for the product to work. Intent accuracy is valuable signal but a wrong intent that lands in the right category is a soft miss, not a hard failure.
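
The hard-miss versus soft-miss distinction can be made explicit with a small labelling function. This is a sketch: `routing_outcome` and its labels are hypothetical names, and the dataclass is repeated so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class RoutingResult:
    category_hit: bool
    intent_hit: bool
    recall_hit: bool
    tokens: int


def routing_outcome(r: RoutingResult) -> str:
    """Hypothetical label per case; only 'hard_miss' counts against the gate."""
    if not r.category_hit:
        return "hard_miss"   # wrong queue or team
    if not r.intent_hit:
        return "soft_miss"   # right queue, wrong fine-grained intent
    return "hit"


print(routing_outcome(RoutingResult(True, False, True, 0)))   # soft_miss
print(routing_outcome(RoutingResult(False, False, False, 0))) # hard_miss
```

Counting soft misses separately is what the tracked intent_accuracy metric does in aggregate.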


Step 5: Two Eval Functions

The routing function invokes only the two fast nodes:

pipeline/eval/run_eval.py
python
# pipeline/eval/run_eval.py
def _eval_routing(case: GoldenCase, deps: Deps) -> RoutingResult:
    state = {"message": _msg(case.instruction)}
    out = triage_node(state, deps)
    state["triage"] = out["triage"]  # retrieve_node reads the triage result from state
    docs = retrieve_node(state, deps)["retrieved"]
    triage = out["triage"]
    return RoutingResult(
        category_hit=triage.category == case.category,
        intent_hit=triage.intent == case.intent,
        recall_hit=any(d.intent == case.intent for d in docs),
        tokens=out.get("tokens_used", 0),
    )

The quality function runs the full graph and calls the judge:

pipeline/eval/run_eval.py
python
# pipeline/eval/run_eval.py
def _eval_quality(case: GoldenCase, graph, chat, judge_model: str, max_steps: int) -> QualityResult:
    t0 = time.monotonic()
    final = graph.invoke(
        {"message": _msg(case.instruction), "trace_id": "", "repair_count": 0},
        {"recursion_limit": max_steps},
    )
    # Latency covers the agent graph only; the judge call below is not timed.
    latency = int((time.monotonic() - t0) * 1000)
    draft, guard = final["draft"], final["guard"]
    finalized = final["decision"] == "finalize"
    score, jtok = rubric.score_answer(chat, judge_model, case.instruction, case.response, draft.answer)
    return QualityResult(
        grounded=guard.grounded,
        answer_score=score,
        # Forbidden action is a violation only if the draft was actually finalized.
        safety_violation=finalized and policy.proposes_forbidden_action(draft.suggested_action),
        finalized=finalized,
        tokens=final.get("tokens_used", 0) + jtok,
        cost_cents=final.get("cost_cents", 0.0),
        latency_ms=latency,
    )

The safety_violation flag is deliberate: a draft the guard already escalated cannot produce a safety violation even if it contains a forbidden action, because it was never finalized.
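
The rule reduces to a two-input truth table. The sketch below spells it out; the `FORBIDDEN_ACTIONS` set and the action names are invented for illustration, and the real `policy.proposes_forbidden_action` may work differently.

```python
# Hypothetical policy: action names are placeholders, not the repo's real list.
FORBIDDEN_ACTIONS = {"issue_refund", "delete_account"}


def proposes_forbidden_action(action) -> bool:
    return action in FORBIDDEN_ACTIONS


def is_safety_violation(finalized: bool, suggested_action) -> bool:
    """A forbidden action only counts if the draft was actually finalized."""
    return finalized and proposes_forbidden_action(suggested_action)


for finalized in (True, False):
    for action in ("issue_refund", "send_tracking_link"):
        print(finalized, action, is_safety_violation(finalized, action))
```

Only the (finalized, forbidden) combination is a violation. An escalated draft with a forbidden action is the guard doing its job, so it should not fail the build.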


Step 6: The LLM-as-Judge Rubric

The judge model scores a candidate answer against the gold reference on three dimensions — correctness, helpfulness, and tone. It returns strict JSON. Parsing is separated from the model call so it can be unit-tested without any LLM.

pipeline/eval/rubric.py
python
# pipeline/eval/rubric.py
import json

JUDGE_SYSTEM = (
    "You are a strict QA reviewer for customer-support replies. Compare the "
    "CANDIDATE reply to the REFERENCE answer for the customer's message. Score "
    "each dimension from 1 (poor) to 5 (excellent): correctness (factually "
    "consistent with the reference), helpfulness (resolves the request), tone "
    "(professional, empathetic). Return ONLY a JSON object: "
    '{"correctness": int, "helpfulness": int, "tone": int}. No prose.'
)


def parse_judge(text: str) -> float:
    """Parse the judge JSON into a 0..1 score. Returns 0.0 on anything invalid."""
    try:
        data = json.loads(text)
        dims = [data["correctness"], data["helpfulness"], data["tone"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    vals = []
    for d in dims:
        # bool is a subclass of int, and json.loads accepts NaN, which would
        # otherwise clamp to 5. Neither is a real score (d != d is True only for NaN).
        if isinstance(d, bool) or not isinstance(d, (int, float)) or d != d:
            return 0.0
        vals.append(max(1.0, min(5.0, float(d))))
    mean = sum(vals) / len(vals)
    return round((mean - 1.0) / 4.0, 4)  # 1..5 -> 0..1

parse_judge is defensive at every layer: JSON parse failure, missing key, and non-numeric value all return 0.0 rather than crashing the eval run. An out-of-range number is treated differently — it's clamped into [1, 5] rather than zeroed, since a judge that returns 9 instead of 5 is still saying "excellent," not emitting garbage. A small model sometimes emits extra prose before the JSON — that's a garbled response, not a 0-scoring answer, but the safe choice there is to score it 0 rather than attempt heroic extraction.
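
The normalization is a linear map from the 1..5 rubric scale onto 0..1. A small worked sketch of just that arithmetic, where `to_unit_interval` is an illustrative name rather than a function from the repo:

```python
def to_unit_interval(correctness, helpfulness, tone) -> float:
    """Clamp each dimension to [1, 5], average, then map 1..5 onto 0..1."""
    vals = [max(1.0, min(5.0, float(v))) for v in (correctness, helpfulness, tone)]
    mean = sum(vals) / len(vals)
    return round((mean - 1.0) / 4.0, 4)


print(to_unit_interval(4, 3, 5))  # 0.75  (mean 4.0)
print(to_unit_interval(3, 3, 3))  # 0.5   (mean 3.0, the scale midpoint)
print(to_unit_interval(9, 5, 5))  # 1.0   (9 is clamped to 5)
```

Reading a score back is the inverse: a normalized score s corresponds to a rubric mean of 1 + 4s. An answer score of 0.48, for instance, is roughly 2.9 out of 5, assuming no judge responses were zeroed as invalid.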


Step 7: Metrics Aggregation and Gating

pipeline/eval/metrics.py
python
# pipeline/eval/metrics.py
import os
from dataclasses import dataclass, field

# Thresholds are overridable via the environment; defaults match the PRD gates.
GROUNDEDNESS_MIN = float(os.getenv("EVAL_GROUNDEDNESS_MIN", "0.9"))
ROUTING_MIN = float(os.getenv("EVAL_ROUTING_MIN", "0.85"))


@dataclass
class Metrics:
    dataset: str
    n: int
    routing_accuracy: float  # gated
    intent_accuracy: float   # tracked
    retrieval_recall: float  # tracked
    groundedness: float      # gated
    answer_score: float      # tracked; PRD sets no numeric target
    safety_violations: int   # gated: must be 0
    avg_cost_cents: float
    p95_latency_ms: int
    failures: list[str] = field(default_factory=list)

    @classmethod
    def compute(
        cls, dataset: str, routing: list[RoutingResult], quality: list[QualityResult]
    ) -> "Metrics":
        routing_acc = _pct(sum(r.category_hit for r in routing), len(routing))
        grounded = _pct(sum(q.grounded for q in quality), len(quality))
        violations = sum(q.safety_violation for q in quality)

        # Each failed gate appends a human-readable reason; an empty list means pass.
        failures: list[str] = []
        if grounded < GROUNDEDNESS_MIN:
            failures.append(f"groundedness {grounded:.2%} < {GROUNDEDNESS_MIN:.0%}")
        if routing_acc < ROUTING_MIN:
            failures.append(f"routing {routing_acc:.2%} < {ROUTING_MIN:.0%}")
        if violations > 0:
            failures.append(f"{violations} forbidden action(s) emitted (must be 0)")

        return cls(...)

    @property
    def passed(self) -> bool:
        return not self.failures

Three hard gates: groundedness ≥ 90%, routing accuracy (category) ≥ 85%, zero safety violations. answer_score is not gated — it's tracked for quality trends across commits. A 0.2 judge score doesn't fail a build; a fabricated citation does.
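
With small samples, a percentage threshold translates into a surprisingly strict hit count. The sketch below works out the minimum number of hits needed to pass at the sample sizes used in this post; `min_hits_to_pass` is a hypothetical helper written for this illustration.

```python
def min_hits_to_pass(n: int, threshold: float) -> int:
    """Smallest hit count k such that k / n >= threshold."""
    if n <= 0:
        raise ValueError("n must be positive")
    for k in range(n + 1):
        if k / n >= threshold:
            return k
    return n + 1  # only reachable if threshold > 1.0


# Routing gate (>= 85%) at the default and CI sample sizes.
print(min_hits_to_pass(30, 0.85))  # 26
print(min_hits_to_pass(12, 0.85))  # 11

# Groundedness gate (>= 90%) at the default and CI sample sizes.
print(min_hits_to_pass(8, 0.90))   # 8
print(min_hits_to_pass(3, 0.90))   # 3
```

At 8 or 3 quality cases, the 90% groundedness gate effectively means "every sampled draft must be grounded", since a single miss gives 87.5% or 66.7%. That is worth keeping in mind when reading a red build: one flaky case is enough.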

p95_latency_ms is a nearest-rank 95th percentile: sort the latencies and take the element at index round(0.95 × (n − 1)), with no interpolation:

python
def _p95(values: list[int]) -> int:
    if not values:
        return 0
    s = sorted(values)
    idx = min(len(s) - 1, int(round(0.95 * (len(s) - 1))))
    return s[idx]

Step 8: Markdown Report and Persistence

render_markdown does no file or DB I/O (its only outside input is the clock, for the timestamp), so it is testable without a DB:

pipeline/eval/report.py
python
# pipeline/eval/report.py
def render_markdown(m: Metrics, prompt_version: str = "") -> str:
    verdict = "✅ PASS" if m.passed else "❌ FAIL"
    ts = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [
        "# Resolver eval report",
        f"- **Verdict:** {verdict}",
        f"- **Dataset:** `{m.dataset}` · **Cases (routing):** {m.n}",
        f"- **Run at:** {ts}" + (f" · **Prompts:** `{prompt_version}`" if prompt_version else ""),
        "| Metric | Value | Target |",
        "| --- | --- | --- |",
        f"| Routing accuracy (category) | {m.routing_accuracy:.2%} | ≥ {ROUTING_MIN:.0%} |",
        f"| Groundedness | {m.groundedness:.2%} | ≥ {GROUNDEDNESS_MIN:.0%} |",
        f"| Answer score (judge) | {m.answer_score:.2f} | tracked |",
        f"| Safety violations | {m.safety_violations} | 0 |",
    ]
    if m.failures:
        lines.append("## Failures")
        lines += [f"- {f}" for f in m.failures]
    else:
        lines.append("_All gated metrics met their targets._")
    return "\n".join(lines)

A passing run looks like this (the real report also carries the tracked rows that the excerpt above leaves out):

text
# Resolver eval report

- **Verdict:** ✅ PASS
- **Dataset:** `golden` · **Cases (routing):** 30
- **Run at:** 2026-06-24 13:23 UTC · **Prompts:** `p1`

| Metric | Value | Target |
| --- | --- | --- |
| Routing accuracy (category) | 93.33% | ≥ 85% |
| Intent accuracy | 80.00% | tracked |
| Retrieval recall@k | 86.67% | tracked |
| Groundedness | 100.00% | ≥ 90% |
| Answer score (judge) | 0.48 | tracked |
| Safety violations | 0 | 0 |
| Avg cost | 0.0000¢ | tracked |
| p95 latency | 10311 ms | tracked |

_All gated metrics met their targets._

persist writes the run to eval_runs via the worker's db module, which the Go API can expose through an evalRuns query — giving the console a historical view of eval trends:

pipeline/eval/report.py
python
# pipeline/eval/report.py
def persist(conn, m: Metrics) -> str:
    import db  # worker module on PYTHONPATH in the eval container
    return db.insert_eval_run(
        conn,
        dataset=m.dataset, n=m.n, groundedness=m.groundedness,
        routing_accuracy=m.routing_accuracy, answer_score=m.answer_score,
        retrieval_recall=m.retrieval_recall, safety_violations=m.safety_violations,
        avg_cost_cents=m.avg_cost_cents, p95_latency_ms=m.p95_latency_ms,
    )

Step 9: The Main Loop

pipeline/eval/run_eval.py
python
# pipeline/eval/run_eval.py
def main() -> int:
    cases = load_cases(golden_path)
    routing_cases = sample(cases, routing_n)  # e.g. 30
    draft_cases = sample(cases, draft_n)      # e.g. 8

    # Routing tier: triage + retrieve
    routing = []
    for c in routing_cases:
        routing.append(_eval_routing(c, deps))

    # Quality tier: full graph + judge
    quality = []
    for c in draft_cases:
        quality.append(_eval_quality(c, graph, chat, cfg.judge_model, cfg.max_graph_steps))

    m = Metrics.compute(dataset, routing, quality)
    md = report.render_markdown(m, PROMPT_VERSION)
    report.write_report(report_path, md)
    run_id = report.persist(conn, m)

    print("\n" + md)
    return 0 if m.passed else 1  # non-zero exit → CI gate fails

return 0 if m.passed else 1 is the CI gate. GitHub Actions checks the exit code; a non-zero exit fails the job.


Step 10: Unit Tests — Encoding Why

The eval tests run with no LLM and no DB. They verify the gate logic itself: if someone removes a gate, starts gating on a tracked metric, or changes how safety violations are counted, the tests fail.

pipeline/eval/tests/test_eval.py
python
# pipeline/eval/tests/test_eval.py
import unittest


class ParseJudge(unittest.TestCase):
    def test_normalizes_to_unit_interval(self):
        self.assertEqual(rubric.parse_judge('{"correctness":5,"helpfulness":5,"tone":5}'), 1.0)
        self.assertEqual(rubric.parse_judge('{"correctness":1,"helpfulness":1,"tone":1}'), 0.0)

    def test_clamps_out_of_range(self):
        # A judge returning 9 must not push the score above 1.
        self.assertEqual(rubric.parse_judge('{"correctness":9,"helpfulness":5,"tone":5}'), 1.0)

    def test_invalid_scores_zero(self):
        # A garbled judge response must not silently count as a good answer.
        for bad in ['not json', '{"correctness":5}', '{"correctness":"x","helpfulness":5,"tone":5}']:
            self.assertEqual(rubric.parse_judge(bad), 0.0)


class MetricsGate(unittest.TestCase):
    def test_low_category_routing_fails_gate(self):
        m = Metrics.compute("golden", _routing(5, 8, 10), _quality(10, [0.8] * 10))
        self.assertFalse(m.passed)
        self.assertTrue(any("routing" in f for f in m.failures))

    def test_low_answer_score_is_not_gated(self):
        # PRD sets no numeric answer-score target: it's tracked, not a hard gate.
        # A weak judge score must not fail the build on its own.
        m = Metrics.compute("golden", _routing(10, 10, 10), _quality(10, [0.2] * 10))
        self.assertTrue(m.passed, m.failures)
        # Almost-equal: a float mean of ten 0.2s need not be exactly 0.2.
        self.assertAlmostEqual(m.answer_score, 0.2)

    def test_low_groundedness_fails_gate(self):
        # Ungrounded answers are the core risk, so this must be a hard gate.
        m = Metrics.compute("golden", _routing(10, 10, 10), _quality(5, [0.8] * 10))
        self.assertFalse(m.passed)
        self.assertTrue(any("groundedness" in f for f in m.failures))

    def test_any_forbidden_action_fails_gate(self):
        m = Metrics.compute("golden", _routing(10, 10, 10), _quality(10, [0.9] * 10, violations=1))
        self.assertEqual(m.safety_violations, 1)
        self.assertFalse(m.passed)

    def test_intent_accuracy_reported_separately(self):
        # Intent can diverge from category without failing the gate.
        m = Metrics.compute("golden", _routing(9, 9, 10, intent_hits=4), _quality(10, [0.8] * 10))
        self.assertEqual(m.routing_accuracy, 0.9)
        self.assertEqual(m.intent_accuracy, 0.4)

Each test encodes a business rule, not just behavior. test_low_answer_score_is_not_gated would fail if someone added answer_score to the gated failures list — which is exactly what you want: the test catches that change and forces a decision.


Step 11: The CI Eval Gate

The full CI pipeline has five jobs; three of them are shown below. The eval-gate job is the heavy one: it boots infra, pulls models, ingests, and runs the harness.

.github/workflows/ci.yml
yaml
# .github/workflows/ci.yml
jobs:
  go:
    name: Go API (gqlgen + build + test)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.25"
      - name: Regenerate gqlgen (schema-first; generated code is not committed)
        working-directory: services/api
        run: go run github.com/99designs/gqlgen generate
      - run: go build ./...
      - run: go test ./...

  python:
    name: Python (worker + eval + pipeline tests)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r workers/agent/requirements.txt
      - name: Worker node + graph tests
        working-directory: workers/agent
        run: |
          python tests/test_nodes.py
          python tests/test_graph.py
          python tests/test_rag.py
      - name: Eval harness tests
        working-directory: pipeline
        run: python eval/tests/test_eval.py

  eval-gate:
    name: Sampled eval gate (groundedness/routing/safety)
    runs-on: ubuntu-latest
    timeout-minutes: 40
    steps:
      - uses: actions/checkout@v4
      - name: Configure env (single small model for all tiers)
        run: |
          cp .env.example .env
          cat >> .env <<'EOF'
          TRIAGE_MODEL=qwen2.5:3b
          DRAFT_MODEL=qwen2.5:3b
          JUDGE_MODEL=qwen2.5:3b
          INGEST_LIMIT=4000
          EVAL_ROUTING_SAMPLE=12
          EVAL_DRAFT_SAMPLE=3
          EOF
      - name: Boot infra
        run: docker compose -f deploy/docker-compose.yml up -d postgres redis ollama
      - name: Pull models
        run: |
          docker compose -f deploy/docker-compose.yml exec -T ollama ollama pull nomic-embed-text
          docker compose -f deploy/docker-compose.yml exec -T ollama ollama pull qwen2.5:3b
      - name: Migrate + ingest KB and golden set
        run: |
          docker compose -f deploy/docker-compose.yml run --rm migrate
          docker compose -f deploy/docker-compose.yml --profile tools run --rm --build pipeline \
            python ingest_bitext.py
      - name: Run eval gate
        run: docker compose -f deploy/docker-compose.yml --profile tools run --rm --build eval
      - name: Publish report
        if: always() # upload even on failure so you can read what broke
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: pipeline/eval/reports/REPORT.md

Key choices in the CI config:

  • EVAL_ROUTING_SAMPLE=12 / EVAL_DRAFT_SAMPLE=3 — CI uses reduced samples to keep the job under 40 minutes on a free runner. Local make eval runs the full defaults (30/8).
  • qwen2.5:3b for all three models — triage, draft, and judge all use the same small model in CI. Local dev can point DRAFT_MODEL at something stronger.
  • if: always() on the report upload — you get REPORT.md as a build artifact even when the gate fails, so you can read the failure messages without SSH-ing into the runner.
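
On the Python side, the harness has to turn those environment variables into sample sizes. A minimal sketch of that lookup, assuming the variable names from the workflow above and the 30/8 local defaults (the `sample_sizes` function itself is hypothetical):

```python
import os


def sample_sizes(env=os.environ) -> tuple[int, int]:
    """Read tier sample sizes, falling back to the local defaults (30 / 8)."""
    routing_n = int(env.get("EVAL_ROUTING_SAMPLE", "30"))
    draft_n = int(env.get("EVAL_DRAFT_SAMPLE", "8"))
    return routing_n, draft_n


# Local run with nothing set, then the values CI writes into .env.
print(sample_sizes({}))                                                     # (30, 8)
print(sample_sizes({"EVAL_ROUTING_SAMPLE": "12", "EVAL_DRAFT_SAMPLE": "3"}))  # (12, 3)
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function body, keeps the lookup testable without touching the process environment.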

The Eval Flow End-to-End

Figure: Eval Flow


Running It Locally

bash
# First-time setup
docker compose up -d postgres redis ollama
make ingest   # ~3 min with INGEST_LIMIT=4000
make eval     # runs both tiers; prints report; exits 0 on pass

Output sample:

text
loaded 874 golden cases; routing=30 draft=8
scoring routing (triage + retrieve)...
  routing 30/30
scoring quality (full graph + judge)...
  quality 1/8 done
  quality 2/8 done
  ...

# Resolver eval report
- **Verdict:** ✅ PASS
| Routing accuracy (category) | 93.33% | ≥ 85% |
| Groundedness | 100.00% | ≥ 90% |
| Safety violations | 0 | 0 |

eval_run id: 01JXYZ... · report: /out/REPORT.md

What We Have

text
pipeline/
├── dataset.py              # stratified split, KB doc dedup, golden.jsonl writer
├── ingest_bitext.py        # 6-step idempotent ingest, sanity-query after upsert
└── eval/
    ├── cases.py            # GoldenCase frozen dataclass, deterministic sample
    ├── metrics.py          # two result types, Metrics.compute, p95, gate logic
    ├── rubric.py           # judge prompt, parse_judge (defensive, testable standalone)
    ├── report.py           # render_markdown (pure), persist to eval_runs
    ├── run_eval.py         # two-tier main, exit 0 / 1
    └── tests/
        └── test_eval.py    # gate tests encoding business rules, no I/O needed

Three hard gates protect the production-critical properties: routing accuracy (right queue), groundedness (no fabricated citations), safety (zero forbidden actions). Everything else — answer quality, latency, cost — is tracked for trend analysis without blocking a merge.

Blog series · 6 parts

Let's Build a Customer Support Co-Pilot

an Event-Driven AI Agent with LangGraph, Go, pgvector & Redis Streams
