Let's Build a Customer Support AI Copilot: An Event-Driven Agent with LangGraph, Go, pgvector & Redis Streams [Part 4]

Part 4 — Closing the Loop: The Eval Harness

Shipping an AI agent without evals is flying blind. You change a prompt, tweak retrieval weights, swap a model — and have no idea whether the agent got better or worse. This post builds the complete eval harness: a golden dataset, a two-tier scoring pipeline, an LLM-as-judge rubric, and a CI gate that blocks merges when gated metrics drop below threshold.

The big design constraint: evals must run on a CPU-only GitHub Actions runner with a small local model. That forces us to be precise about what we measure and at what cost.

The Core Problem: Measuring Generalization, Not Memorization

The agent retrieves from a knowledge base built from Bitext customer-support examples. If the eval set were seeded from the same rows that built the KB, you'd be measuring whether the retriever can echo documents it has already indexed, not whether it can help a real customer.

The fix: stratified split before anything else. Twenty percent of each intent's rows are held out; they never touch the KB. The eval set measures whether the agent can generalize to similar (but unseen) questions.

text

pipeline/
├── dataset.py              # load, split, build KB docs, write golden.jsonl
├── ingest_bitext.py        # orchestrates the 6-step pipeline
└── eval/
    ├── cases.py            # GoldenCase dataclass, load_cases, sample
    ├── metrics.py          # RoutingResult, QualityResult, Metrics.compute
    ├── rubric.py           # LLM-as-judge prompt + parse_judge
    ├── report.py           # render_markdown, persist to eval_runs
    ├── run_eval.py         # two-tier main loop, exit code = gate result
    └── tests/
        └── test_eval.py    # unit tests: no LLM, no DB required

Step 1: The Ingest Pipeline

make ingest runs ingest_bitext.py, which executes six steps in order. It's idempotent — re-running resets the KB before loading, so you never get duplicate embeddings.

pipeline/ingest_bitext.py

python

# pipeline/ingest_bitext.py
# Excerpt: limit, eval_frac, embed_batch, dsn and golden_path are set earlier (not shown).
print(f"[1/6] loading Bitext (limit={limit or 'all'}) ...")
rows = dataset.load_bitext(limit)
print(f"      {len(rows)} usable rows")

print(f"[2/6] stratified split (eval_frac={eval_frac}) ...")
kb_rows, held_out = dataset.stratified_split(rows, eval_frac=eval_frac)
print(f"      KB seed={len(kb_rows)}  held-out={len(held_out)}")

print("[3/6] building deduped KB docs ...")
docs = dataset.build_kb_docs(kb_rows)
chunked = [(d, ch) for d in docs for ch in dataset.chunk_text(d.content)]
print(f"      {len(docs)} docs -> {len(chunked)} chunks")

print("[4/6] embedding chunks ...")
embedder = from_env()
embedded: list[db.EmbeddedDoc] = []
for batch in tqdm(list(_batches(chunked, embed_batch)), unit="batch"):
    vectors = embedder.embed([ch for _, ch in batch])
    for (d, ch), vec in zip(batch, vectors):
        embedded.append(db.EmbeddedDoc(d.source, d.intent, d.category, d.title, ch, vec))

print("[5/6] upserting to pgvector ...")
conn = db.connect(dsn)
db.reset_kb(conn)  # wipe first so a re-run never duplicates embeddings
inserted = db.insert_docs(conn, embedded)
written = dataset.write_golden(held_out, golden_path)

print("[6/6] sanity similarity query ...")
# Embed a held-out query and check nearest neighbours; this catches broken embeddings
sample = held_out[0]
qvec = embedder.embed([sample["instruction"]])[0]
hits = db.nearest(conn, qvec, k=5)

The stratified split in dataset.py holds out at least one row from every intent that has more than one row. An intent with a single row stays entirely in the KB, so the KB never loses an intent and every multi-row intent appears on both sides of the split:

pipeline/dataset.py

python

# pipeline/dataset.py
import random
from collections import defaultdict


def stratified_split(
    rows: list[dict], eval_frac: float = 0.2, seed: int = 42
) -> tuple[list[dict], list[dict]]:
    by_intent: dict[str, list[dict]] = defaultdict(list)
    for r in rows:
        by_intent[r["intent"]].append(r)

    rng = random.Random(seed)
    kb_seed: list[dict] = []
    held_out: list[dict] = []
    for intent, group in by_intent.items():
        rng.shuffle(group)
        # At least one held-out row per intent, unless the intent has a single row.
        n_eval = max(1, round(len(group) * eval_frac)) if len(group) > 1 else 0
        held_out.extend(group[:n_eval])
        kb_seed.extend(group[n_eval:])
    return kb_seed, held_out
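To see what the split does on small groups, here is a quick check on made-up rows. The function body is repeated only so the sketch runs on its own; the toy intents and the leakage check are illustrative assumptions, not part of the pipeline.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, eval_frac=0.2, seed=42):
    # Same logic as the function above, repeated so this sketch is self-contained.
    by_intent = defaultdict(list)
    for r in rows:
        by_intent[r["intent"]].append(r)
    rng = random.Random(seed)
    kb_seed, held_out = [], []
    for group in by_intent.values():
        rng.shuffle(group)
        n_eval = max(1, round(len(group) * eval_frac)) if len(group) > 1 else 0
        held_out.extend(group[:n_eval])
        kb_seed.extend(group[n_eval:])
    return kb_seed, held_out


# Toy rows (invented for illustration): 10, 2 and 1 rows per intent.
rows = (
    [{"intent": "cancel_order", "instruction": f"cancel {i}"} for i in range(10)]
    + [{"intent": "track_refund", "instruction": f"refund {i}"} for i in range(2)]
    + [{"intent": "rare_intent", "instruction": "only one"}]
)
kb, held = stratified_split(rows)

kb_counts = Counter(r["intent"] for r in kb)
held_counts = Counter(r["intent"] for r in held)
# Any instruction text present on both sides would be leakage.
leaked = {r["instruction"] for r in kb} & {r["instruction"] for r in held}

print(dict(kb_counts))    # cancel_order: 8, track_refund: 1, rare_intent: 1
print(dict(held_counts))  # cancel_order: 2, track_refund: 1
print(leaked)             # set()
```

The split partitions rows, not texts. If the real dataset contains two rows with identical wording, a text-level check like `leaked` is what would surface it.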

KB documents are deduplicated — up to three distinct phrasings per (category, intent) pair, so common phrasings are represented without flooding the vector index:

pipeline/dataset.py

python

# pipeline/dataset.py
from collections import defaultdict


def build_kb_docs(kb_rows: list[dict], max_per_group: int = 3) -> list[KBDoc]:
    # Group reference responses by (category, intent).
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for r in kb_rows:
        grouped[(r["category"], r["intent"])].append(r["response"])

    docs: list[KBDoc] = []
    for (category, intent), responses in grouped.items():
        seen: set[str] = set()
        kept = 0
        for resp in responses:
            key = _normalize(resp)  # dedup key; near-identical text collapses
            if key in seen:
                continue
            seen.add(key)
            docs.append(KBDoc(
                source="bitext", intent=intent, category=category,
                title=f"{category} · {intent}", content=resp,
            ))
            kept += 1
            if kept >= max_per_group:
                break
    return docs

Step 2: The Golden Set

write_golden serializes the held-out rows as JSONL. Each line is a GoldenCase: the instruction (customer message), the intent label, the category label, and the reference response from the same held-out Bitext row, which the judge later uses as its gold answer.
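The writer itself is not shown in this post. A minimal sketch of what it could look like, assuming it keeps only the four GoldenCase fields and returns the number of cases written (the return value matches how `written` is used in the ingest script; everything else is a guess):

```python
import json
import os
import tempfile


def write_golden(held_out: list[dict], path: str) -> int:
    """Write one JSON object per line; return how many cases were written."""
    fields = ("instruction", "intent", "category", "response")
    with open(path, "w", encoding="utf-8") as f:
        for row in held_out:
            record = {k: row[k] for k in fields}  # drop any extra columns
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(held_out)


# Round-trip one invented row through a temporary file.
row = {
    "instruction": "where is my refund?", "intent": "track_refund",
    "category": "REFUND", "response": "You can check it under Orders.",
    "extra_column": "ignored",
}
path = os.path.join(tempfile.mkdtemp(), "golden.jsonl")
written = write_golden([row], path)
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(written, sorted(loaded[0]))
```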

pipeline/eval/cases.py

python

# pipeline/eval/cases.py
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    instruction: str
    intent: str
    category: str
    response: str   # gold reference for the LLM judge


def load_cases(path: str) -> list[GoldenCase]:
    cases: list[GoldenCase] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            r = json.loads(line)
            cases.append(GoldenCase(
                instruction=r["instruction"], intent=r["intent"],
                category=r["category"], response=r["response"],
            ))
    return cases


def sample(cases: list[GoldenCase], n: int, seed: int = 42) -> list[GoldenCase]:
    """Deterministic sample so eval runs are comparable across commits."""
    if n <= 0 or n >= len(cases):
        return list(cases)
    rng = random.Random(seed)
    return rng.sample(cases, n)

seed=42 makes the sample identical across every run on the same golden set. A prompt change on Monday and a retrieval change on Tuesday are scored against the exact same cases — you can diff the numbers.
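A short demonstration of that property, using plain strings in place of GoldenCase objects to keep it small:

```python
import random


def sample(cases: list[str], n: int, seed: int = 42) -> list[str]:
    # Same shape as cases.sample above, applied to strings for brevity.
    if n <= 0 or n >= len(cases):
        return list(cases)
    return random.Random(seed).sample(cases, n)


cases = [f"case-{i}" for i in range(100)]
first = sample(cases, 5)
second = sample(cases, 5)

print(first == second)            # True: same seed, same input, same picks
print(sample(cases, 0) == cases)  # True: n <= 0 means "use every case"
```

The guarantee only holds while the golden file itself is unchanged. Re-running ingest with a different limit produces a different list, and the same seed then selects different cases.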

Step 3: Two-Tier Eval Architecture

The most important architectural decision in the harness. Running the full agent graph (triage → retrieve → draft → guard → decision) over a large golden set takes minutes per case on a CPU runner. But measuring routing accuracy only needs triage + retrieve — two fast, cheap nodes.

Split accordingly:

[Figure: Two-tier eval pipeline. Routing tier: triage + retrieve. Quality tier: full graph + judge.]

This split matters. By default the routing tier runs 30 cases and the quality tier runs 8; CI shrinks those to 12 and 3. The quality tier is roughly 4× slower per case because it runs the full graph and then calls a judge model. On a CPU runner with qwen2.5:3b, that combination keeps the eval-gate job under its 40-minute timeout.

Step 4: Per-Case Result Types

pipeline/eval/metrics.py

python

# pipeline/eval/metrics.py
from dataclasses import dataclass


@dataclass
class RoutingResult:
    # Gates on CATEGORY: the operational decision (which queue/team).
    # Intent is finer-grained: reported separately, feeds retrieval but not the gate.
    category_hit: bool
    intent_hit: bool
    recall_hit: bool   # gold intent present among retrieved docs
    tokens: int


@dataclass
class QualityResult:
    grounded: bool
    answer_score: float       # LLM-judge mean, normalized 0..1
    safety_violation: bool    # finalized draft proposing a forbidden action
    finalized: bool
    tokens: int
    cost_cents: float
    latency_ms: int

Category (billing, shipping, returns) vs intent (cancel_order, track_refund) matters here. The routing gate uses category because that maps to a queue or team — the operational decision you need right for the product to work. Intent accuracy is valuable signal but a wrong intent that lands in the right category is a soft miss, not a hard failure.

Step 5: Two Eval Functions

The routing function invokes only the two fast nodes:

pipeline/eval/run_eval.py

python

# pipeline/eval/run_eval.py
def _eval_routing(case: GoldenCase, deps: Deps) -> RoutingResult:
    # Run only the two cheap nodes; the draft, guard and decision nodes are skipped.
    state = {"message": _msg(case.instruction)}
    out = triage_node(state, deps)
    state["triage"] = out["triage"]
    docs = retrieve_node(state, deps)["retrieved"]
    triage = out["triage"]
    return RoutingResult(
        category_hit=triage.category == case.category,
        intent_hit=triage.intent == case.intent,
        recall_hit=any(d.intent == case.intent for d in docs),
        tokens=out.get("tokens_used", 0),
    )

The quality function runs the full graph and calls the judge:

pipeline/eval/run_eval.py

python

# pipeline/eval/run_eval.py
import time


def _eval_quality(case: GoldenCase, graph, chat, judge_model: str, max_steps: int) -> QualityResult:
    t0 = time.monotonic()
    final = graph.invoke(
        {"message": _msg(case.instruction), "trace_id": "", "repair_count": 0},
        {"recursion_limit": max_steps},
    )
    latency = int((time.monotonic() - t0) * 1000)  # graph only; judge time excluded
    draft, guard = final["draft"], final["guard"]
    finalized = final["decision"] == "finalize"
    score, jtok = rubric.score_answer(chat, judge_model, case.instruction, case.response, draft.answer)
    return QualityResult(
        grounded=guard.grounded,
        answer_score=score,
        # Forbidden action is a violation only if the draft was actually finalized.
        safety_violation=finalized and policy.proposes_forbidden_action(draft.suggested_action),
        finalized=finalized,
        tokens=final.get("tokens_used", 0) + jtok,
        cost_cents=final.get("cost_cents", 0.0),
        latency_ms=latency,
    )

The safety_violation flag is deliberate: a draft the guard already escalated cannot produce a safety violation even if it contains a forbidden action, because it was never finalized.
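The rule reduces to a two-input truth table. The sketch below spells it out with a stand-in policy check; the forbidden action names and the "escalate" decision value are invented for illustration, and only "finalize" appears in the harness code above.

```python
from typing import Optional

# Invented examples; the real list lives in the project's policy module.
FORBIDDEN_ACTIONS = {"issue_refund", "delete_account"}


def proposes_forbidden_action(action: Optional[str]) -> bool:
    """Hypothetical stand-in for policy.proposes_forbidden_action."""
    return action in FORBIDDEN_ACTIONS


def is_safety_violation(decision: str, suggested_action: Optional[str]) -> bool:
    finalized = decision == "finalize"
    return finalized and proposes_forbidden_action(suggested_action)


for decision, action in [
    ("finalize", "issue_refund"),   # finalized and forbidden: violation
    ("escalate", "issue_refund"),   # forbidden but never sent: no violation
    ("finalize", "send_tracking"),  # finalized but allowed: no violation
]:
    print(decision, action, is_safety_violation(decision, action))
```

Counting only finalized drafts means the metric measures what would actually reach a customer, which is why the guard escalating a bad draft counts as the system working rather than failing.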

Step 6: The LLM-as-Judge Rubric

The judge model scores a candidate answer against the gold reference on three dimensions — correctness, helpfulness, and tone. It returns strict JSON. Parsing is separated from the model call so it can be unit-tested without any LLM.

pipeline/eval/rubric.py

python

# pipeline/eval/rubric.py
import json

JUDGE_SYSTEM = (
    "You are a strict QA reviewer for customer-support replies. Compare the "
    "CANDIDATE reply to the REFERENCE answer for the customer's message. Score "
    "each dimension from 1 (poor) to 5 (excellent): correctness (factually "
    "consistent with the reference), helpfulness (resolves the request), tone "
    "(professional, empathetic). Return ONLY a JSON object: "
    '{"correctness": int, "helpfulness": int, "tone": int}. No prose.'
)


def parse_judge(text: str) -> float:
    """Parse the judge JSON into a 0..1 score. Returns 0.0 on anything invalid."""
    try:
        data = json.loads(text)
        dims = [data["correctness"], data["helpfulness"], data["tone"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    vals = []
    for d in dims:
        if not isinstance(d, (int, float)):
            return 0.0
        vals.append(max(1.0, min(5.0, float(d))))  # clamp into the 1..5 scale
    mean = sum(vals) / len(vals)
    return round((mean - 1.0) / 4.0, 4)  # 1..5 -> 0..1

parse_judge is defensive at every layer: JSON parse failure, missing key, and non-numeric value all return 0.0 rather than crashing the eval run. An out-of-range number is treated differently — it's clamped into [1, 5] rather than zeroed, since a judge that returns 9 instead of 5 is still saying "excellent," not emitting garbage. A small model sometimes emits extra prose before the JSON — that's a garbled response, not a 0-scoring answer, but the safe choice there is to score it 0 rather than attempt heroic extraction.
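A few worked inputs make the normalization concrete. The function is copied from above so the example runs on its own; the judge outputs are invented.

```python
import json


def parse_judge(text: str) -> float:
    # Copied from rubric.py above so this example is self-contained.
    try:
        data = json.loads(text)
        dims = [data["correctness"], data["helpfulness"], data["tone"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    vals = []
    for d in dims:
        if not isinstance(d, (int, float)):
            return 0.0
        vals.append(max(1.0, min(5.0, float(d))))
    mean = sum(vals) / len(vals)
    return round((mean - 1.0) / 4.0, 4)


examples = {
    "mixed scores": '{"correctness": 4, "helpfulness": 3, "tone": 5}',  # mean 4.0
    "low scores": '{"correctness": 2, "helpfulness": 3, "tone": 3}',    # mean 8/3
    "out of range": '{"correctness": 9, "helpfulness": 5, "tone": 5}',  # 9 clamps to 5
    "prose prefix": 'Sure! {"correctness": 5, "helpfulness": 5, "tone": 5}',
}
for label, raw in examples.items():
    print(f"{label}: {parse_judge(raw)}")
# mixed scores: 0.75
# low scores: 0.4167
# out of range: 1.0
# prose prefix: 0.0
```

The last case is the trade-off described above: the scores inside the reply are perfect, but because the reply is not valid JSON it is scored 0 rather than rescued.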

Step 7: Metrics Aggregation and Gating

pipeline/eval/metrics.py

python

# pipeline/eval/metrics.py
import os
from dataclasses import dataclass, field

GROUNDEDNESS_MIN = float(os.getenv("EVAL_GROUNDEDNESS_MIN", "0.9"))
ROUTING_MIN = float(os.getenv("EVAL_ROUTING_MIN", "0.85"))


@dataclass
class Metrics:
    dataset: str
    n: int
    routing_accuracy: float   # gated
    intent_accuracy: float    # tracked
    retrieval_recall: float   # tracked
    groundedness: float       # gated
    answer_score: float       # tracked; PRD sets no numeric target
    safety_violations: int    # gated: must be 0
    avg_cost_cents: float
    p95_latency_ms: int
    failures: list[str] = field(default_factory=list)

    @classmethod
    def compute(
        cls, dataset: str, routing: list[RoutingResult], quality: list[QualityResult]
    ) -> "Metrics":
        routing_acc = _pct(sum(r.category_hit for r in routing), len(routing))
        grounded = _pct(sum(q.grounded for q in quality), len(quality))
        violations = sum(q.safety_violation for q in quality)

        # Each gate that misses its target appends one human-readable failure.
        failures: list[str] = []
        if grounded < GROUNDEDNESS_MIN:
            failures.append(f"groundedness {grounded:.2%} < {GROUNDEDNESS_MIN:.0%}")
        if routing_acc < ROUTING_MIN:
            failures.append(f"routing {routing_acc:.2%} < {ROUTING_MIN:.0%}")
        if violations > 0:
            failures.append(f"{violations} forbidden action(s) emitted (must be 0)")

        return cls(...)

    @property
    def passed(self) -> bool:
        return not self.failures

Three hard gates: groundedness ≥ 90%, routing accuracy (category) ≥ 85%, zero safety violations. answer_score is not gated — it's tracked for quality trends across commits. A 0.2 judge score doesn't fail a build; a fabricated citation does.

p95_latency_ms is a nearest-rank 95th percentile over the sorted per-case latencies:

python

def _p95(values: list[int]) -> int:
    if not values:
        return 0
    s = sorted(values)
    # Nearest-rank index into the sorted list, capped at the last element.
    idx = min(len(s) - 1, int(round(0.95 * (len(s) - 1))))
    return s[idx]

Step 8: Markdown Report and Persistence

render_markdown touches neither the filesystem nor the database (its only outside input is the clock, for the timestamp), so it is testable without a DB:

pipeline/eval/report.py

python

# pipeline/eval/report.py
import datetime as dt


def render_markdown(m: Metrics, prompt_version: str = "") -> str:
    verdict = "✅ PASS" if m.passed else "❌ FAIL"
    ts = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [
        "# Resolver eval report",
        f"- **Verdict:** {verdict}",
        f"- **Dataset:** `{m.dataset}`  ·  **Cases (routing):** {m.n}",
        f"- **Run at:** {ts}" + (f"  ·  **Prompts:** `{prompt_version}`" if prompt_version else ""),
        "| Metric | Value | Target |",
        "| --- | --- | --- |",
        f"| Routing accuracy (category) | {m.routing_accuracy:.2%} | ≥ {ROUTING_MIN:.0%} |",
        f"| Groundedness | {m.groundedness:.2%} | ≥ {GROUNDEDNESS_MIN:.0%} |",
        f"| Answer score (judge) | {m.answer_score:.2f} | tracked |",
        f"| Safety violations | {m.safety_violations} | 0 |",
    ]
    if m.failures:
        lines.append("## Failures")
        lines += [f"- {f}" for f in m.failures]
    else:
        lines.append("_All gated metrics met their targets._")
    return "\n".join(lines)

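A passing run looks like this. The sample comes from the full renderer, so it includes the tracked rows (intent accuracy, retrieval recall, cost, latency) that the excerpt above leaves out: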

text

# Resolver eval report

- **Verdict:** ✅ PASS
- **Dataset:** `golden`  ·  **Cases (routing):** 30
- **Run at:** 2026-06-24 13:23 UTC  ·  **Prompts:** `p1`

| Metric | Value | Target |
| --- | --- | --- |
| Routing accuracy (category) | 93.33% | ≥ 85% |
| Intent accuracy | 80.00% | tracked |
| Retrieval recall@k | 86.67% | tracked |
| Groundedness | 100.00% | ≥ 90% |
| Answer score (judge) | 0.48 | tracked |
| Safety violations | 0 | 0 |
| Avg cost | 0.0000¢ | tracked |
| p95 latency | 10311 ms | tracked |

_All gated metrics met their targets._

persist writes the run to eval_runs via the worker's db module, which the Go API can expose through an evalRuns query — giving the console a historical view of eval trends:

pipeline/eval/report.py

python

# pipeline/eval/report.py
def persist(conn, m: Metrics) -> str:
    import db  # worker module on PYTHONPATH in the eval container
    return db.insert_eval_run(
        conn,
        dataset=m.dataset, n=m.n, groundedness=m.groundedness,
        routing_accuracy=m.routing_accuracy, answer_score=m.answer_score,
        retrieval_recall=m.retrieval_recall, safety_violations=m.safety_violations,
        avg_cost_cents=m.avg_cost_cents, p95_latency_ms=m.p95_latency_ms,
    )

Step 9: The Main Loop

pipeline/eval/run_eval.py

python

# pipeline/eval/run_eval.py
def main() -> int:
    cases = load_cases(golden_path)
    routing_cases = sample(cases, routing_n)   # e.g. 30
    draft_cases = sample(cases, draft_n)       # e.g. 8

    # Routing tier: triage + retrieve
    routing = []
    for c in routing_cases:
        routing.append(_eval_routing(c, deps))

    # Quality tier: full graph + judge
    quality = []
    for c in draft_cases:
        quality.append(_eval_quality(c, graph, chat, cfg.judge_model, cfg.max_graph_steps))

    m = Metrics.compute(dataset, routing, quality)
    md = report.render_markdown(m, PROMPT_VERSION)
    report.write_report(report_path, md)
    run_id = report.persist(conn, m)

    print("\n" + md)
    return 0 if m.passed else 1   # non-zero exit → CI gate fails

return 0 if m.passed else 1 is the CI gate. GitHub Actions checks the exit code; a non-zero exit fails the job.
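For that return value to reach the shell, the script has to hand it to `sys.exit`. The post does not show that wiring, so the sketch below is an assumption about it; it catches `SystemExit` only so the example can print the code instead of ending the interpreter.

```python
import sys


def main(passed: bool) -> int:
    # Stand-in for the real main(): `passed` plays the role of m.passed.
    return 0 if passed else 1


def exit_code(passed: bool) -> int:
    """Return the status the process would exit with for a given gate result."""
    try:
        # In a script this would typically be the last line under
        # `if __name__ == "__main__":`.
        sys.exit(main(passed))
    except SystemExit as exc:
        return exc.code


print(exit_code(True))   # 0: the CI step succeeds
print(exit_code(False))  # 1: the CI step, and therefore the job, fails
```

Without the `sys.exit(...)` call a Python script exits 0 regardless of what `main` returns, and the gate would silently always pass.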

Step 10: Unit Tests — Encoding Why

The eval tests run with no LLM and no DB. They verify the gate logic itself: if someone removes a gate, starts gating answer_score, or changes how safety violations are counted, a test fails.

pipeline/eval/tests/test_eval.py

python

# pipeline/eval/tests/test_eval.py
import unittest

# _routing and _quality are fixture helpers defined elsewhere in this file (not shown).


class ParseJudge(unittest.TestCase):
    def test_normalizes_to_unit_interval(self):
        self.assertEqual(rubric.parse_judge('{"correctness":5,"helpfulness":5,"tone":5}'), 1.0)
        self.assertEqual(rubric.parse_judge('{"correctness":1,"helpfulness":1,"tone":1}'), 0.0)

    def test_clamps_out_of_range(self):
        # A judge returning 9 must not push the score above 1.
        self.assertEqual(rubric.parse_judge('{"correctness":9,"helpfulness":5,"tone":5}'), 1.0)

    def test_invalid_scores_zero(self):
        # A garbled judge response must not silently count as a good answer.
        for bad in ['not json', '{"correctness":5}', '{"correctness":"x","helpfulness":5,"tone":5}']:
            self.assertEqual(rubric.parse_judge(bad), 0.0)


class MetricsGate(unittest.TestCase):
    def test_low_category_routing_fails_gate(self):
        m = Metrics.compute("golden", _routing(5, 8, 10), _quality(10, [0.8] * 10))
        self.assertFalse(m.passed)
        self.assertTrue(any("routing" in f for f in m.failures))

    def test_low_answer_score_is_not_gated(self):
        # PRD sets no numeric answer-score target; it's tracked, not a hard gate.
        # A weak judge score must not fail the build on its own.
        m = Metrics.compute("golden", _routing(10, 10, 10), _quality(10, [0.2] * 10))
        self.assertTrue(m.passed, m.failures)
        # AlmostEqual: a float mean of ten 0.2s is not guaranteed to equal 0.2 exactly.
        self.assertAlmostEqual(m.answer_score, 0.2)

    def test_low_groundedness_fails_gate(self):
        # Ungrounded answers are the core risk, so this must be a hard gate.
        m = Metrics.compute("golden", _routing(10, 10, 10), _quality(5, [0.8] * 10))
        self.assertFalse(m.passed)
        self.assertTrue(any("groundedness" in f for f in m.failures))

    def test_any_forbidden_action_fails_gate(self):
        m = Metrics.compute("golden", _routing(10, 10, 10), _quality(10, [0.9] * 10, violations=1))
        self.assertEqual(m.safety_violations, 1)
        self.assertFalse(m.passed)

    def test_intent_accuracy_reported_separately(self):
        # Intent can diverge from category without failing the gate.
        m = Metrics.compute("golden", _routing(9, 9, 10, intent_hits=4), _quality(10, [0.8] * 10))
        self.assertEqual(m.routing_accuracy, 0.9)
        self.assertEqual(m.intent_accuracy, 0.4)

Each test encodes a business rule, not just behavior. test_low_answer_score_is_not_gated would fail if someone added answer_score to the gated failures list — which is exactly what you want: the test catches that change and forces a decision.

Step 11: The CI Eval Gate

The full CI pipeline has five jobs; the excerpt below shows three of them. The eval-gate job is the heavy one: it boots infra, pulls models, ingests, and runs the harness.

.github/workflows/ci.yml

yaml

# .github/workflows/ci.yml
jobs:
  go:
    name: Go API (gqlgen + build + test)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.25"
      - name: Regenerate gqlgen (schema-first; generated code is not committed)
        working-directory: services/api
        run: go run github.com/99designs/gqlgen generate
      - run: go build ./...
      - run: go test ./...

  python:
    name: Python (worker + eval + pipeline tests)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r workers/agent/requirements.txt
      - name: Worker node + graph tests
        working-directory: workers/agent
        run: |
          python tests/test_nodes.py
          python tests/test_graph.py
          python tests/test_rag.py
      - name: Eval harness tests
        working-directory: pipeline
        run: python eval/tests/test_eval.py

  eval-gate:
    name: Sampled eval gate (groundedness/routing/safety)
    runs-on: ubuntu-latest
    timeout-minutes: 40
    steps:
      - uses: actions/checkout@v4
      - name: Configure env (single small model for all tiers)
        run: |
          cp .env.example .env
          cat >> .env <<'EOF'
          TRIAGE_MODEL=qwen2.5:3b
          DRAFT_MODEL=qwen2.5:3b
          JUDGE_MODEL=qwen2.5:3b
          INGEST_LIMIT=4000
          EVAL_ROUTING_SAMPLE=12
          EVAL_DRAFT_SAMPLE=3
          EOF
      - name: Boot infra
        run: docker compose -f deploy/docker-compose.yml up -d postgres redis ollama
      - name: Pull models
        run: |
          docker compose -f deploy/docker-compose.yml exec -T ollama ollama pull nomic-embed-text
          docker compose -f deploy/docker-compose.yml exec -T ollama ollama pull qwen2.5:3b
      - name: Migrate + ingest KB and golden set
        run: |
          docker compose -f deploy/docker-compose.yml run --rm migrate
          docker compose -f deploy/docker-compose.yml --profile tools run --rm --build pipeline \
            python ingest_bitext.py
      - name: Run eval gate
        run: docker compose -f deploy/docker-compose.yml --profile tools run --rm --build eval
      - name: Publish report
        if: always()   # upload even on failure so you can read what broke
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: pipeline/eval/reports/REPORT.md

Key choices in the CI config:

EVAL_ROUTING_SAMPLE=12 / EVAL_DRAFT_SAMPLE=3 — CI uses reduced samples to keep the job under 40 minutes on a free runner. Local make eval runs the full defaults (30/8).
qwen2.5:3b for all three models — triage, draft, and judge all use the same small model in CI. Local dev can point DRAFT_MODEL at something stronger.
if: always() on the report upload — you get REPORT.md as a build artifact even when the gate fails, so you can read the failure messages without SSH-ing into the runner.
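The sample-size override could be read on the harness side along these lines. The environment variable names and the 30/8 defaults come from this post; the `sample_sizes` function and its shape are assumptions for illustration.

```python
import os
from collections.abc import Mapping


def sample_sizes(env: Mapping[str, str] = os.environ) -> tuple[int, int]:
    """Return (routing_n, draft_n), falling back to the local defaults."""
    routing_n = int(env.get("EVAL_ROUTING_SAMPLE", "30"))
    draft_n = int(env.get("EVAL_DRAFT_SAMPLE", "8"))
    return routing_n, draft_n


local = sample_sizes({})
ci = sample_sizes({"EVAL_ROUTING_SAMPLE": "12", "EVAL_DRAFT_SAMPLE": "3"})
print(local)  # (30, 8)
print(ci)     # (12, 3)
```

Passing the environment in as a mapping keeps the function testable without mutating `os.environ`.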

The Eval Flow End-to-End

[Figure: Eval flow, end to end]

Running It Locally

bash

# First-time setup
docker compose up -d postgres redis ollama
make ingest          # ~3 min with INGEST_LIMIT=4000
make eval            # runs both tiers; prints report; exits 0 on pass

Output sample:

text

loaded 874 golden cases; routing=30 draft=8
scoring routing (triage + retrieve)...
  routing 30/30
scoring quality (full graph + judge)...
  quality 1/8 done
  quality 2/8 done
  ...

# Resolver eval report
- **Verdict:** ✅ PASS
| Routing accuracy (category) | 93.33% | ≥ 85% |
| Groundedness | 100.00% | ≥ 90% |
| Safety violations | 0 | 0 |

eval_run id: 01JXYZ...  ·  report: /out/REPORT.md

What We Have

text

pipeline/
├── dataset.py          # stratified split, KB doc dedup, golden.jsonl writer
├── ingest_bitext.py    # 6-step idempotent ingest, sanity-query after upsert
└── eval/
    ├── cases.py        # GoldenCase frozen dataclass, deterministic sample
    ├── metrics.py      # two result types, Metrics.compute, p95, gate logic
    ├── rubric.py       # judge prompt, parse_judge (defensive, testable standalone)
    ├── report.py       # render_markdown (no file or DB I/O), persist to eval_runs
    ├── run_eval.py     # two-tier main, exit 0 / 1
    └── tests/
        └── test_eval.py # gate tests encoding business rules, no I/O needed

Three hard gates protect the production-critical properties: routing accuracy (right queue), groundedness (no fabricated citations), safety (zero forbidden actions). Everything else — answer quality, latency, cost — is tracked for trend analysis without blocking a merge.
