Guide

Writing an eval suite

Cases, evaluators, thresholds — the structure of a regression suite.

A suite is a declarative bundle of cases, evaluators, and per-evaluator thresholds. The registry runs it on every promotion and refuses to move the live pointer if any threshold fails.

#Anatomy

version: 0.1.0
schema: aldo-ai/eval.v1
suite:
  name: changelog-quality
  cases_dataset: dataset:changelog-cases-v1
  evaluators:
    - name: rubric
      type: llm-judge
      capability_class: reasoning-large
      rubric: |
        Score 1-5: clarity, completeness, accuracy.
    - name: must-mention-version
      type: contains
      ref: version-tag
  threshold:
    rubric: 4.0
    must-mention-version: 1.00

#Cases

Cases live in datasets. Every row has the shape {input, ground_truth?, tags?}, where ? marks an optional field. Upload a dataset with the aldo dataset push CLI or via POST /v1/datasets.
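
As a rough illustration of the row shape, the sketch below validates a few rows and serializes them as JSON Lines. Only the three field names come from this guide. The JSON Lines serialization, the list-of-strings type for tags, the rejection of unknown fields, and the helper names validate_row and to_jsonl are assumptions made for the example, not a description of what the CLI or API accepts.

```python
import json

# Field names taken from the guide; only "input" is required.
ALLOWED_KEYS = {"input", "ground_truth", "tags"}


def validate_row(row: dict) -> dict:
    """Check one row against the {input, ground_truth?, tags?} shape."""
    if "input" not in row:
        raise ValueError("row is missing the required 'input' field")
    unknown = set(row) - ALLOWED_KEYS
    if unknown:
        # Assumption: extra fields are treated as an error in this sketch.
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return row


def to_jsonl(rows: list[dict]) -> str:
    """Serialize validated rows, one JSON object per line (assumed format)."""
    return "\n".join(json.dumps(validate_row(r)) for r in rows)


# Hypothetical example rows for a changelog suite.
rows = [
    {"input": "Summarize the changes in v2.3.1", "ground_truth": "v2.3.1", "tags": ["release"]},
    {"input": "Summarize the changes in v2.4.0"},  # optional fields omitted
]

print(to_jsonl(rows))
```

Validating rows locally before uploading keeps malformed cases from surfacing later as confusing evaluator failures.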

#Picking evaluators

  • Use contains/regex/exact when you have a hard ground truth. Cheap, fast, deterministic.
  • Use llm-judge when the rubric is qualitative. Always declare it by capability class, never by model name.
  • Use script when the rubric is computable but not regex-expressible (e.g. the response is JSON; check that all required fields are present and well-formed).
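
The JSON case mentioned for script evaluators could be sketched as follows. This guide does not describe the contract between the registry and a script, so the function name score_response, its signature (response text in, score out), the 0.0/1.0 scoring, and the required field names are all hypothetical.

```python
import json

# Hypothetical required fields and the type each one must have.
REQUIRED_FIELDS = {"version": str, "summary": str, "changes": list}


def score_response(response: str) -> float:
    """Return 1.0 if the response is a JSON object whose required
    fields are all present with the expected type, otherwise 0.0."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    if not isinstance(payload, dict):
        return 0.0  # valid JSON, but not an object
    for field, expected_type in REQUIRED_FIELDS.items():
        # A missing field yields None, which fails the type check.
        if not isinstance(payload.get(field), expected_type):
            return 0.0
    return 1.0
```

A check like this is awkward to express as a regex because field order, whitespace, and nesting can all vary while the response remains valid.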

#Thresholds

Thresholds are per evaluator. The suite passes only if every evaluator's mean score over the case set meets the threshold. There is no aggregation across evaluators — each is its own gate.
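
The gating rule can be sketched in a few lines. The thresholds below are the ones from the example suite; the per-case scores are made up, the function name gate_results is hypothetical, and treating "meets" as greater-than-or-equal is an assumption.

```python
from statistics import mean


def gate_results(scores: dict[str, list[float]], thresholds: dict[str, float]) -> dict[str, bool]:
    """Evaluate each evaluator's gate independently: mean score vs. its threshold."""
    # Assumption: a mean exactly equal to the threshold counts as a pass.
    return {name: mean(scores[name]) >= limit for name, limit in thresholds.items()}


# Thresholds from the example suite in this guide.
thresholds = {"rubric": 4.0, "must-mention-version": 1.00}

# Made-up per-case scores for four cases.
scores = {
    "rubric": [5, 4, 4, 5],                # mean 4.5
    "must-mention-version": [1, 1, 0, 1],  # mean 0.75
}

gates = gate_results(scores, thresholds)
suite_passed = all(gates.values())  # every gate must pass

print(gates)         # {'rubric': True, 'must-mention-version': False}
print(suite_passed)  # False
```

In this example the strong rubric mean does not compensate for the single case that failed to mention the version, so the suite fails.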

#Iterating

Run the suite ad-hoc with:

aldo eval run --suite ./changelog-quality.yaml

Alternatively, run it from the Sweeps page in the control plane against a matrix of models to see how each capability class performs.

#Versioning

Suites are versioned the same way agents are. Bumping a suite without changing the agent is fine — promotions read the latest suite version.

Source: apps/web/content/docs/guides/writing-an-eval-suite.md