Guide
Writing an eval suite
Cases, evaluators, thresholds — the structure of a regression suite.
A suite is a declarative bundle of cases plus evaluators plus a threshold. The registry runs it on every promotion and refuses to move the live pointer when any threshold fails.
#Anatomy
version: 0.1.0
schema: aldo-ai/eval.v1
suite:
name: changelog-quality
cases_dataset: dataset:changelog-cases-v1
evaluators:
- name: rubric
type: llm-judge
capability_class: reasoning-large
rubric: |
Score 1-5: clarity, completeness, accuracy.
- name: must-mention-version
type: contains
ref: version-tag
threshold:
rubric: 4.0
must-mention-version: 1.00#Cases
Cases live in datasets. Every row is {input, ground_truth?, tags?}.
Upload a dataset with the aldo dataset push CLI or via
POST /v1/datasets.
#Picking evaluators
- Use
contains/regex/exactwhen you have a hard ground truth. Cheap, fast, deterministic. - Use
llm-judgewhen the rubric is qualitative. Always declare it by capability class, never by model name. - Use
scriptwhen the rubric is computable but not regex expressible (e.g. the response is JSON; check that all required fields are present and well-formed).
#Thresholds
Thresholds are per evaluator. The suite passes only if every evaluator's mean score over the case set meets the threshold. There is no aggregation across evaluators — each is its own gate.
#Iterating
Run the suite ad-hoc with:
aldo eval run --suite ./changelog-quality.yamlOr on the Sweeps page in the control plane, against a matrix of models, to see how each capability class performs.
#Versioning
Suites are versioned the same way agents are. Bumping a suite without changing the agent is fine — promotions read the latest suite version.