Concept
Eval harness
Suites, sweeps, evaluators, datasets — the engine for promotion gating.
Promotion in ALDO AI is eval-gated. The registry refuses to promote a new version unless every suite the spec names passes its declared threshold. Here is the model.
#Suites
A suite is a declarative bundle of cases plus evaluators plus a threshold. It is itself versioned and lives in the registry.
suite:
name: code-review
cases_dataset: dataset:code-review-v3
evaluators:
- name: rubric
type: llm-judge
capability_class: reasoning-large
- name: contains
type: contains
ref: ground-truth
threshold:
rubric: 0.85
contains: 1.00#Cases
A case is a single input (and optional ground-truth) the suite runs the agent against. Cases live in datasets — see Dataset uploads.
#Evaluators
Built-in evaluator types:
contains— the response contains the expected string.regex— the response matches a regex.exact— the response equals the expected value.llm-judge— score with another agent (configured by capability class, not model name).script— run a sandboxed JS evaluator.
Custom evaluators are first-class — see Custom evaluators.
#Sweeps
A sweep runs a suite against a matrix of model × spec. Use it to compare a frontier cloud model and a local model on the same agent spec — the canonical "should we ship a local-only build?" question.
The sweeps page renders a radar chart per evaluator and a bar chart of total cost so the trade-off is visible.
#Promotion gate
The registry's promote endpoint runs every named suite and only
flips the live pointer when every threshold passes. A passing
sweep on a non-named suite does NOT count — the spec must declare
which suites are blocking.