Solutions

A release gate for every team shipping agents.

From a single prompt change to a whole RL checkpoint, add a clear ship / limit / block decision before changes reach users.

Book a demo Watch demo

One layer, every change

Every prompt, tool, or model change is a release.

Whoever ships it, the question is the same: does this hold beyond the tests we can see — and did it earn the score or game the verifier? We answer it the same way for every team.

Agent product teams

Ship prompt and tool changes without guessing

Every prompt tweak, model swap, or new tool is a release. Run it through the gate first for a clear ship / limit / block call — not a vibe check.

Catch regressions a public eval can't see
Block changes that overfit before users feel them
An immutable, reviewable record behind every release

Assurance Card · example

decision: BLOCK
reasons: ood_regressed · hidden_regressed
public: 0.740 → 0.910
hidden: 0.732 → 0.611
ood: 0.701 → 0.488
record: redacted · reviewable

✗ candidate not promoted

RL & fine-tuning teams

Certify the verifier before your model games it

GRPO and RFT reward the shortcut the moment one exists. Certify the verifier's gameability up front, then re-grade every checkpoint without an LLM judge — Isomorphic Perturbation Testing (IPT) catches solution-level hardcoding, a trajectory scan catches test/grader tampering.

Drops into your RLVR/RFT reward as a deterministic grader — no LLM judge
Re-checks wins on unseen task variants + structural hardcode detectors + trajectory tamper scan
Runs in your trust domain — reference and tests stay local

Eval, safety & enterprise teams

Prove what shipped, without exposing what's private

Give security and compliance an immutable, redacted record for every release — with a named-human sign-off on high-risk limits.

Redacted assurance card on every decision
Private evaluation boundaries; human approval gate
Bring-your-own-key and private deployment paths available

What every team gets

The same release gate, wherever you ship

Gate policies

Declare which checks block, limit, or stay advisory — and a LIMIT on a high-risk agent waits for a named-human sign-off before anything ships.

CI & training-loop gate

Call the API, drop the grader into your reward loop, or run the GitHub Action to fail the build on a shortcut.

Works with your stack

Runs in your trust domain with no LLM judge in the loop — bring your own models, frameworks, and keys.

Improve what fails. Ship what holds.

Bring a baseline and a candidate.

Book a demo See the demo

vlabs · clean-gate

$ vlabs clean-gate base.json cand.json

Re-graded on regenerated inputs0 model calls

Compared baseline vs candidate

public tests0.74 → 0.91✓ improved

IPT re-grade0.74 → 0.48✗ regressed

decisionBLOCKgamed the public tests

✗ shipped nothing — the gain didn't hold

deterministic · 0 model calls · $0 per scan