Sample report · 18-page PDF, anonymised
vlabs-audit · production audit pipeline

Signed model reports your customers can verify.

Drop in a model. Get a signed, reproducible 18-page PDF in under five minutes. Every chart, score, seed, and trace links back to source evidence. Export for procurement, safety review, investor diligence, or internal release gates.

3 envs sampled
15 episodes total
<5 min wall-clock per audit
18 pp anonymised PDF
Install

Two pip installs.

vlabs-audit lives in the verifiable-labs-envs monorepo as a sibling of the SDK. Both install in editable mode from the repo root.

install.sh
$ git clone https://github.com/verifiablelabs/verifiable-labs-envs
$ cd verifiable-labs-envs
$ pip install -e packages/verifiable-labs
$ pip install -e tools/vlabs-audit
Run an audit

One command, three environments, 15 episodes.

The CLI shells out to `verifiable run`, schedules everything in a local SQLite store, then renders the LaTeX to PDF via tectonic. Anonymisation is opt-in. Runtime scales roughly linearly: `--episodes 30` per env runs in ~22 minutes on the same three environments.

vlabs-audit.sh
$ vlabs-audit audit \
    --model anthropic/claude-haiku-4.5 \
    --envs sparse-fourier-recovery,phase-retrieval,mri-knee-reconstruction \
    --episodes 5 \
    --alpha 0.1 \
    --parallel 4 \
    --anonymize \
    --output report.pdf

✓ 3 envs × 5 episodes complete
✓ Wall-clock: 4m 23s
✓ Report: report.pdf (18 pages, anonymised)
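
For a release gate, the same invocation can be driven from a script. The sketch below only assembles the argument list from the flags shown in the command above and hands it to `subprocess.run`. The helper names `build_audit_command` and `run_audit` are hypothetical, and the sketch assumes (the page does not say) that `vlabs-audit` exits non-zero when an audit fails.

```python
import subprocess
from typing import List, Sequence


def build_audit_command(
    model: str,
    envs: Sequence[str],
    episodes: int = 5,
    alpha: float = 0.1,
    parallel: int = 4,
    anonymize: bool = True,
    output: str = "report.pdf",
) -> List[str]:
    """Assemble the argv for `vlabs-audit audit` using only the flags shown above."""
    cmd = [
        "vlabs-audit", "audit",
        "--model", model,
        "--envs", ",".join(envs),
        "--episodes", str(episodes),
        "--alpha", str(alpha),
        "--parallel", str(parallel),
    ]
    if anonymize:
        # Anonymisation is opt-in, so the flag is only added on request.
        cmd.append("--anonymize")
    cmd += ["--output", output]
    return cmd


def run_audit(cmd: List[str]) -> None:
    """Run the audit. check=True raises CalledProcessError on a non-zero
    exit code, which this sketch ASSUMES signals a failed audit."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a model ID or output path contains spaces.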
What's in the report

Five sections. All numbers traceable to traces.

Every chart and table renders from the same trace JSONL files the SDK emits — anyone with the manifest can re-run the audit and get bit-identical numbers.

Per-environment capability gap

Mean reward, confidence intervals, statistical significance, and effect size on every model-vs-baseline comparison. Every cell traces back to the underlying trace JSONL.
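
To make those columns concrete, here is a minimal sketch of how a mean reward, a confidence interval, and an effect size could be computed for one model-vs-baseline cell. The page does not say which estimators the report uses; the percentile bootstrap and Cohen's d below are assumptions chosen for illustration, and the reward values are made up.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    xs: Sequence[float], alpha: float = 0.1, n_boot: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """Percentile-bootstrap (1 - alpha) confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(xs)
    means = sorted(
        statistics.fmean(rng.choices(xs, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d (difference in means over the pooled standard deviation)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5


# Illustrative per-episode rewards, not taken from any real report.
model_rewards = [0.62, 0.71, 0.58, 0.66, 0.69]
baseline_rewards = [0.41, 0.47, 0.39, 0.52, 0.44]

ci_low, ci_high = bootstrap_mean_ci(model_rewards, alpha=0.1)
effect = cohens_d(model_rewards, baseline_rewards)
```

With only five episodes per environment the interval is wide, which is one reason to raise `--episodes` before drawing conclusions from a single cell.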

Calibrated uncertainty per environment

Empirical coverage against your target, with calibration plots and a coverage curve. So you know which scores you can trust before you sign them off.
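
Empirical coverage is the fraction of episodes whose true value falls inside the predicted interval. The sketch below assumes that `--alpha 0.1` corresponds to a target coverage of 1 - alpha = 90%, which is the usual convention but is not stated on this page. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def empirical_coverage(
    intervals: Sequence[Tuple[float, float]], truths: Sequence[float]
) -> float:
    """Fraction of true values that land inside their (low, high) interval."""
    if len(intervals) != len(truths) or not truths:
        raise ValueError("need one interval per true value, and at least one")
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)


def coverage_gap(
    intervals: Sequence[Tuple[float, float]],
    truths: Sequence[float],
    alpha: float = 0.1,
) -> float:
    """Empirical coverage minus the nominal target 1 - alpha.

    Negative means the intervals are too narrow (over-confident),
    positive means they are wider than they need to be.
    """
    return empirical_coverage(intervals, truths) - (1 - alpha)
```

A gap close to zero is what the calibration plots should show when the model's stated uncertainty can be taken at face value.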

Failure mode taxonomy

Counts and percentages for parse errors, malformed JSON, timeouts, invalid shape — separated from successful-but-wrong predictions.
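
A tally like this can be rebuilt from the trace JSONL with a few lines of Python. The trace schema is not documented on this page, so the `outcome` field and its label values below are hypothetical stand-ins. Treat this as a sketch of the counting logic only.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical labels for execution failures, mirroring the categories above.
FAILURE_LABELS = ("parse_error", "malformed_json", "timeout", "invalid_shape")


def tally_outcomes(jsonl_lines: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each outcome label to (count, percentage of all episodes)."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        counts[record.get("outcome", "unknown")] += 1
    total = sum(counts.values())
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}


def execution_failure_share(table: Dict[str, Tuple[int, float]]) -> float:
    """Percentage of episodes that failed to produce a scoreable answer.

    Successful-but-wrong predictions are deliberately excluded.
    """
    return sum(pct for label, (_, pct) in table.items() if label in FAILURE_LABELS)
```

Keeping the failure labels in one tuple makes it explicit which outcomes count as pipeline failures and which count as wrong answers.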

Performance breakdown

Per-environment wall-clock, per-episode latency distribution, and a budget table — so you can size capacity for a 100× run before you commit.
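
As a rough sizing aid, the two timings quoted on this page (15 episodes in 4m 23s, and 30 episodes per env on three envs in about 22 minutes) can be joined by a straight line and extrapolated. This is a back-of-the-envelope sketch, not the report's budget table: it assumes both runs used the same `--parallel` setting, that the ~22 minute figure is exact, and that the linear trend holds at larger scale.

```python
from typing import Tuple

Point = Tuple[float, float]  # (total episodes, wall-clock seconds)

SAMPLE_RUN: Point = (15, 4 * 60 + 23)  # 3 envs x 5 episodes, 4m 23s
LARGER_RUN: Point = (90, 22 * 60)      # 3 envs x 30 episodes, ~22 minutes


def fit_line(p1: Point, p2: Point) -> Tuple[float, float]:
    """Return (seconds per episode, fixed overhead in seconds) through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def estimate_seconds(episodes: float, slope: float, intercept: float) -> float:
    """Predicted wall-clock time for a run with the given total episode count."""
    return intercept + slope * episodes


slope, intercept = fit_line(SAMPLE_RUN, LARGER_RUN)
# A "100x run" is read here as 100 x 15 = 1500 episodes (an assumption).
hours_for_100x = estimate_seconds(1500, slope, intercept) / 3600
```

Under those assumptions the fit gives roughly 14 seconds per episode plus under a minute of fixed overhead, which puts a 1500-episode run at about 5.9 hours.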

Reproducibility manifest

Seeds, model IDs, SDK version, environment versions, audit-config hash, and the sub-stage label — everything needed to re-run the audit identically.
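
One common way to produce a stable audit-config hash is to serialise the config canonically and hash the bytes. The page does not say how vlabs-audit computes its hash, so the SHA-256-over-sorted-JSON scheme and the config keys below are assumptions for illustration.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the config.

    Sorting keys and fixing separators means two configs with the same
    content hash identically regardless of key order or whitespace.
    """
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical config mirroring the flags of the sample command.
example_config = {
    "model": "anthropic/claude-haiku-4.5",
    "envs": ["sparse-fourier-recovery", "phase-retrieval", "mri-knee-reconstruction"],
    "episodes": 5,
    "alpha": 0.1,
}
digest = config_hash(example_config)
```

Anyone re-running the audit can recompute the digest from their own config and compare it with the one in the manifest before trusting that the numbers should match.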

Use it for

Three places an audit earns its cost.

The same PDF answers a different question depending on the audience.

Pre-deployment

Pre-deployment validation

Run a 30-episode audit before deploying a new model. Verify parse rates, calibration coverage, and tail behaviour against a fixed reference, with no possibility of internet contamination.

Versioning

Model version comparison

When a provider ships a new checkpoint, compare it against the previous version on identical, procedurally regenerated problems. Deterministic seeds, paired reports.
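
The pairing idea can be sketched as follows: derive each episode's seed deterministically from the environment name and episode index, so both checkpoints face the same problem, then compare rewards seed by seed. The seed-derivation scheme and function names are hypothetical. The page only states that seeds are deterministic and reports are paired.

```python
import hashlib
import statistics
from typing import Dict


def episode_seed(env: str, episode: int, base_seed: int = 0) -> int:
    """Derive a stable 64-bit seed from (base_seed, env, episode).

    Hashing avoids Python's per-process randomised hash(), so the same
    inputs give the same seed on every machine and every run.
    """
    digest = hashlib.sha256(f"{base_seed}:{env}:{episode}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def paired_mean_difference(old: Dict[int, float], new: Dict[int, float]) -> float:
    """Mean of (new - old) reward over the seeds both runs share."""
    shared = sorted(old.keys() & new.keys())
    if not shared:
        raise ValueError("the two runs share no seeds, so they cannot be paired")
    return statistics.fmean(new[s] - old[s] for s in shared)
```

Comparing per-seed differences removes problem-to-problem variance from the comparison, so a real change between checkpoints shows up with fewer episodes than an unpaired comparison would need.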

Compliance

Compliance & due diligence

Anonymised PDFs ready for procurement, compliance, or investor due-diligence packets. Every chart is reproducible from documented seeds.

Add signed audits to your release process.

The sample report is anonymised. The same pipeline produces a non-anonymised version with model and provider names embedded, ready for procurement, compliance, or board review. Open the source, or talk to the founders about an enterprise contract.