The verification and audit infrastructure layer for AI: fresh evaluation tasks, objective rewards, calibrated uncertainty, and signed reports, so every model claim can be tested, reproduced, and trusted.

25 evaluation envs · 6 capability domains · signed receipts · self-hosted or VPC · Apache-2.0 SDK

verifiable-labs · audit v0.1.0a
command
$ verifiable audit --env sparse-fourier-recovery --model claude-haiku-4.5 --episodes 30
status: calibrated · 1m 24s
env · reward · coverage
math-algebra · 0.740 · 89.1 %
code-humaneval · 0.790 · 91.2 %
sql-multiturn · 0.820 · 90.4 %
sparse-fourier-recovery · 0.364 · 88.7 %
30 episodes · conformal 0.901 ± 0.017

Verification protocol validated across five model families

Claude (Anthropic) · Microsoft · Qwen (Alibaba) · Meta (Llama) · Google (Gemma)

25 evaluation environments
6 capability domains
654 CI checks per release
90 % calibrated coverage target
<5 min to first signed audit

Contamination-resistant tasks

Every evaluation task is generated fresh from a parametric prior. No memorisation. No leakage. No replays.
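As an illustration of what "generated fresh from a parametric prior" can mean, here is a minimal sketch in which a seed fully determines one task instance. The function name, the prior (sparsity level, support, amplitude range), and the output fields are all assumptions made for this example, not the SDK's actual generator.

```python
import random

def sample_sparse_task(seed: int, n: int = 64, k_max: int = 5) -> dict:
    """Draw one hypothetical sparse-recovery task from a simple parametric prior."""
    rng = random.Random(seed)                    # the seed fully determines the task
    k = rng.randint(1, k_max)                    # sparsity level drawn from the prior
    support = sorted(rng.sample(range(n), k))    # indices of the non-zero coefficients
    amplitudes = [round(rng.uniform(0.5, 2.0), 3) for _ in support]
    return {"seed": seed, "n": n, "support": support, "amplitudes": amplitudes}

# The same seed reproduces the same task; a new seed yields a new draw.
task_a = sample_sparse_task(84291)
task_b = sample_sparse_task(84291)
task_c = sample_sparse_task(84292)
```

Because each instance is sampled at evaluation time rather than read from a fixed file, there is no static test set to memorise, while the recorded seed still lets anyone regenerate the exact task.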

Enterprise deployment

Self-hosted or VPC. Founder-led integration into your training, procurement, and release-gate workflows.

0 leaked benchmarks

every task procedurally generated · contamination structurally impossible

Evaluation calls in production

Calibration and verification calls, last six months. Every run is a traced, signed receipt, visible to whoever holds the API key.

The verification layer

Static benchmarks give you a score. We give you evidence.

AI teams don't need another scoreboard. They need reproducible reports their post-training leads, procurement teams, and compliance reviewers can verify, across six capability domains.

Verifiable Labs (verification infrastructure) · Static benchmarks (frozen scoreboards) · Human-label vendors (labelled training data) · Generic eval harness (run-your-own evals) · Internal PDFs (one-off reports)

Coverage
Live evaluation environments: 25 · Frozen · Custom-built · Plug your own · Ad hoc
Capability domains covered: 6 · 1–2 · 1–2 · 1–3 · 1

Evidence quality
Calibrated uncertainty on every reward
Contamination-resistant by construction
Frontier-model baselines on every task

Procurement & release gates
Signed audit receipts (X.509)
Reproducible reports linked to every seed
Open-source SDK (Apache-2.0)

Competitor data based on public marketing as of 2026. Corrections welcome via GitHub issues.

vlabs-audit

Signed model reports in minutes.

vlabs-audit turns any model evaluation into a reproducible, signed PDF. Every chart, score, seed, and trace links back to source evidence. Export for procurement, safety review, investor diligence, or internal release gates.
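One way to picture how a report can link back to source evidence is a deterministic digest over the run's metadata. The sketch below is an assumption-laden stand-in: the field names and the `receipt_digest` helper are hypothetical, and it only hashes the receipt. A real signed receipt would additionally sign bytes like these with a private key tied to an X.509 certificate.

```python
import hashlib
import json

def receipt_digest(receipt: dict) -> str:
    """Hash a receipt in canonical form so key order never changes the digest."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical receipt fields, filled with values shown elsewhere on this page.
receipt = {
    "env": "sparse-fourier-recovery",
    "model": "claude-haiku-4.5",
    "episodes": 30,
    "seed": 84291,
    "reward": 0.364,
}
digest = receipt_digest(receipt)
```

Anyone who re-runs the audit with the same seed and obtains the same fields can recompute the digest and compare it against the one in the report.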

One signed audit, in numbers
3 envs sampled
15 episodes total
<5 min wall-clock per audit
18 pp anonymised PDF
Example audit · Claude Haiku 4.5 · See full report

Audit is the wedge · Reward infrastructure is the platform

From audits to training signals.

Every Verifiable Labs run produces more than a report: fresh task instances, objective rewards, calibrated uncertainty, paired traces, and a failure-mode taxonomy. Use it first to verify model claims. Then plug the same reward layer into your post-training workflows to improve models with evidence you can trust.

Audit output → Training input
Signed PDF report → Procurement, compliance, board review
Paired JSONL traces (base + LoRA, identical seeds) → Reward-model training data, proprietary by design
Calibrated reward intervals (90 % conformal coverage) → CI release gates, post-training reward function
Failure-mode taxonomy (parse, format, semantic) → Curriculum and target distribution for the next run

The first product is audit-grade verification. The platform is reward infrastructure for post-training. Same protocol, same calibration, same traces. Procurement signs the receipt; your training loop consumes the reward.
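To make the release-gate use concrete, here is a minimal sketch of a CI check that consumes two calibrated reward intervals. The `release_gate` function and its three outcomes are assumptions for illustration, not part of the SDK. The rule is deliberately conservative: a candidate passes only when its whole interval sits above the baseline's.

```python
def release_gate(baseline: tuple, candidate: tuple) -> str:
    """Compare two (low, high) reward intervals and return a gate decision."""
    b_low, b_high = baseline
    c_low, c_high = candidate
    if c_low > b_high:
        return "pass"            # candidate is better even in the worst case
    if c_high < b_low:
        return "fail"            # candidate is worse even in the best case
    return "inconclusive"        # intervals overlap: gather more episodes

# Illustrative intervals only; the baseline reuses the example interval from this page.
decision = release_gate(baseline=(0.446, 0.530), candidate=(0.500, 0.580))
```

Here the intervals overlap, so the decision is "inconclusive" rather than a pass based on a higher point estimate alone.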

SDK

One CLI. Every model. Calibrated by default.

Fast install, clean API, local-first. Drop the same verifiable audit call into any post-training pipeline or CI release gate. No account required. Same trace shape across all 25 environments.

  • math
  • code
  • sql
  • tools
  • long-context
  • imaging

Six capability domains. One interface.

Math reasoning, code execution, SQL execution, tool orchestration, long-context reasoning, scientific reconstruction. Same verifiable audit call, same trace shape, same calibrated reward semantics. Switch capability domains by changing one flag.

Bounded evidence on every reward.

Know when a model is actually better, not just noisier. Every score ships with calibrated uncertainty. When we say “0.488”, we also tell you true reward ∈ [0.446, 0.530] with 90 % probability. Make release decisions on bounded evidence, not point estimates.
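An interval like the one above can be produced with split conformal prediction. The sketch below assumes absolute residuals from a held-out calibration set as the nonconformity score. The function name and the toy residuals are made up for this example, with the residuals chosen so the half-width matches the interval quoted above. It is not a description of the product's calibration pipeline.

```python
import math

def conformal_interval(point: float, calibration_residuals: list, alpha: float = 0.10):
    """Split-conformal interval around a point estimate using absolute residuals."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))      # finite-sample corrected rank
    if rank > n:
        raise ValueError("too few calibration points for this alpha")
    half_width = scores[rank - 1]                # the rank-th smallest score
    return point - half_width, point + half_width

# Toy calibration residuals (estimated reward minus true reward), 19 points.
residuals = [0.004, -0.007, 0.010, -0.012, 0.015, 0.017, -0.019, 0.021, 0.023,
             -0.025, 0.027, 0.029, -0.031, 0.033, 0.035, -0.037, 0.039, 0.042, -0.061]
low, high = conformal_interval(0.488, residuals)  # approximately (0.446, 0.530)
```

With 19 calibration points and alpha = 0.10, the rank is 18, so the half-width is the 18th smallest absolute residual (0.042 here). Under exchangeability of calibration and test points, intervals built this way cover the true value at least 90 % of the time on average.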

--multiturn --seed 84291
Yesterday 14:24 UTC is the latest CI-tested commit.
FAQ

Buyer questions, answered straight.

What customers, procurement teams, and investors ask before signing. Missing one? Email us. We'll add it.

Model evaluations, end to end. 25 environments across six capability domains: math reasoning, code execution, SQL execution, tool orchestration, long-context reasoning, and scientific reconstruction. Every environment ships objective ground truth (executable or closed-form) and calibrated uncertainty on every reward, so you know how reliable each score actually is.

Replace your scoreboard with evidence.

Run a signed audit in five minutes. Open-source SDK, hosted, or self-hosted in your VPC.