Verifiable RL environments for scientific AI agents.

Open-source SDK for training and evaluating LLM agents on scientific inverse problems. Ten environments, conformal-calibrated rewards, fresh instances every run.

$ pip install verifiable-labs
DOI 10.5281/zenodo.19786415 · OpenReview 4kQ17M7jeg · Live on Prime Intellect Hub
10 environments
5 frontier models
50 comparisons
32 significant
5 scientific domains
Why it works

Three primitives. Zero gameable signal.

Verifiable Labs sits on three orthogonal guarantees. Together they make benchmark gaming structurally infeasible.

Mathematical ground truth

Every environment is an inverse problem with a closed-form or provably convergent classical solver. The reward is a measurable distance, so there is nothing to game.

reward = ‖x̂ − x‖₂ / ‖x‖₂
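
A minimal sketch of that reward in plain numpy; the function name and array arguments are illustrative, not the SDK's API:

import numpy as np

def relative_error_reward(x_hat: np.ndarray, x: np.ndarray) -> float:
    # Relative L2 distance between the agent's estimate and the ground truth.
    # 0.0 means exact recovery; there is no judge or preference model in the loop.
    return float(np.linalg.norm(x_hat - x) / np.linalg.norm(x))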

Procedural regeneration

Problem instances are sampled at runtime from physical priors. By construction the model has never observed the exact instance, so contamination is structurally impossible.

x = sample(prior, seed=run_id)
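
A sketch of the same idea, with a hypothetical sparse-spike prior standing in for the physical prior and the run id seeding the generator:

import numpy as np

def sample_instance(run_id: int, dim: int = 256) -> np.ndarray:
    # Seed the generator with the run id: every run draws a fresh,
    # yet fully reproducible, problem instance.
    rng = np.random.default_rng(run_id)
    # Hypothetical "physical prior": a few spikes with Gaussian amplitudes.
    support = rng.choice(dim, size=8, replace=False)
    x = np.zeros(dim)
    x[support] = rng.normal(loc=0.0, scale=1.0, size=8)
    return x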

Conformal-calibrated rewards

Reward intervals are calibrated to a target coverage with split-conformal prediction. You get verifiable bounds rather than point estimates.

I = [ŷ − q̂_{1−α}, ŷ + q̂_{1−α}],  P(y ∈ I) ≥ 1 − α
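
A minimal split-conformal sketch, assuming a held-out calibration set of predicted and true rewards; the names are illustrative, not the SDK's API:

import numpy as np

def conformal_interval(cal_pred, cal_true, new_pred, alpha=0.10):
    # Nonconformity scores on the held-out calibration split.
    scores = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(scores)
    # Finite-sample corrected quantile level: ceil((n+1)(1-alpha)) / n.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(scores, q_level, method="higher")
    # Marginal guarantee: P(y in [new_pred - q_hat, new_pred + q_hat]) >= 1 - alpha.
    return new_pred - q_hat, new_pred + q_hat
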
The substrate for verifiable RL

Built to help AI labs ship reproducible, contamination-proof experiments.

Four primitives that fit straight into your post-training, evaluation, or red-team pipeline. No infra rewrite required.

Procedural regeneration

Fresh problem instances on every run.

Each episode samples a brand-new inverse problem from a physical prior. By construction the model has never observed this exact instance, so contamination is structurally impossible rather than statistically unlikely.

Conformal calibration

Verifiable rewards, not vibe checks.

Every reward ships with a split-conformal interval and a documented coverage target. Audit empirical coverage against that target on every run, and the harness fails loudly the moment calibration drifts.
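
A sketch of what such an audit can look like, under the assumption that the harness records each episode's realized reward and its interval; the names and failure behaviour here are illustrative:

import numpy as np

def audit_coverage(y_true, lower, upper, target=0.90, tol=0.02):
    y_true = np.asarray(y_true)
    # Fraction of episodes whose realized reward landed inside its interval.
    covered = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    empirical = float(covered.mean())
    # Fail loudly the moment empirical coverage drifts below the documented target.
    if empirical < target - tol:
        raise RuntimeError(f"coverage {empirical:.3f} fell below target {target:.2f}")
    return empirical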

Frontier-model native

Any model, one CLI flag.

Swap Anthropic, OpenAI, Google, or a local model with a single argument. The protocol, the seeds, and the conformal target stay identical, so the same harness drops into any post-training or eval pipeline.

Open and peer-reviewed

Apache-2.0 code. Frozen DOI. Live hub.

The protocol lives on Zenodo with a frozen DOI. The peer-review thread is on OpenReview. The environments run live on Prime Intellect. Every piece is auditable, citable, and reproducible without an account.

Platform

Built on a rigorous foundation.

Four primitives engineered together. Contamination, Goodhart drift, and silent regressions stay out of reach by construction.

Closed-form ground truth

Every environment is an inverse problem with a classical solver that has provable convergence. Fienup, FBP, TV-regularized least-squares, OMP. The reward is a measurable distance to that solver, so there is nothing for a graded preference to drift on.
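
As one example, orthogonal matching pursuit fits in a few lines; this is a textbook sketch of that classical baseline, not the environment's reference implementation:

import numpy as np

def omp(A: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    # Orthogonal matching pursuit: greedily recover a k-sparse x from y = A @ x.
    residual, support = y.copy(), []
    x = np.zeros(A.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Re-fit least squares on the selected support and update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x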

Procedural problem generator

Each run samples a fresh instance from a physical prior at runtime. By construction the model has never observed this exact problem before, so contamination is mechanically impossible rather than statistically unlikely.

Provider-agnostic SDK

Swap Anthropic, OpenAI, Google, or local with one CLI flag. The protocol, the seeds, and the conformal target stay identical across providers, so the same harness drops into any post-training or eval pipeline.

Distribution-free coverage

Split-conformal calibration ships a marginal coverage guarantee at every reward step. Pick a target, audit empirical coverage on every run, and fail loudly when calibration drifts.

All ten environments. One CLI flag.

Five frontier models. Real classical baselines. Reproducible scores. Filter by scientific domain.

Methodology and guarantees.

Every claim has a citation. Every number has a confidence interval.


Empirical coverage of 0.901 ± 0.017 against a 0.90 target, distribution-free and finite-sample. Run --episodes 100 yourself; the calibration set is shipped with each environment.


Every run records the seed and prior parameters. Re-run on a different machine with the same flags and get bit-identical instances. No silent seed drift between releases.
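
One way to spot-check that property, sketched with a stand-in generator rather than the SDK's own: hash the raw bytes of a seeded draw and compare digests across runs or machines.

import hashlib
import numpy as np

def instance_digest(seed: int, dim: int = 256) -> str:
    # Hash the raw bytes of a seeded draw; identical seeds must yield identical digests.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    return hashlib.sha256(x.tobytes()).hexdigest()

# Same seed, any machine: the digests must match exactly.
assert instance_digest(42) == instance_digest(42)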


Each environment ships its solver source, hyperparameters, and a provenance log. Audit any reward in under three minutes against a reference implementation.


Code is Apache-2.0, paper is CC-BY-4.0. Frozen DOI on Zenodo, open peer-review thread on OpenReview, live environments on Prime Intellect Hub.


SDK

Frontier models. Ten years of research. Four minutes.

One CLI. One reproducible seed. Conformal coverage guarantees printed on every run. Drop into your existing post-training pipeline.

  • Deterministic seeds. The same flag yields a bit-identical instance on any machine.
  • Documented coverage. Every reward ships with a calibrated split-conformal interval.
  • JAX-native. One run takes minutes on a single H100 instead of hours on a node.
bench.sh
$ pip install verifiable-labs
$ verifiable run --env mri-knee-reconstruction \
    --model claude-opus-4 --seed 42

 Loading environment...
 Calibrating conformal interval (target 0.90)...
 Running 100 episodes...

Mean reward:   0.488 ± 0.042
Coverage:      0.901 (target 0.90) ✓
Time:          4m 23s · Cost: $0.18
Backed by research

Reproducible. Archived. Peer-reviewed.

The protocol lives in three persistent records. Artifacts on Zenodo. Peer review on OpenReview. Environments live on Prime Intellect.

Stop benchmarking on contaminated data.

Ten environments. Five frontier-model baselines. Conformal-calibrated rewards. All open source, peer-reviewed, and ready to run.

Run your first benchmark in under five minutes.