Verifiable RL environments for scientific AI agents.
Open-source SDK for training and evaluating LLM agents on scientific inverse problems. Ten environments, conformal-calibrated rewards, fresh instances every run.
Three primitives. Zero gameable signal.
Verifiable Labs sits on three orthogonal guarantees. Together they make benchmark gaming structurally infeasible.
Mathematical ground truth
Every environment is an inverse problem with a closed-form or provably convergent classical solver. The reward is a measurable distance, so there is nothing to game.
Procedural regeneration
Problem instances are sampled at runtime from physical priors. By construction the model has never observed the exact instance, so contamination is structurally impossible.
Conformal-calibrated rewards
Reward intervals are calibrated to a target coverage with split-conformal prediction. You get verifiable bounds rather than point estimates.
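The mechanics behind that guarantee are simple enough to sketch. The snippet below is a minimal illustration of split-conformal calibration, not the SDK's actual API; every name and number in it is invented for the example. Hold out calibration residuals, take a finite-sample-corrected quantile, and audit empirical coverage on fresh data.

```python
import numpy as np

def split_conformal_halfwidth(cal_residuals, target=0.90):
    """Half-width q such that [estimate - q, estimate + q] covers the truth
    with probability >= target, marginally and distribution-free."""
    n = len(cal_residuals)
    # Finite-sample correction: the ceil((n+1)*target)-th smallest |residual|.
    rank = int(np.ceil((n + 1) * target))
    return np.sort(np.abs(cal_residuals))[min(rank, n) - 1]

# Calibrate on held-out (estimate, truth) pairs.
rng = np.random.default_rng(0)
truth = rng.normal(size=1000)
estimate = truth + 0.1 * rng.normal(size=1000)
q = split_conformal_halfwidth(estimate - truth, target=0.90)

# Audit: empirical coverage on fresh draws should sit near the 0.90 target.
truth_new = rng.normal(size=1000)
estimate_new = truth_new + 0.1 * rng.normal(size=1000)
coverage = np.mean(np.abs(estimate_new - truth_new) <= q)
```

The only moving part is the quantile rank: the `(n + 1)` correction is what turns an empirical quantile into a finite-sample guarantee rather than an asymptotic one.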
Built to help AI labs ship reproducible, contamination-proof experiments.
Four primitives that fit straight into your post-training, evaluation, or red-team pipeline. No infra rewrite required.
Fresh problem instances on every run.
Each episode samples a brand-new inverse problem from a physical prior. By construction the model has never observed this exact instance, so contamination is structurally impossible rather than statistically unlikely.
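As a hedged illustration of what runtime sampling looks like (this is not the SDK's real generator; the function name, prior, and parameters are invented for the sketch):

```python
import numpy as np

def sample_instance(seed, n=64, k=3, noise=0.01):
    """Draw a fresh k-sparse inverse-problem instance from a simple
    physical prior. Hypothetical sketch; names are illustrative."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.normal(size=k)                   # sparse-signal prior
    y = np.fft.fft(x) + noise * rng.normal(size=n)    # noisy Fourier forward model
    return x, y

# Same seed => bit-identical instance on any machine; new seed => fresh problem.
x1, y1 = sample_instance(seed=42)
x2, y2 = sample_instance(seed=42)
x3, y3 = sample_instance(seed=43)
```

Because the instance is a deterministic function of the seed, reproducibility and freshness come from the same mechanism: replay a seed to audit, draw a new one to evaluate.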
Verifiable rewards, not vibe checks.
Every reward ships with a split-conformal interval and a documented coverage target. Audit empirical coverage against that target on every run, and the harness fails loudly the moment calibration drifts.
Any model, one CLI flag.
Swap Anthropic, OpenAI, Google, or a local model with a single argument. The protocol, the seeds, and the conformal target stay identical, so the same harness drops into any post-training or eval pipeline.
Apache-2.0 code. Frozen DOI. Live hub.
The protocol lives on Zenodo with a frozen DOI. The peer-review thread is on OpenReview. The environments run live on Prime Intellect. Every piece is auditable, citable, and reproducible without an account.
Powering scientific reasoning across five inverse-problem domains.
Each domain ships a closed-form classical solver and procedural variants. Memorization is structurally impossible.

Sparse Fourier Recovery
Recover k-sparse signals from noisy Fourier samples. Three variants from single-shot to tool-using.
CT Reconstruction (LoDoPaB)
Low-dose CT slice reconstruction from sparse-view sinograms with FBP / TV baselines.
MRI Knee (fastMRI)
Knee MRI reconstruction from undersampled k-space at 4× and 8× acceleration.
Phase Retrieval
Recover phase from intensity-only measurements. HIO / Fienup baselines, hard for LLMs.
Super-Resolution DIV2K
4× upscaling on the DIV2K natural-image benchmark with PSNR + SSIM scoring.
Built on a rigorous foundation.
Four primitives engineered together. Contamination, Goodhart drift, and silent regressions stay out of reach by construction.
Closed-form ground truth
Every environment is an inverse problem with a classical solver that has provable convergence: Fienup, FBP, TV-regularized least squares, OMP. The reward is a measurable distance to that solver's output, so there is no graded preference signal for Goodhart drift to exploit.
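A minimal sketch of that idea for the sparse-recovery case, assuming a random partial-Fourier forward operator and a textbook OMP solver; none of these names or parameters come from the SDK:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, k = 64, 32, 3

# Partial Fourier forward operator: m random rows of the n-point DFT.
rows = rng.choice(n, size=m, replace=False)
A = np.fft.fft(np.eye(n))[rows] / np.sqrt(m)

# Ground-truth k-sparse signal and its noiseless measurements.
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
y = A @ x_true

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily build a k-atom support,
    refitting by least squares after each selection."""
    residual, support = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.conj().T @ residual)))
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1], dtype=complex)
    x[support] = coef
    return x

x_hat = omp(A, y, k)

# The reward is a measurable distance to ground truth, not a judged preference.
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
reward = max(0.0, 1.0 - rel_err)
```

Because the score is a norm against a recoverable ground truth, there is no judge for a policy to flatter: an agent either reduces the reconstruction error or it does not.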
Procedural problem generator
Each run samples a fresh instance from a physical prior at runtime. By construction the model has never observed this exact problem before, so contamination is mechanically impossible rather than statistically unlikely.
Provider-agnostic SDK
Swap Anthropic, OpenAI, Google, or local with one CLI flag. The protocol, the seeds, and the conformal target stay identical across providers, so the same harness drops into any post-training or eval pipeline.
Distribution-free coverage
Split-conformal calibration ships a marginal coverage guarantee at every reward step. Pick a target, audit empirical coverage on every run, and fail loudly when calibration drifts.
All ten environments. One CLI flag.
Five frontier models. Real classical baselines. Reproducible scores.
Methodology and guarantees.
Every claim has a citation. Every number has a confidence interval.
Empirical coverage of 0.901 ± 0.017 against a 0.90 target, distribution-free and finite-sample. Run it yourself with --episodes 100; the calibration set ships with each environment.
Every run records the seed and prior parameters. Re-run on a different machine with the same flags and get bit-identical instances. No silent seed drift between releases.
Each environment ships its solver source, hyperparameters, and a provenance log. Audit any reward in under three minutes against a reference implementation.
Code is Apache-2.0, paper is CC-BY-4.0. Frozen DOI on Zenodo, open peer-review thread on OpenReview, live environments on Prime Intellect Hub.
Frontier models. Ten years of research. Four minutes.
One CLI. One reproducible seed. Conformal coverage guarantees printed on every run. Drop into your existing post-training pipeline.
- Deterministic seeds. The same flag yields a bit-identical instance on any machine.
- Documented coverage. Every reward ships with a calibrated split-conformal interval.
- JAX-native. One run takes minutes on a single H100 instead of hours on a node.
$ pip install verifiable-labs
$ verifiable run --env mri-knee-reconstruction \
    --model claude-opus-4 --seed 42
✓ Loading environment...
✓ Calibrating conformal interval (target 0.90)...
✓ Running 100 episodes...
Mean reward: 0.488 ± 0.042
Coverage: 0.901 (target 0.90) ✓
Time: 4m 23s · Cost: $0.18
Reproducible. Archived. Peer-reviewed.
The protocol lives in three persistent records. Artifacts on Zenodo. Peer review on OpenReview. Environments live on Prime Intellect.
Stop benchmarking on contaminated data.
Ten environments. Five frontier-model baselines. Conformal-calibrated rewards. All open source, peer-reviewed, and ready to run.
Run your first benchmark in under five minutes.