Verifiable RL environments for scientific AI agents.

Open-source SDK for training and evaluating LLM agents on scientific inverse problems. Ten environments, conformal-calibrated rewards, fresh instances every run.

$ pip install verifiable-labs
DOI 10.5281/zenodo.19786415 · OpenReview 4kQ17M7jeg · Live on Prime Intellect Hub
10 environments
5 frontier models
50 comparisons
32 significant
5 scientific domains
Why it works

Three primitives. Zero gameable signal.

Verifiable Labs sits on three orthogonal guarantees. Together they make benchmark gaming structurally infeasible.

Mathematical ground truth

Every environment is an inverse problem with a closed-form or provably convergent classical solver. The reward is a measurable distance, so there is nothing to game.

reward = ‖x̂ − x‖₂ / ‖x‖₂
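
A minimal sketch of that reward in plain numpy; the function name and array arguments are illustrative, not the SDK's API:

import numpy as np

def relative_error_reward(x_hat: np.ndarray, x: np.ndarray) -> float:
    # Relative L2 distance between the agent's estimate and the ground truth.
    # 0.0 means exact recovery; there is no judge or preference model in the loop.
    return float(np.linalg.norm(x_hat - x) / np.linalg.norm(x))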

Procedural regeneration

Problem instances are sampled at runtime from physical priors. By construction the model has never observed the exact instance, so contamination is structurally impossible.

x = sample(prior, seed=run_id)
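
A sketch of the same idea, with a hypothetical sparse-spike prior standing in for the physical prior and the run id seeding the generator:

import numpy as np

def sample_instance(run_id: int, dim: int = 256) -> np.ndarray:
    # Seed the generator with the run id: every run draws a fresh,
    # yet fully reproducible, problem instance.
    rng = np.random.default_rng(run_id)
    # Hypothetical "physical prior": a few spikes with Gaussian amplitudes.
    support = rng.choice(dim, size=8, replace=False)
    x = np.zeros(dim)
    x[support] = rng.normal(loc=0.0, scale=1.0, size=8)
    return x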

Conformal-calibrated rewards

Reward intervals are calibrated to a target coverage with split-conformal prediction. You get verifiable bounds rather than point estimates.

I = [ŷ − q̂_{1−α}, ŷ + q̂_{1−α}],  P(y ∈ I) ≥ 1 − α
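
A minimal split-conformal sketch, assuming a held-out calibration set of predicted and true rewards; the names are illustrative, not the SDK's API:

import numpy as np

def conformal_interval(cal_pred, cal_true, new_pred, alpha=0.10):
    # Nonconformity scores on the held-out calibration split.
    scores = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(scores)
    # Finite-sample corrected quantile level: ceil((n+1)(1-alpha)) / n.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(scores, q_level, method="higher")
    # Marginal guarantee: P(y in [new_pred - q_hat, new_pred + q_hat]) >= 1 - alpha.
    return new_pred - q_hat, new_pred + q_hat
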
The substrate for verifiable RL

Built to help AI labs ship reproducible, contamination-proof experiments.

Four primitives that fit straight into your post-training, evaluation, or red-team pipeline. No infra rewrite required.

Procedural regeneration

Fresh problem instances on every run.

Each episode samples a brand-new inverse problem from a physical prior. By construction the model has never observed this exact instance, so contamination is structurally impossible rather than statistically unlikely.

Conformal calibration

Verifiable rewards, not vibe checks.

Every reward ships with a split-conformal interval and a documented coverage target. Audit empirical coverage against that target on every run, and the harness fails loudly the moment calibration drifts.
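
A sketch of what such an audit can look like, under the assumption that the harness records each episode's realized reward and its interval; the names and failure behaviour here are illustrative:

import numpy as np

def audit_coverage(y_true, lower, upper, target=0.90, tol=0.02):
    y_true = np.asarray(y_true)
    # Fraction of episodes whose realized reward landed inside its interval.
    covered = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    empirical = float(covered.mean())
    # Fail loudly the moment empirical coverage drifts below the documented target.
    if empirical < target - tol:
        raise RuntimeError(f"coverage {empirical:.3f} fell below target {target:.2f}")
    return empirical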

Frontier-model native

Any model, one CLI flag.

Swap Anthropic, OpenAI, Google, or a local model with a single argument. The protocol, the seeds, and the conformal target stay identical, so the same harness drops into any post-training or eval pipeline.

Open and peer-reviewed

Apache-2.0 code. Frozen DOI. Live hub.

The protocol lives on Zenodo with a frozen DOI. The peer-review thread is on OpenReview. The environments run live on Prime Intellect. Every piece is auditable, citable, and reproducible without an account.

Platform

Built on a rigorous foundation.

Four primitives engineered together. Contamination, Goodhart drift, and silent regressions stay out of reach by construction.

Closed-form ground truth

Every environment is an inverse problem with a classical solver that has provable convergence. Fienup, FBP, TV-regularized least-squares, OMP. The reward is a measurable distance to that solver, so there is nothing for a graded preference to drift on.
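
As one example, orthogonal matching pursuit fits in a few lines; this is a textbook sketch of that classical baseline, not the environment's reference implementation:

import numpy as np

def omp(A: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    # Orthogonal matching pursuit: greedily recover a k-sparse x from y = A @ x.
    residual, support = y.copy(), []
    x = np.zeros(A.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Re-fit least squares on the selected support and update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x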

Procedural problem generator

Each run samples a fresh instance from a physical prior at runtime. By construction the model has never observed this exact problem before, so contamination is mechanically impossible rather than statistically unlikely.

Provider-agnostic SDK

Swap Anthropic, OpenAI, Google, or local with one CLI flag. The protocol, the seeds, and the conformal target stay identical across providers, so the same harness drops into any post-training or eval pipeline.

Distribution-free coverage

Split-conformal calibration ships a marginal coverage guarantee at every reward step. Pick a target, audit empirical coverage on every run, and fail loudly when calibration drifts.

All ten environments. One CLI flag.

Five frontier models. Real classical baselines. Reproducible scores. Filter by scientific domain.

Methodology and guarantees.

Every claim has a citation. Every number has a confidence interval.


Empirical coverage of 0.901 ± 0.017 against a 0.90 target, distribution-free and finite-sample. Run --episodes 100 yourself; the calibration set is shipped with each environment.


Every run records the seed and prior parameters. Re-run on a different machine with the same flags and get bit-identical instances. No silent seed drift between releases.
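
One way to spot-check that property, sketched with a stand-in generator rather than the SDK's own: hash the raw bytes of a seeded draw and compare digests across runs or machines.

import hashlib
import numpy as np

def instance_digest(seed: int, dim: int = 256) -> str:
    # Hash the raw bytes of a seeded draw; identical seeds must yield identical digests.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    return hashlib.sha256(x.tobytes()).hexdigest()

# Same seed, any machine: the digests must match exactly.
assert instance_digest(42) == instance_digest(42)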


Each environment ships its solver source, hyperparameters, and a provenance log. Audit any reward in under three minutes against a reference implementation.


Code is Apache-2.0, paper is CC-BY-4.0. Frozen DOI on Zenodo, open peer-review thread on OpenReview, live environments on Prime Intellect Hub.


SDK

Frontier models. Ten years of research. Four minutes.

One CLI. One reproducible seed. Conformal coverage guarantees printed on every run. Drop into your existing post-training pipeline.

  • Deterministic seeds. The same flag yields a bit-identical instance on any machine.
  • Documented coverage. Every reward ships with a calibrated split-conformal interval.
  • JAX-native. One run takes minutes on a single H100 instead of hours on a node.
bench.sh
$ pip install verifiable-labs
$ verifiable run --env mri-knee-reconstruction \
    --model claude-opus-4 --seed 42

 Loading environment...
 Calibrating conformal interval (target 0.90)...
 Running 100 episodes...

Mean reward:   0.488 ± 0.042
Coverage:      0.901 (target 0.90) ✓
Time:          4m 23s · Cost: $0.18
Backed by research

Reproducible. Archived. Peer-reviewed.

The protocol lives in three persistent records. Artifacts on Zenodo. Peer review on OpenReview. Environments live on Prime Intellect.

Stop benchmarking on contaminated data.

Ten environments. Five frontier-model baselines. Conformal-calibrated rewards. All open source, peer-reviewed, and ready to run.

Run your first benchmark in under five minutes.