25 verification environments. Six capability domains.
Production-grade evaluation environments for math reasoning, code execution, SQL execution, tool orchestration, long-context reasoning, and scientific reconstruction. Every task is generated fresh and every reward is calibrated. Each environment is one CLI flag away. Filter by capability to find the variant your team needs.

Math Algebra
Procedurally generated algebra problems with closed-form symbolic answers. SymPy ground truth.

Math Algebra
Multi-turn algebra: agent receives step-level feedback and refines its derivation chain.

Math Algebra
Tool-using algebra: agent calls a SymPy CAS for substitution, factoring, and simplification.

Code (HumanEval-style)
Procedurally generated HumanEval-style functions verified by sandboxed unit tests.
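
As a rough illustration of test-based verification, the sketch below runs a candidate function together with its unit tests in a separate Python process with a timeout. The helper name `run_tests` and the 5-second default are assumptions made for this example, and a child process with a timeout is only a stand-in for a real sandbox, which would also restrict filesystem, network, and memory access.

```python
import os
import subprocess
import sys
import tempfile


def run_tests(solution_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if the tests pass against the candidate solution.

    Hypothetical helper: process isolation plus a timeout only,
    not a hardened sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(solution_src + "\n\n" + test_src + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # runaway code counts as a failure
        # A failing assert raises, which makes the exit code non-zero.
        return proc.returncode == 0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
passed = run_tests("def add(a, b):\n    return a + b", tests)
failed = run_tests("def add(a, b):\n    return a - b", tests)
```

Here `passed` is true and `failed` is false, so a binary reward can be read straight off the return value.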

Code (HumanEval-style)
Multi-turn coding: agent receives failing-test feedback per round and refines its implementation.

Code (HumanEval-style)
Tool-using coding: agent calls a Python interpreter for incremental verification before commit.

Mini-Repo Refactor
Multi-file refactor across a small repo. Tests on patched files verify behaviour preservation.

SQL Query
Generate SQL against a real schema; reward via execution-result equivalence vs gold query.
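
One way to picture execution-result equivalence is to run both the candidate and the gold query against the same database and compare the returned rows. The sketch below uses the standard-library `sqlite3` module with a made-up `orders` table. The function name `results_equivalent`, the schema, and the choice to ignore row order by default are all assumptions made for illustration, not a description of the environment's actual scorer.

```python
import sqlite3
from collections import Counter


def results_equivalent(conn, candidate_sql, gold_sql, ordered=False):
    """Return True if both queries yield the same rows on `conn`."""
    try:
        cand_rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute earns no reward
    gold_rows = conn.execute(gold_sql).fetchall()
    if ordered:
        return cand_rows == gold_rows
    # Compare as multisets: row order is ignored, duplicates still count.
    return Counter(cand_rows) == Counter(gold_rows)


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ada', 10.0), (2, 'bob', 25.0), (3, 'ada', 5.0);
    """
)
gold = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
candidate = (
    "SELECT o.customer, SUM(o.total) FROM orders AS o "
    "GROUP BY o.customer ORDER BY 1 DESC"
)
reward = 1.0 if results_equivalent(conn, candidate, gold) else 0.0
```

The candidate is written differently from the gold query and returns rows in a different order, yet it earns the full reward because the result sets match.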

SQL Query
Multi-turn SQL: agent inspects intermediate result rows and refines its query iteratively.

Tool Calling
Single-shot tool selection + parameter completion. Ground truth via tool-execution outcome.

Tool Calling
Multi-step tool orchestration: agent chains 3-5 tool calls toward a verified end-state.

Tool Calling (debug)
Debug-resolution chains: trace through failing tool calls and produce a verified fix.

Long Context (Needle)
Needle-in-haystack retrieval over up to 200K-token contexts. Substring-match ground truth.
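
A substring-match check can be as small as the sketch below. The case folding and whitespace collapsing are assumptions added for this example, and `needle_reward` is a hypothetical name.

```python
def _normalise(text: str) -> str:
    """Lower-case and collapse runs of whitespace (an assumed normalisation)."""
    return " ".join(text.lower().split())


def needle_reward(response: str, needle: str) -> float:
    """Return 1.0 if the needle appears verbatim in the response, else 0.0."""
    target = _normalise(needle)
    if not target:
        return 0.0  # an empty needle would trivially match everything
    return 1.0 if target in _normalise(response) else 0.0
```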

Long Context (Synthesis)
Cross-passage synthesis over long documents. ROUGE-L and semantic-equivalence scoring.
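
ROUGE-L scores a candidate against a reference through their longest common subsequence (LCS) of tokens. The sketch below computes the balanced F1 form using plain whitespace tokenisation. Both choices are assumptions made for illustration, since the tokeniser and F-measure weighting used by the environment are not stated here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Balanced ROUGE-L F1 over lower-cased, whitespace-split tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring "the cat sat" against "the cat sat on the mat" gives precision 1, recall 0.5, and an F1 of about 0.67.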

Long Context (Reasoning)
Multi-hop reasoning chains across distant passages. Final-answer + chain-validity scoring.

Sparse Fourier Recovery
Recover a k-sparse signal from m noisy Fourier measurements. Reference recoveries via OMP and L1 minimisation.

Sparse Fourier Recovery
Multi-turn variant: agent iteratively refines support estimates with residual feedback per round.

Sparse Fourier Recovery
Tool-using variant: agent calls FFT, soft-threshold, and least-squares primitives directly.
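
Of the primitives named above, soft-thresholding is the easiest to show in a few lines: it shrinks each coefficient's magnitude by a threshold and zeroes anything at or below it, which is the step that promotes sparsity in iterative L1 solvers. The sketch below handles real and complex coefficients. The name `soft_threshold` and its list-based signature are assumptions and may not match the tool the environment exposes.

```python
def soft_threshold(values, lam):
    """Shrink each coefficient's magnitude by lam; zero it if |v| <= lam.

    Accepts real or complex numbers. For complex input the phase is kept
    and only the magnitude is reduced.
    """
    if lam < 0:
        raise ValueError("threshold must be non-negative")
    out = []
    for v in values:
        mag = abs(v)
        if mag > lam:
            out.append(v * ((mag - lam) / mag))
        else:
            out.append(0 * v)  # keeps the input's numeric type
    return out


shrunk = soft_threshold([3 + 4j, 0.2 - 0.1j], 2.5)
```

In this example the first coefficient has magnitude 5, so it is scaled by 0.5, while the second falls below the threshold and is zeroed.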

CT Reconstruction (LoDoPaB)
Reconstruct low-dose CT slices from sparse-view sinograms. FBP and TV-regularised baselines.

CT Reconstruction (LoDoPaB)
Multi-turn LoDoPaB: agent iterates over filter / regulariser choices with PSNR feedback.

MRI Knee (fastMRI)
Reconstruct knee MRI from undersampled k-space at 4× and 8× acceleration.

MRI Knee (fastMRI)
Multi-turn fastMRI: agent refines coil-combine and regulariser choices over rounds.

Phase Retrieval
Recover phase from intensity-only measurements. HIO and Fienup-style baselines.

Phase Retrieval
Multi-turn phase retrieval: agent tunes support and shrinkage parameters across rounds.

Super-Resolution (DIV2K ×4)
4× upscaling of natural images. Bicubic, ESPCN, and SRCNN baselines for PSNR/SSIM scoring.
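
PSNR, which is used for scoring here, is a log-scaled function of the mean squared error between a reference image and an estimate. A minimal sketch over flat pixel sequences follows. The 8-bit peak value of 255 and the function name `psnr` are assumptions, and a real pipeline would typically operate on image arrays.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better: identical images give infinity, and an error of 10 grey levels on every pixel of an 8-bit image gives roughly 28 dB.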