Environments

25 verification environments. Six capability domains.

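Production-grade evaluation environments for math reasoning, code execution, SQL execution, tool orchestration, long-context reasoning, and scientific reconstruction. Every task is generated fresh and every reward is calibrated, though environments marked Preview ship objective ground truth today and receive their calibrated baseline in the next platform release. Each environment is one CLI flag away. Filter by capability to find the variant your team needs.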

Single-turn

Symbolic Math

Math Algebra

Procedurally generated algebra problems with closed-form symbolic answers. Ground truth computed with SymPy.

Ground truth: 1.000

Verified score: 0.740 (GPT-5)

+0.260 vs baseline · Prime Intellect
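On every calibrated card on this page, the signed figure equals the ground-truth score minus the verified score, so a negative value means the verified model scored above the reference. The sketch below checks that reading against three cards; the function name `gap_vs_baseline` is illustrative and not part of the platform.

```python
def gap_vs_baseline(ground_truth: float, verified: float) -> float:
    """Signed gap as shown on the cards: ground truth minus verified score."""
    return round(ground_truth - verified, 3)


# (ground truth, verified score) pairs copied from three cards on this page.
cards = {
    "Math Algebra (single-turn)": (1.000, 0.740),
    "Phase Retrieval (single-turn)": (0.289, 0.455),
    "Super-Resolution (DIV2K x4)": (0.728, 0.733),
}

for name, (gt, score) in cards.items():
    print(f"{name}: {gap_vs_baseline(gt, score):+.3f} vs baseline")
```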

Multi-turn · Preview

Symbolic Math

Math Algebra

Multi-turn algebra: agent receives step-level feedback and refines its derivation chain.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Tool-using · Preview

Symbolic Math

Math Algebra

Tool-using algebra: agent calls a SymPy CAS for substitution, factoring, and simplification.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Single-turn · Preview

Code Execution

Code (HumanEval-style)

Procedurally generated HumanEval-style functions verified by sandboxed unit tests.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect
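Unit-test verification of generated code can be sketched as follows. This is an illustration under stated assumptions, not the environment's harness: `run_candidate` is a hypothetical helper, and a child process with a timeout stands in for a real sandbox, which would also need filesystem, network, and memory isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate passes every assertion in test_src."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        # Candidate definition first, then the assertions that exercise it.
        script.write_text(candidate_src + "\n\n" + test_src + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # non-terminating code counts as a failure
        return proc.returncode == 0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(run_candidate("def add(a, b):\n    return a + b", tests))
print(run_candidate("def add(a, b):\n    return a - b", tests))
```

A failing assertion, a syntax error, or a timeout all yield the same failed result here; a harness that feeds failing tests back to the agent would need to capture `proc.stderr` as well.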

Multi-turn · Preview

Code Execution

Code (HumanEval-style)

Multi-turn coding: agent receives failing-test feedback per round and refines its implementation.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Tool-using · Preview

Code Execution

Code (HumanEval-style)

Tool-using coding: agent calls a Python interpreter for incremental verification before commit.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Multi-turn · Preview

Code Execution

Mini-Repo Refactor

Multi-file refactor across a small repo. Tests on patched files verify behaviour preservation.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Single-turn · Preview

SQL Execution

SQL Query

Generate SQL against a real schema; reward via execution-result equivalence vs gold query.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect
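Execution-result equivalence can be illustrated with the standard-library `sqlite3` module. The schema, queries, and the `results_match` helper below are assumptions made for the example; the database engine and comparison rules the environment uses are not stated on this page. A production harness would also run the candidate against a read-only connection.

```python
import sqlite3
from collections import Counter


def results_match(conn, candidate_sql, gold_sql, ordered=False):
    """Compare the rows returned by two queries on the same database."""
    try:
        candidate_rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute earns no reward
    gold_rows = conn.execute(gold_sql).fetchall()
    if ordered:
        return candidate_rows == gold_rows
    # Order-insensitive comparison that still respects duplicate rows.
    return Counter(candidate_rows) == Counter(gold_rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ada', 30.0), (2, 'bob', 12.5), (3, 'ada', 7.5);
""")

gold = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
candidate = (
    "SELECT customer, SUM(total) FROM orders "
    "GROUP BY customer ORDER BY customer DESC"
)
wrong = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

print(results_match(conn, candidate, gold))  # same rows, different order
print(results_match(conn, wrong, gold))      # different aggregate
```

Comparing row multisets rather than query text lets differently written but equivalent queries earn full reward; set `ordered=True` when the task specifies a sort order.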

Multi-turn · Preview

SQL Execution

SQL Query

Multi-turn SQL: agent inspects intermediate result rows and refines its query iteratively.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Single-turn · Preview

Tool Orchestration

Tool Calling

Single-shot tool selection + parameter completion. Ground truth via tool-execution outcome.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Multi-turn · Preview

Tool Orchestration

Tool Calling

Multi-step tool orchestration: agent chains 3–5 tool calls toward a verified end-state.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Tool-using · Preview

Tool Orchestration

Tool Calling (debug)

Debug-resolution chains: trace through failing tool calls and produce a verified fix.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Single-turn · Preview

Long-Context Reasoning

Long Context (Needle)

Needle-in-haystack retrieval over up to 200K-token contexts. Substring-match ground truth.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect
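A substring-match check of the kind described above might look like the sketch below. The case and whitespace normalisation is an assumption added for the example, as are the names `normalise` and `needle_reward`; the page does not say whether the environment matches exactly or after normalisation.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def needle_reward(response: str, needle: str) -> float:
    """1.0 if the normalised needle occurs in the normalised response, else 0.0."""
    target = normalise(needle)
    if not target:
        raise ValueError("needle must contain non-whitespace text")
    return 1.0 if target in normalise(response) else 0.0


needle = "The vault code is 7431"
print(needle_reward("According to the memo, the vault  code is 7431.", needle))
print(needle_reward("The memo does not mention a code.", needle))
```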

Single-turn · Preview

Long-Context Reasoning

Long Context (Synthesis)

Cross-passage synthesis over long documents. ROUGE-L and semantic-equivalence scoring.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect
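The ROUGE-L half of that scoring rests on the longest common subsequence (LCS) between a candidate and a reference. The sketch below computes the balanced F-measure with whitespace tokenisation and lower-casing, both of which are assumptions for the example; the environment's tokeniser, its weighting of precision against recall, and its semantic-equivalence check are not described here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# LCS is "the cat on the mat" (5 tokens) out of 6 on each side.
print(round(rouge_l_f1("the cat lay on the mat", "the cat sat on the mat"), 4))
```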

Single-turn · Preview

Long-Context Reasoning

Long Context (Reasoning)

Multi-hop reasoning chains across distant passages. Final-answer + chain-validity scoring.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.

Preview · baseline in progress · Prime Intellect

Single-turn

Compressed Sensing

Sparse Fourier Recovery

Recover a k-sparse signal from m noisy Fourier measurements. Ground truth via OMP and L1-minimisation reference solvers.

Ground truth: 0.852

Verified score: 0.364 (Claude Haiku 4.5)

+0.488 vs baseline · Prime Intellect

Multi-turn

Compressed Sensing

Sparse Fourier Recovery

Multi-turn variant: agent iteratively refines support estimates with residual feedback per round.

Ground truth: 0.852

Verified score: 0.365 (GPT-5.4)

+0.487 vs baseline · Prime Intellect

Tool-using

Compressed Sensing

Sparse Fourier Recovery

Tool-using variant: agent calls FFT, soft-threshold, and least-squares primitives directly.

Ground truth: 0.852

Verified score: 0.368 (GPT-5.4 Nano)

+0.484 vs baseline · Prime Intellect
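Of the primitives named on this card, soft-thresholding is the simplest to show: it is the proximal operator of the L1 norm, shrinking each coefficient's magnitude by a threshold and zeroing anything at or below it. The pure-Python sketch below handles complex coefficients; the signature is an assumption, since the page does not document how the environment exposes the tool.

```python
def soft_threshold(coeffs, t):
    """Shrink each complex coefficient's magnitude by t; zero those at or below t."""
    if t < 0:
        raise ValueError("threshold must be non-negative")
    out = []
    for z in coeffs:
        mag = abs(z)
        # Scaling by (1 - t / mag) reduces the magnitude by t and keeps the phase.
        out.append(z * (1 - t / mag) if mag > t else 0j)
    return out


spectrum = [3 + 4j, 0.5j, -2 + 0j]
print(soft_threshold(spectrum, 1.0))
```

With a threshold of 1.0, the coefficient of magnitude 5 shrinks to magnitude 4, the one of magnitude 0.5 is zeroed, and the one of magnitude 2 shrinks to magnitude 1.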

Single-turn

Medical Imaging

CT Reconstruction (LoDoPaB)

Reconstruct low-dose CT slices from sparse-view sinograms. FBP and TV-regularised baselines.

Ground truth: 0.703

Verified score: 0.667 (Claude Opus 4.7)

+0.036 vs baseline · Prime Intellect

Multi-turn

Medical Imaging

CT Reconstruction (LoDoPaB)

Multi-turn LoDoPaB: agent iterates over filter / regulariser choices with PSNR feedback.

Ground truth: 0.703

Verified score: 0.667 (Claude Opus 4.7)

+0.036 vs baseline · Prime Intellect

Single-turn

Medical Imaging

MRI Knee (fastMRI)

Reconstruct knee MRI from undersampled k-space at 4× and 8× acceleration.

Ground truth: 0.688

Verified score: 0.675 (Claude Haiku 4.5)

+0.013 vs baseline · Prime Intellect

Multi-turn

Medical Imaging

MRI Knee (fastMRI)

Multi-turn fastMRI: agent refines coil-combine and regulariser choices over rounds.

Ground truth: 0.688

Verified score: 0.683 (Claude Haiku 4.5)

+0.005 vs baseline · Prime Intellect

Single-turn

Crystallography

Phase Retrieval

Recover phase from intensity-only measurements. HIO and Fienup-style baselines.

Ground truth: 0.289

Verified score: 0.455 (Claude Haiku 4.5)

-0.166 vs baseline · Prime Intellect

Multi-turn

Crystallography

Phase Retrieval

Multi-turn phase retrieval: agent tunes support and shrinkage parameters across rounds.

Ground truth: 0.289

Verified score: 0.353 (GPT-5.4 Nano)

-0.064 vs baseline · Prime Intellect

Single-turn

Image Processing

Super-Resolution (DIV2K ×4)

4× upscaling of natural images. Bicubic, ESPCN, and SRCNN baselines for PSNR/SSIM scoring.

Ground truth: 0.728

Verified score: 0.733 (Claude Opus 4.7)

-0.005 vs baseline · Prime Intellect
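For reference, PSNR is defined from the mean squared error between a reference image and an estimate. The sketch below works on flat sequences of pixel intensities and assumes values scaled to [0, 1]; how PSNR and SSIM are mapped to the 0 to 1 scores shown on the card is not stated on this page.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.5, 1.0, 0.25]
estimate = [0.1, 0.6, 0.9, 0.35]  # every pixel is off by 0.1, so MSE is 0.01
print(f"{psnr(reference, estimate):.2f} dB")
```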