25 environments live on Prime Intellect
Environments

25 verification environments. Six capability domains.

Production-grade evaluation environments for math reasoning, code execution, SQL execution, tool orchestration, long-context reasoning, and scientific reconstruction. Every task generated fresh, every reward calibrated. One CLI flag away. Filter by capability to find the variant your team needs.

Single-turn
Symbolic Math

Math Algebra

Procedurally generated algebra problems with closed-form symbolic answers. SymPy ground truth.

Ground truth: 1.000
Verified score: 0.740 (GPT-5)
+0.260 vs baseline · Prime Intellect
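On every scored card on this page, the signed figure equals the ground-truth score minus the verified model score. That reading is inferred from the published numbers, not from a stated definition, and `delta_vs_ground_truth` is a hypothetical helper name used only for illustration.

```python
def delta_vs_ground_truth(ground_truth: float, verified: float) -> float:
    """Gap between the ground-truth score and the model's verified score.

    Assumption: the card's signed figure is ground_truth - verified,
    which matches every scored card listed on this page.
    """
    return round(ground_truth - verified, 3)


# Values copied from two cards on this page.
print(delta_vs_ground_truth(1.000, 0.740))  # Math Algebra, single-turn
print(delta_vs_ground_truth(0.289, 0.455))  # Phase Retrieval, single-turn
```

Under this reading, a negative figure means the model's verified score is higher than the ground-truth score for that environment.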
Multi-turn · Preview
Symbolic Math

Math Algebra

Multi-turn algebra: agent receives step-level feedback and refines its derivation chain.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Tool-using · Preview
Symbolic Math

Math Algebra

Tool-using algebra: agent calls a SymPy CAS for substitution, factoring, and simplification.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Single-turn · Preview
Code Execution

Code (HumanEval-style)

Procedurally generated HumanEval-style functions verified by sandboxed unit tests.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
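Unit-test verification of a generated function can be sketched as follows. This is a minimal illustration, not the platform's harness: `passes_unit_tests` is a hypothetical name, and a child interpreter with a timeout stands in for a real sandbox, which would also need filesystem, network, and memory isolation.

```python
import subprocess
import sys


def passes_unit_tests(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Run candidate code followed by assert-based tests in a fresh interpreter.

    Assumption: a separate process plus a timeout is used here only as a
    stand-in for proper sandboxing.
    """
    program = candidate_src + "\n\n" + test_src
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hanging candidate counts as a failure
    # A failed assert or any uncaught exception yields a non-zero exit code.
    return result.returncode == 0


candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_unit_tests(candidate, tests))
```

A pass/fail boolean is the simplest reward; a per-test pass fraction would be an equally plausible design.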
Multi-turn · Preview
Code Execution

Code (HumanEval-style)

Multi-turn coding: agent receives failing-test feedback per round and refines its implementation.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Tool-using · Preview
Code Execution

Code (HumanEval-style)

Tool-using coding: agent calls a Python interpreter for incremental verification before commit.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Multi-turn · Preview
Code Execution

Mini-Repo Refactor

Multi-file refactor across a small repo. Tests on patched files verify behaviour preservation.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Single-turn · Preview
SQL Execution

SQL Query

Generate SQL against a real schema; reward via execution-result equivalence vs gold query.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
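Execution-result equivalence against a gold query can be illustrated with the standard-library `sqlite3` module. The schema, the queries, and the `results_equivalent` helper below are hypothetical, and comparing rows as an unordered multiset is an assumption; a task whose gold query specifies an ordering would need an ordered comparison instead.

```python
import sqlite3
from collections import Counter


def results_equivalent(conn: sqlite3.Connection, candidate_sql: str, gold_sql: str) -> bool:
    """Compare two queries by the multiset of rows they return.

    Assumption: row order is ignored; column names are not compared.
    """
    try:
        candidate_rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run earns no reward
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(candidate_rows) == Counter(gold_rows)


# Hypothetical schema and data for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
INSERT INTO orders VALUES (1, 'ada', 30.0), (2, 'bob', 12.5), (3, 'ada', 7.5);
""")

gold = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
candidate = (
    "SELECT customer, SUM(total) AS spend FROM orders "
    "GROUP BY customer ORDER BY spend DESC"
)
print(results_equivalent(conn, candidate, gold))
```

Comparing results rather than query text lets differently written but equivalent queries receive the same reward.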
Multi-turn · Preview
SQL Execution

SQL Query

Multi-turn SQL: agent inspects intermediate result rows and refines its query iteratively.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Single-turn · Preview
Tool Orchestration

Tool Calling

Single-shot tool selection + parameter completion. Ground truth via tool-execution outcome.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Multi-turn · Preview
Tool Orchestration

Tool Calling

Multi-step tool orchestration: agent chains 3-5 tool calls toward a verified end-state.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Tool-using · Preview
Tool Orchestration

Tool Calling (debug)

Debug-resolution chains: trace through failing tool calls and produce a verified fix.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Single-turn · Preview
Long-Context Reasoning

Long Context (Needle)

Needle-in-haystack retrieval over up to 200K-token contexts. Substring-match ground truth.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
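A substring-match check for needle retrieval might look like the sketch below. The case and whitespace normalisation is an assumption, since the page does not say how strict the match is, and `needle_reward` is a hypothetical name.

```python
def needle_reward(response: str, needle: str) -> float:
    """Return 1.0 if the needle occurs in the response, else 0.0.

    Assumption: matching ignores case and collapses runs of whitespace.
    """
    def normalise(text: str) -> str:
        return " ".join(text.casefold().split())

    target = normalise(needle)
    if not target:
        raise ValueError("needle must contain non-whitespace text")
    return 1.0 if target in normalise(response) else 0.0


print(needle_reward("The access code is  7Q-ALPHA-42.", "7q-alpha-42"))
```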
Single-turn · Preview
Long-Context Reasoning

Long Context (Synthesis)

Cross-passage synthesis over long documents. ROUGE-L and semantic-equivalence scoring.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
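ROUGE-L scores a candidate against a reference through their longest common subsequence (LCS) of tokens. The sketch below computes the F1 form with whitespace tokenisation and lowercasing; both choices are assumptions, and the semantic-equivalence component of the score is not covered here.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# LCS is "the cat on the mat" (5 of 6 tokens on each side).
print(round(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"), 3))
```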
Single-turn · Preview
Long-Context Reasoning

Long Context (Reasoning)

Multi-hop reasoning chains across distant passages. Final-answer + chain-validity scoring.

Calibrated baseline ships with the next platform release. The environment itself ships objective ground truth today.
Preview · baseline in progress · Prime Intellect
Single-turn
Compressed Sensing

Sparse Fourier Recovery

Recover a k-sparse signal from m noisy Fourier measurements. Ground truth from OMP and L1-minimisation reference solvers.

Ground truth: 0.852
Verified score: 0.364 (Claude Haiku 4.5)
+0.488 vs baseline · Prime Intellect
Multi-turn
Compressed Sensing

Sparse Fourier Recovery

Multi-turn variant: agent iteratively refines support estimates with residual feedback per round.

Ground truth: 0.852
Verified score: 0.365 (GPT-5.4)
+0.487 vs baseline · Prime Intellect
Tool-using
Compressed Sensing

Sparse Fourier Recovery

Tool-using variant: agent calls FFT, soft-threshold, and least-squares primitives directly.

Ground truth: 0.852
Verified score: 0.368 (GPT-5.4 Nano)
+0.484 vs baseline · Prime Intellect
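Of the primitives named on the tool-using card, soft-thresholding is the simplest to show: it shrinks each coefficient's magnitude by a fixed amount and zeroes anything smaller, which is the proximal step used by L1-regularised solvers. The function below is a plain-Python sketch with a hypothetical signature, not the tool's actual interface.

```python
def soft_threshold(values, threshold: float):
    """Shrink each coefficient toward zero by `threshold`.

    Works for real or complex coefficients: the magnitude is reduced by
    `threshold` while the sign (or complex phase) is preserved.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    shrunk = []
    for v in values:
        magnitude = abs(v)
        if magnitude <= threshold:
            shrunk.append(0.0)  # small coefficients are set to zero
        else:
            shrunk.append(v * (magnitude - threshold) / magnitude)
    return shrunk


print(soft_threshold([3.0, -0.5, -2.0, 0.2], 1.0))
print(soft_threshold([3 + 4j], 2.5))
```

Handling complex values matters for Fourier coefficients, where only the magnitude should shrink and the phase should stay unchanged.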
Single-turn
Medical Imaging

CT Reconstruction (LoDoPaB)

Reconstruct low-dose CT slices from sparse-view sinograms. FBP and TV-regularised baselines.

Ground truth: 0.703
Verified score: 0.667 (Claude Opus 4.7)
+0.036 vs baseline · Prime Intellect
Multi-turn
Medical Imaging

CT Reconstruction (LoDoPaB)

Multi-turn LoDoPaB: agent iterates over filter / regulariser choices with PSNR feedback.

Ground truth: 0.703
Verified score: 0.667 (Claude Opus 4.7)
+0.036 vs baseline · Prime Intellect
Single-turn
Medical Imaging

MRI Knee (fastMRI)

Reconstruct knee MRI from undersampled k-space at 4× and 8× acceleration.

Ground truth: 0.688
Verified score: 0.675 (Claude Haiku 4.5)
+0.013 vs baseline · Prime Intellect
Multi-turn
Medical Imaging

MRI Knee (fastMRI)

Multi-turn fastMRI: agent refines coil-combine and regulariser choices over rounds.

Ground truth: 0.688
Verified score: 0.683 (Claude Haiku 4.5)
+0.005 vs baseline · Prime Intellect
Single-turn
Crystallography

Phase Retrieval

Recover phase from intensity-only measurements. HIO and Fienup-style baselines.

Ground truth: 0.289
Verified score: 0.455 (Claude Haiku 4.5)
-0.166 vs baseline · Prime Intellect
Multi-turn
Crystallography

Phase Retrieval

Multi-turn phase retrieval: agent tunes support and shrinkage parameters across rounds.

Ground truth: 0.289
Verified score: 0.353 (GPT-5.4 Nano)
-0.064 vs baseline · Prime Intellect
Single-turn
Image Processing

Super-Resolution (DIV2K ×4)

4× upscaling of natural images. Bicubic, ESPCN, and SRCNN baselines for PSNR/SSIM scoring.

Ground truth: 0.728
Verified score: 0.733 (Claude Opus 4.7)
-0.005 vs baseline · Prime Intellect
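PSNR, one of the two image metrics named on the super-resolution card, follows directly from the mean squared error between a reference image and an estimate. The sketch below assumes pixel values in [0, 1] and images given as nested lists. How PSNR and SSIM are mapped onto the 0-to-1 scores shown on the cards is not stated on this page and is not modelled here.

```python
import math


def psnr(reference, estimate, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for two equally sized 2-D images."""
    ref_flat = [p for row in reference for p in row]
    est_flat = [p for row in estimate for p in row]
    if not ref_flat or len(ref_flat) != len(est_flat):
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - e) ** 2 for r, e in zip(ref_flat, est_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 images; two pixels differ by 0.1, so MSE = 0.005.
reference = [[0.0, 0.5], [1.0, 0.5]]
estimate = [[0.1, 0.5], [0.9, 0.5]]
print(round(psnr(reference, estimate), 2))
```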