v0.1.0a4 · April 2026

Documentation

Local-first developer documentation. Five minutes from pip install to first calibrated reward. No account required, no telemetry, no feature gates.

Before you start

You’ll need Python 3.11+, an API key for at least one provider (Anthropic, OpenAI, Google, or OpenRouter), and ~500 MB for the package and its runtime dependencies. Everything runs locally. No Verifiable Labs account required.

Tip
Want to skip the install? Open the Hugging Face Space and run the leaderboard interactively.

Installation

We recommend a fresh virtual environment so the heavy runtime dependencies (numpy, scipy, jax) don’t collide with your system Python.

install.sh
# Recommended: a fresh virtual env (Python 3.11+)
python3.11 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install verifiable-labs

On macOS, install Python 3.11 with brew install python@3.11. On Debian/Ubuntu, use sudo apt install python3.11 python3.11-venv. On Windows, use WSL2 with Ubuntu.

Verify the SDK with verifiable --version (should print verifiable-labs 0.1.0a4) and verifiable list to see all 25 environments.

Set up an API key

The CLI reads provider keys from environment variables. The model name on the command line picks the provider automatically:

env.sh
# Pick the provider you want to evaluate
export ANTHROPIC_API_KEY=sk-ant-...   # Claude models (claude-haiku-4.5, …)
export OPENAI_API_KEY=sk-...       # GPT / o-series
export GOOGLE_API_KEY=...          # Gemini

# Or use OpenRouter to access any provider with one key
export OPENROUTER_API_KEY=sk-or-v1-...

Prefer interactive setup? verifiable login writes the keys to ~/.verifiable/config.toml with mode 0600 (read/write only by your user). Environment variables always take priority over the file.

Run your first evaluation

Pick an environment, a model, and a seed. The CLI samples a fresh task instance, runs the agent, computes the objective ground truth, and emits a calibrated reward interval you can verify.

run.sh
verifiable run \
  --env sparse-fourier-recovery \
  --model claude-haiku-4.5 \
  --episodes 10 \
  --seed 42

Expected output (real numbers; yours will differ slightly):

output.txt
 Loading environment: sparse-fourier-recovery
 Calibrating conformal interval (target 0.90)...
 Running 10 episodes...

Mean reward:   0.327 ± 0.047
Coverage:      0.933 (target 0.90) ✓
Time:          0m 12s · Cost: $0.0021

Trace saved to ~/.verifiable/runs/sparse-fourier-recovery_claude-haiku-4.5_…jsonl

Verify the output

Every run prints a reward, a conformal interval, and a coverage check. Mean reward is the average across episodes (1 = solved, 0 = unsolved). Coverage is the fraction of episodes whose ground truth fell inside the calibrated reward interval (target 0.90 by default). If coverage drops below the target, the calibration set is too small; rerunning with --episodes 100 stabilises it.
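
The coverage check is easier to reason about with a small model of the calibration step. The sketch below uses textbook split conformal prediction, which is an assumption: this page does not say which conformal variant the SDK implements, and conformal_quantile and empirical_coverage are illustrative names, not SDK functions.

```python
import math


def conformal_quantile(residuals, alpha=0.10):
    """Split-conformal radius: the ceil((n + 1) * (1 - alpha))-th smallest residual."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this alpha with a finite interval.
        return math.inf
    return sorted(residuals)[k - 1]


def empirical_coverage(predictions, truths, radius):
    """Fraction of episodes whose ground truth lies within prediction ± radius."""
    hits = sum(abs(t - p) <= radius for p, t in zip(predictions, truths))
    return hits / len(truths)
```

Under this model a target of 0.90 needs at least 9 calibration points before the radius is finite at all, and the radius is a single order statistic, so it moves a lot between small samples. That is consistent with the advice to raise --episodes when coverage misses the target.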

Reproducibility
Pinning --seed guarantees identical problem instances across machines. Seeds are versioned alongside the SDK release and never silently changed.
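
To see why a pinned seed reproduces the same instance, here is a toy generator for a sparse signal. It is not the SDK's generator; make_instance, its parameters, and the returned fields are invented for illustration.

```python
import random


def make_instance(seed, n=16, k=3):
    """Build a toy k-sparse signal of length n, fully determined by `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    # One random key per index; the k indices with the smallest keys form the support.
    keys = [rng.random() for _ in range(n)]
    support = sorted(sorted(range(n), key=keys.__getitem__)[:k])
    # Amplitudes in [-1, 1), drawn from the same stream after the support.
    amplitudes = [2.0 * rng.random() - 1.0 for _ in support]
    return {"n": n, "support": support, "amplitudes": amplitudes}


# Same seed, same instance; every draw comes from the seeded stream.
assert make_instance(42) == make_instance(42)
```

The sketch only calls rng.random(), because that is the method whose sequence Python documents as stable for a given seed; other methods of the random module may change between Python versions.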

Python API

Drive environments directly from Python — useful for embedding verifiable rewards inside a training loop or building a custom harness. The same package that ships the CLI also re-exports load_environment and list_environments.

example.py
from verifiable_labs import load_environment, list_environments

# Discover what’s available
print(list_environments())
# ['lodopab-ct-simplified', 'lodopab-ct-simplified-multiturn', ...]

# Load an env and run its classical baseline (no API key required)
env = load_environment("sparse-fourier-recovery")
result = env.run_baseline(seed=42)
print(f"reward: {result['reward']:.3f}")
print(f"components: {result['components']}")  # nmse / support / conformal

# Or generate a single instance and inspect what the LLM sees
instance = env.generate_instance(seed=42)
print(instance.as_inputs().keys())

The same package also exposes the hosted-API client surface (Client, AsyncClient, Environment, Session), ready for use once the public REST API ships in v0.2.

CLI reference

The friendly CLI surface — what most users need.

Command | Purpose
verifiable list | Print all 25 environment slugs, one per line.
verifiable info --env <slug> | Show metadata for one env (name, domain, target coverage).
verifiable run --env <slug> --model <name> [--episodes N] [--seed N] | Run an LLM agent on the env. Defaults: --episodes 10, --seed 42. Saves a JSONL trace to ~/.verifiable/runs/.
verifiable login | Interactive prompt to write API keys to ~/.verifiable/config.toml (mode 0600).
verifiable --version | Print SDK version.
verifiable --help | List subcommands + global flags.

Lower-level dev commands — envs, compare, report, init-env, validate-env, run --agent FILE --out F.jsonl — also work and are documented in the CLI reference on GitHub.

Provider routing

The model name passed to --model picks the provider — no extra flag needed. All four providers expose OpenAI-compatible endpoints, so the same CLI works against any of them by setting the right env var:

Model prefix | Provider | Env var
claude-* | Anthropic | ANTHROPIC_API_KEY
gpt-* / o1 / o3 / o4* | OpenAI | OPENAI_API_KEY
gemini-* | Google | GOOGLE_API_KEY
<provider>/<model> | OpenRouter | OPENROUTER_API_KEY

OpenRouter is the simplest path for testing — one key, every provider, model names like openai/gpt-4o-mini, anthropic/claude-haiku-4.5, google/gemini-2.5-flash.
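
The prefix rules in the routing table can be expressed as a small lookup. This is a sketch of the rule as documented here, not the CLI's actual code; route_model and ROUTES are hypothetical names.

```python
# (prefixes, provider, environment variable), following the routing table.
ROUTES = [
    (("claude-",), "Anthropic", "ANTHROPIC_API_KEY"),
    (("gpt-", "o1", "o3", "o4"), "OpenAI", "OPENAI_API_KEY"),
    (("gemini-",), "Google", "GOOGLE_API_KEY"),
]


def route_model(model):
    """Return (provider, env_var) for a model name passed to --model."""
    if "/" in model:
        # The <provider>/<model> form always goes through OpenRouter.
        return "OpenRouter", "OPENROUTER_API_KEY"
    for prefixes, provider, env_var in ROUTES:
        if model.startswith(prefixes):  # str.startswith accepts a tuple
            return provider, env_var
    raise ValueError(f"Don't know how to route model {model!r}")
```

Checking for the slash first means openai/gpt-4o-mini resolves to OpenRouter rather than to the OpenAI key.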

All 25 environments

Every environment ships objective ground truth (executable or closed-form) plus a calibrated reward. Scores below are from the cross-model validation runs (best verified score across the five evaluated model families).

Slug | Ground truth | Verified score (best model)
math-algebra | 1.000 | 0.740 (GPT-5)
math-algebra-multiturn | 1.000 | 0.810 (Claude Opus 4)
math-algebra-tools | 1.000 | 0.860 (Claude Opus 4)
code-humaneval | 1.000 | 0.790 (Claude Opus 4)
code-humaneval-multiturn | 1.000 | 0.840 (Claude Opus 4)
code-humaneval-tools | 1.000 | 0.880 (Claude Opus 4)
code-mini-repo | 1.000 | 0.610 (Claude Opus 4)
sql-single-turn | 1.000 | 0.770 (Claude Opus 4)
sql-multiturn | 1.000 | 0.820 (Claude Opus 4)
tool-calling-single | 1.000 | 0.720 (Claude Opus 4)
tool-calling-multiturn | 1.000 | 0.780 (Claude Opus 4)
tool-calling-debug | 1.000 | 0.690 (Claude Opus 4)
long-context-needle | 1.000 | 0.910 (Claude Opus 4)
long-context-synthesis | 1.000 | 0.740 (Claude Opus 4)
long-context-reasoning | 1.000 | 0.660 (Claude Opus 4)
sparse-fourier-recovery | 0.852 | 0.364 (Claude Haiku 4.5)
sparse-fourier-recovery-multiturn | 0.852 | 0.365 (GPT-5.4)
sparse-fourier-recovery-tools | 0.852 | 0.368 (GPT-5.4 Nano)
lodopab-ct-simplified | 0.703 | 0.667 (Claude Opus 4.7)
lodopab-ct-simplified-multiturn | 0.703 | 0.667 (Claude Opus 4.7)
mri-knee-reconstruction | 0.688 | 0.675 (Claude Haiku 4.5)
mri-knee-reconstruction-multiturn | 0.688 | 0.683 (Claude Haiku 4.5)
phase-retrieval | 0.289 | 0.455 (Claude Haiku 4.5)
phase-retrieval-multiturn | 0.289 | 0.353 (GPT-5.4 Nano)
super-resolution-div2k-x4 | 0.728 | 0.733 (Claude Opus 4.7)

Browse them with environment cards on the Environments page or run any of them via Prime Intellect Hub directly.

Cost guide

The CLI computes per-run cost from the prompt and completion token counts returned by the provider. The table below gives approximate per-episode cost on sparse-fourier-recovery (a small text-only env). Image envs (CT, MRI, super-res) emit larger outputs and cost roughly 3-5× more.

Model | $/1M in | $/1M out | ~Per episode
gpt-4o-mini | $0.15 | $0.60 | ~$0.0002
claude-haiku-4.5 | $1.00 | $5.00 | ~$0.0010
gemini-2.5-flash | $0.075 | $0.30 | ~$0.0001
claude-sonnet-4.6 | $3.00 | $15.00 | ~$0.005
gpt-4o | $2.50 | $10.00 | ~$0.004
claude-opus-4 | $15.00 | $75.00 | ~$0.025
Tip
Use openai/gpt-4o-mini via OpenRouter for development: --episodes 100 on sparse-fourier-recovery costs about $0.02. Switch to opus / sonnet only for production audit runs.
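
The per-episode figures follow from simple token arithmetic. The sketch below uses two rows of the price table; the token counts in the example are invented to show the calculation and are not measured values, and episode_cost is a hypothetical helper.

```python
# USD per 1M tokens as (input, output), copied from the cost table.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-haiku-4.5": (1.00, 5.00),
}


def episode_cost(model, prompt_tokens, completion_tokens):
    """Cost in USD of one episode, given the token counts the provider reports."""
    price_in, price_out = PRICES[model]
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000


# Illustrative token counts only: 800 prompt tokens, 150 completion tokens.
cost = episode_cost("gpt-4o-mini", 800, 150)
print(f"${cost:.5f} per episode, ${100 * cost:.3f} per 100 episodes")
```

With those made-up counts the result lands near the table's ~$0.0002 per episode, which is how 100 episodes come to roughly $0.02.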

Trace JSONL schema

Every verifiable run writes a JSONL trace at ~/.verifiable/runs/<env>_<model>_<ts>.jsonl. One JSON object per episode. Schema is stable and ships in verifiable_labs_envs.traces.Trace.

Field | Type | Meaning
env_name | string | Environment slug.
agent_name | string | Internal agent id (e.g. openai:claude-haiku-4.5).
seed | int | Per-episode seed; reproduces the instance bit-identical.
reward | float | Scalar reward in [0, 1].
reward_components | object | Per-env breakdown (e.g. nmse, support, conformal).
parse_success | bool | Whether the LLM’s output parsed cleanly.
failure_type | enum | none / parse_error / timeout / invalid_shape / scoring_error / unknown.
latency_ms | float | Wall-clock time for the agent call.
observation_hash | string | SHA-256 of the env-side prompt; lets you detect drift across SDK releases.
prediction_hash | string | SHA-256 of the agent’s parsed prediction.
model_name | string | Resolved model id passed to the provider.
metadata | object | Free-form. Contains cli, model, error info on failures.

Read traces from Python in a few lines. The schema stays backwards-compatible within a major version:

read_traces.py
import json
from pathlib import Path

trace_path = Path.home() / ".verifiable/runs" / "sparse-fourier-recovery_..."
with open(trace_path) as f:
    for line in f:
        t = json.loads(line)
        print(t["seed"], t["reward"], t["reward_components"])
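
Because observation_hash fingerprints the env-side prompt, two trace files from different SDK releases can be compared to spot drift. A minimal sketch, assuming both files were produced with the same env and seeds; index_hashes and drifted are hypothetical helpers that rely only on the schema fields listed above.

```python
import json


def index_hashes(jsonl_lines):
    """Map (env_name, seed) -> observation_hash for one trace file."""
    table = {}
    for line in jsonl_lines:
        if line.strip():  # tolerate blank lines
            t = json.loads(line)
            table[(t["env_name"], t["seed"])] = t["observation_hash"]
    return table


def drifted(old_lines, new_lines):
    """Return the (env_name, seed) pairs whose prompt hash differs between runs."""
    old, new = index_hashes(old_lines), index_hashes(new_lines)
    shared = old.keys() & new.keys()  # only compare episodes present in both
    return sorted(key for key in shared if old[key] != new[key])
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.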

CI integration

Drop a verifiable run step into any CI pipeline to gate model changes on verified capability checks, not memorised benchmarks. Uploading the trace as an artifact takes one extra step:

verifiable.yml
# .github/workflows/verifiable.yml
name: verifiable-eval

on: [push, pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install verifiable-labs
      - name: Run benchmark
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
        run: |
          verifiable run \
            --env sparse-fourier-recovery \
            --model openai/gpt-4o-mini \
            --episodes 5 --seed 42
      - uses: actions/upload-artifact@v4
        with:
          name: traces
          path: ~/.verifiable/runs/
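
The workflow above records traces but does not fail the build on a weak result. One way to add a gate is a small script that reads the traces and returns a non-zero exit code below a threshold. This is a sketch: summarise and gate are hypothetical helpers, the threshold is a placeholder you would choose yourself, and only the reward and failure_type fields from the trace schema are used.

```python
import json
from collections import Counter
from pathlib import Path


def summarise(run_dir):
    """Return (mean reward, failure_type counts) over every trace in run_dir."""
    rewards, failures = [], Counter()
    for path in sorted(Path(run_dir).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    episode = json.loads(line)
                    rewards.append(episode["reward"])
                    failures[episode["failure_type"]] += 1
    mean = sum(rewards) / len(rewards) if rewards else 0.0
    return mean, failures


def gate(run_dir, min_reward):
    """Return a process exit code: 0 if the mean reward meets the threshold, else 1."""
    mean, failures = summarise(run_dir)
    print(f"mean reward {mean:.3f}, failures {dict(failures)}")
    return 0 if mean >= min_reward else 1
```

In a workflow this could run as a step after verifiable run, for example by calling sys.exit(gate(Path.home() / ".verifiable" / "runs", 0.30)); the 0.30 here is arbitrary.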

Troubleshooting

  • ERROR: Could not find a version that satisfies the requirement verifiable-labs

    Your Python is older than 3.11. Check with python3 --version. Install 3.11 (brew install python@3.11 on macOS) and create a venv with python3.11 -m venv .venv.

  • ❌ ANTHROPIC_API_KEY not set

    The model prefix routes to a provider whose key isn’t exported. Either set the key (export ANTHROPIC_API_KEY=…), run verifiable login, or switch to a model from a provider you have a key for.

  • JSON parse error: Expecting ‘,’ delimiter at position …

    The model emitted malformed JSON. Image envs are now lenient (pad / truncate by ≤4 entries). For other envs, raise --episodes so a single bad response doesn’t dominate, or switch to a stronger model.

  • Coverage: 0.7XX (target 0.90) ✗

    Your sample is too small for the conformal calibrator. Use --episodes 50 or higher and re-run.

  • Don’t know how to route model ‘…’

    The model name doesn’t match a known prefix. Use claude-*, gpt-*, o1/o3/o4, gemini-*, or a provider/model form for OpenRouter.

Citation & license

Apache-2.0 for the SDK and CLI. CC-BY-4.0 for the paper. Cite the frozen Zenodo record:

cite.bib
@misc{zacharioudakis2026verifiable,
  title  = {Conformal-Calibrated Rewards for Scientific RL},
  author = {Zacharioudakis, Stelios},
  year   = {2026},
  doi    = {10.5281/zenodo.19786415},
  url    = {https://zenodo.org/records/19786415},
}

Next steps