Documentation
Local-first developer documentation. Five minutes from pip install to first calibrated reward. No account required, no telemetry, no feature gates.
Before you start
You’ll need Python 3.11+, an API key for at least one provider (Anthropic, OpenAI, Google, or OpenRouter), and ~500 MB for the package and its runtime dependencies. Everything runs locally. No Verifiable Labs account required.
Installation
We recommend a fresh virtual environment so the heavy runtime dependencies (numpy, scipy, jax) don’t collide with your system Python.
    # Recommended: a fresh virtual env (Python 3.11+)
    python3.11 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install verifiable-labs
On macOS, install Python 3.11 with brew install python@3.11. On Debian/Ubuntu, use sudo apt install python3.11 python3.11-venv. On Windows, use WSL2 with Ubuntu.
Verify the SDK with verifiable --version (should print verifiable-labs 0.1.0a4) and verifiable list to see all 25 environments.
Set up an API key
The CLI reads provider keys from environment variables. The model name on the command line picks the provider automatically:
    # Pick the provider you want to evaluate
    export ANTHROPIC_API_KEY=sk-ant-...   # Claude models (claude-haiku-4.5, …)
    export OPENAI_API_KEY=sk-...          # GPT / o-series
    export GOOGLE_API_KEY=...             # Gemini

    # Or use OpenRouter to access any provider with one key
    export OPENROUTER_API_KEY=sk-or-v1-...
Prefer interactive setup? verifiable login writes the keys to ~/.verifiable/config.toml with mode 0600 (read/write only by your user). Environment variables always take priority over the file.
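The precedence rule above (environment variable first, config file second) can be sketched as a small lookup. This is an illustration only: resolve_key is a hypothetical helper, and the config file is modelled as an already-parsed mapping because the layout of ~/.verifiable/config.toml is not documented here.

```python
import os
from typing import Mapping, Optional


def resolve_key(
    name: str,
    file_keys: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the key for `name`, preferring the environment over the file.

    `file_keys` stands in for the parsed config file (assumed layout:
    one entry per environment-variable name).
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if value:  # a set, non-empty environment variable always wins
        return value
    return file_keys.get(name)  # None if the file has no entry either


# Hypothetical values, for illustration only.
file_keys = {"ANTHROPIC_API_KEY": "key-from-file"}
print(resolve_key("ANTHROPIC_API_KEY", file_keys, environ={"ANTHROPIC_API_KEY": "key-from-env"}))
print(resolve_key("ANTHROPIC_API_KEY", file_keys, environ={}))
```

The first call prints the environment value and the second falls back to the file entry.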
Run your first evaluation
Pick an environment, a model, and a seed. The CLI samples a fresh task instance, runs the agent, computes the objective ground truth, and emits a calibrated reward interval you can verify.
    $ verifiable run \
        --env sparse-fourier-recovery \
        --model claude-haiku-4.5 \
        --episodes 10 \
        --seed 42
Expected output (real numbers; yours will differ slightly):
    ✓ Loading environment: sparse-fourier-recovery
    ✓ Calibrating conformal interval (target 0.90)...
    ✓ Running 10 episodes...
    Mean reward: 0.327 ± 0.047
    Coverage: 0.933 (target 0.90) ✓
    Time: 0m 12s · Cost: $0.0021
    Trace saved to ~/.verifiable/runs/sparse-fourier-recovery_claude-haiku-4.5_…jsonl
Verify the output
Every run prints a reward, a conformal interval, and a coverage check. Mean reward is the average across episodes (1 = solved, 0 = unsolved). Coverage is the fraction of episodes whose ground truth fell inside the calibrated reward interval (target 0.90 by default). If coverage drops below the target, the calibration set is probably too small; re-running with --episodes 100 usually stabilises it.
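The two summary numbers can be reproduced by hand. The sketch below assumes each episode yields a scalar reward, a (low, high) interval, and a ground-truth value; summarize and the sample values are hypothetical and not part of the SDK.

```python
from statistics import mean


def summarize(rewards, intervals, truths):
    """Compute mean reward and empirical coverage over a set of episodes.

    rewards:   per-episode scalar rewards
    intervals: per-episode (low, high) calibrated intervals
    truths:    per-episode ground-truth values checked against the intervals
    """
    if not rewards or not (len(rewards) == len(intervals) == len(truths)):
        raise ValueError("need equally sized, non-empty inputs")
    # An episode is covered when its ground truth lies inside its interval.
    covered = [low <= truth <= high for (low, high), truth in zip(intervals, truths)]
    return {"mean_reward": mean(rewards), "coverage": sum(covered) / len(covered)}


# Hypothetical values for four episodes.
rewards = [0.2, 0.4, 0.3, 0.5]
intervals = [(0.1, 0.3), (0.3, 0.5), (0.2, 0.4), (0.1, 0.2)]
truths = [0.25, 0.45, 0.35, 0.40]
print(summarize(rewards, intervals, truths))
```

Here three of the four ground-truth values fall inside their intervals, so coverage is 0.75, below a 0.90 target. With so few episodes a single miss moves coverage by 25 points, which is why larger --episodes values give a steadier estimate.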
--seed guarantees identical problem instances across machines. Seeds are versioned alongside the SDK release and never silently changed.
Python API
Drive environments directly from Python — useful for embedding verifiable rewards inside a training loop or building a custom harness. The same package that ships the CLI also re-exports load_environment and list_environments.
    from verifiable_labs import load_environment, list_environments

    # Discover what’s available
    print(list_environments())
    # ['lodopab-ct-simplified', 'lodopab-ct-simplified-multiturn', ...]

    # Load an env and run its classical baseline (no API key required)
    env = load_environment("sparse-fourier-recovery")
    result = env.run_baseline(seed=42)
    print(f"reward: {result['reward']:.3f}")
    print(f"components: {result['components']}")  # nmse / support / conformal

    # Or generate a single instance and inspect what the LLM sees
    instance = env.generate_instance(seed=42)
    print(instance.as_inputs().keys())
The same package also exposes the hosted-API client surface (Client, AsyncClient, Environment, Session), for use once the public REST API ships in v0.2.
CLI reference
The friendly CLI surface — what most users need.
| Command | Purpose |
|---|---|
| verifiable list | Print all 25 environment slugs, one per line. |
| verifiable info --env <slug> | Show metadata for one env (name, domain, target coverage). |
| verifiable run --env <slug> --model <name> [--episodes N] [--seed N] | Run an LLM agent on the env. Defaults: --episodes 10, --seed 42. Saves a JSONL trace to ~/.verifiable/runs/. |
| verifiable login | Interactive prompt to write API keys to ~/.verifiable/config.toml with mode 0600. |
| verifiable --version | Print SDK version. |
| verifiable --help | List subcommands and global flags. |
Lower-level dev commands — envs, compare, report, init-env, validate-env, run --agent FILE --out F.jsonl — also work and are documented in the CLI reference on GitHub.
Provider routing
The model name passed to --model picks the provider — no extra flag needed. All four providers expose OpenAI-compatible endpoints, so the same CLI works against any of them by setting the right env var:
| Model prefix | Provider | Env var |
|---|---|---|
| claude-* | Anthropic | ANTHROPIC_API_KEY |
| gpt-* / o1 / o3 / o4* | OpenAI | OPENAI_API_KEY |
| gemini-* | Google | GOOGLE_API_KEY |
| <provider>/<model> | OpenRouter | OPENROUTER_API_KEY |
OpenRouter is the simplest path for testing — one key, every provider, model names like openai/gpt-4o-mini, anthropic/claude-haiku-4.5, google/gemini-2.5-flash.
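The routing table can be read as a prefix match. The sketch below is one plausible way to express it, not the CLI's actual implementation: route_provider is a hypothetical name, and treating o1, o3, and o4 as prefixes (so that variants also match) is an assumption.

```python
def route_provider(model: str) -> tuple[str, str]:
    """Map a model name to (provider, env var), following the table above."""
    # provider/model names go to OpenRouter; check this first, because
    # names such as "openai/gpt-4o-mini" also contain other prefixes.
    if "/" in model:
        return ("OpenRouter", "OPENROUTER_API_KEY")
    if model.startswith("claude-"):
        return ("Anthropic", "ANTHROPIC_API_KEY")
    # Assumption: o1 / o3 / o4 are matched as prefixes.
    if model.startswith(("gpt-", "o1", "o3", "o4")):
        return ("OpenAI", "OPENAI_API_KEY")
    if model.startswith("gemini-"):
        return ("Google", "GOOGLE_API_KEY")
    raise ValueError(f"Don't know how to route model '{model}'")


for name in ("claude-haiku-4.5", "openai/gpt-4o-mini", "gemini-2.5-flash"):
    provider, env_var = route_provider(name)
    print(f"{name} -> {provider} ({env_var})")
```

The order of the checks matters: testing for a slash first keeps OpenRouter names from being captured by a provider prefix that appears later in the string.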
All 25 environments
Every environment ships objective ground truth (executable or closed-form) plus a calibrated reward. Scores below are from the cross-model validation runs (best verified score across the five evaluated model families).
| Slug | Ground truth | Verified score | Best model |
|---|---|---|---|
| math-algebra | 1.000 | 0.740 | GPT-5 |
| math-algebra-multiturn | 1.000 | 0.810 | Claude Opus 4 |
| math-algebra-tools | 1.000 | 0.860 | Claude Opus 4 |
| code-humaneval | 1.000 | 0.790 | Claude Opus 4 |
| code-humaneval-multiturn | 1.000 | 0.840 | Claude Opus 4 |
| code-humaneval-tools | 1.000 | 0.880 | Claude Opus 4 |
| code-mini-repo | 1.000 | 0.610 | Claude Opus 4 |
| sql-single-turn | 1.000 | 0.770 | Claude Opus 4 |
| sql-multiturn | 1.000 | 0.820 | Claude Opus 4 |
| tool-calling-single | 1.000 | 0.720 | Claude Opus 4 |
| tool-calling-multiturn | 1.000 | 0.780 | Claude Opus 4 |
| tool-calling-debug | 1.000 | 0.690 | Claude Opus 4 |
| long-context-needle | 1.000 | 0.910 | Claude Opus 4 |
| long-context-synthesis | 1.000 | 0.740 | Claude Opus 4 |
| long-context-reasoning | 1.000 | 0.660 | Claude Opus 4 |
| sparse-fourier-recovery | 0.852 | 0.364 | Claude Haiku 4.5 |
| sparse-fourier-recovery-multiturn | 0.852 | 0.365 | GPT-5.4 |
| sparse-fourier-recovery-tools | 0.852 | 0.368 | GPT-5.4 Nano |
| lodopab-ct-simplified | 0.703 | 0.667 | Claude Opus 4.7 |
| lodopab-ct-simplified-multiturn | 0.703 | 0.667 | Claude Opus 4.7 |
| mri-knee-reconstruction | 0.688 | 0.675 | Claude Haiku 4.5 |
| mri-knee-reconstruction-multiturn | 0.688 | 0.683 | Claude Haiku 4.5 |
| phase-retrieval | 0.289 | 0.455 | Claude Haiku 4.5 |
| phase-retrieval-multiturn | 0.289 | 0.353 | GPT-5.4 Nano |
| super-resolution-div2k-x4 | 0.728 | 0.733 | Claude Opus 4.7 |
Browse them with environment cards on the Environments page or run any of them via Prime Intellect Hub directly.
Cost guide
The CLI computes per-run cost from the prompt and completion token counts returned by the provider. The table below gives approximate per-episode costs on sparse-fourier-recovery (a small text-only env). Image envs (CT, MRI, super-res) emit larger outputs and cost roughly 3-5× more.
| Model | $/1M in | $/1M out | ~Per episode |
|---|---|---|---|
| gpt-4o-mini | $0.15 | $0.60 | ~$0.0002 |
| claude-haiku-4.5 | $1.00 | $5.00 | ~$0.0010 |
| gemini-2.5-flash | $0.075 | $0.30 | ~$0.0001 |
| claude-sonnet-4.6 | $3.00 | $15.00 | ~$0.005 |
| gpt-4o | $2.50 | $10.00 | ~$0.004 |
| claude-opus-4 | $15.00 | $75.00 | ~$0.025 |
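The per-episode figures follow from token counts multiplied by the per-million prices. A minimal sketch of that arithmetic, using the prices from the table; the function name and the token counts are hypothetical, since real counts depend on the environment and the model's output length.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-haiku-4.5": (1.00, 5.00),
    "gemini-2.5-flash": (0.075, 0.30),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "claude-opus-4": (15.00, 75.00),
}


def episode_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD of one episode, given the token counts the provider reports."""
    price_in, price_out = PRICES[model]
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000


# Hypothetical token counts for one small text-only episode.
cost = episode_cost("gpt-4o-mini", prompt_tokens=800, completion_tokens=150)
print(f"${cost:.5f} per episode, ${cost * 100:.3f} per 100 episodes")
```

With these assumed counts the result is about $0.0002 per episode, in line with the gpt-4o-mini row.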
For development, use openai/gpt-4o-mini via OpenRouter: --episodes 100 on sparse-fourier-recovery costs about $0.02. Switch to opus / sonnet only for production audit runs.
Trace JSONL schema
Every verifiable run writes a JSONL trace at ~/.verifiable/runs/<env>_<model>_<ts>.jsonl. One JSON object per episode. Schema is stable and ships in verifiable_labs_envs.traces.Trace.
| Field | Type | Meaning |
|---|---|---|
| env_name | string | Environment slug. |
| agent_name | string | Internal agent id (e.g. openai:claude-haiku-4.5). |
| seed | int | Per-episode seed; reproduces the instance bit-identical. |
| reward | float | Scalar reward in [0, 1]. |
| reward_components | object | Per-env breakdown (e.g. nmse, support, conformal). |
| parse_success | bool | Whether the LLM’s output parsed cleanly. |
| failure_type | enum | none / parse_error / timeout / invalid_shape / scoring_error / unknown. |
| latency_ms | float | Wall-clock time for the agent call. |
| observation_hash | string | SHA-256 of the env-side prompt; lets you detect drift across SDK releases. |
| prediction_hash | string | SHA-256 of the agent’s parsed prediction. |
| model_name | string | Resolved model id passed to the provider. |
| metadata | object | Free-form. Contains cli, model, error info on failures. |
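The observation_hash field makes drift checks straightforward: if two traces were produced with the same seeds on different SDK releases, any seed whose hash differs saw a different prompt. The sketch below assumes only the seed and observation_hash fields from the table; both function names are hypothetical.

```python
import json


def index_hashes(lines):
    """Map seed -> observation_hash for the episodes in one trace."""
    hashes = {}
    for line in lines:
        if line.strip():  # skip blank lines
            episode = json.loads(line)
            hashes[episode["seed"]] = episode["observation_hash"]
    return hashes


def drifted_seeds(old_lines, new_lines):
    """Return the seeds present in both traces whose prompts differ."""
    old, new = index_hashes(old_lines), index_hashes(new_lines)
    return sorted(seed for seed in old.keys() & new.keys() if old[seed] != new[seed])


# Hypothetical trace lines, shortened to the two fields used here.
old = ['{"seed": 42, "observation_hash": "aaa"}', '{"seed": 43, "observation_hash": "bbb"}']
new = ['{"seed": 42, "observation_hash": "aaa"}', '{"seed": 43, "observation_hash": "ccc"}']
print(drifted_seeds(old, new))
```

In this example only seed 43 is reported, because its hash changed between the two traces.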
Read traces from Python in a few lines. The schema stays backwards-compatible within a major version:
    import json
    from pathlib import Path

    trace_path = Path.home() / ".verifiable/runs" / "sparse-fourier-recovery_..."
    with open(trace_path) as f:
        for line in f:
            t = json.loads(line)
            print(t["seed"], t["reward"], t["reward_components"])
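Beyond printing each episode, a trace can be reduced to a short summary. The following sketch uses only the reward and failure_type fields from the schema; summarize_trace is a hypothetical helper that accepts any iterable of JSONL lines, including an open file.

```python
import json
from collections import Counter


def summarize_trace(lines):
    """Return (mean reward, failure_type counts) for one trace."""
    rewards = []
    failures = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        episode = json.loads(line)
        rewards.append(episode["reward"])
        failures[episode["failure_type"]] += 1
    mean_reward = sum(rewards) / len(rewards) if rewards else 0.0
    return mean_reward, dict(failures)


# Hypothetical trace lines, shortened to the two fields used here.
sample = [
    '{"reward": 0.4, "failure_type": "none"}',
    '{"reward": 0.0, "failure_type": "parse_error"}',
    '{"reward": 0.5, "failure_type": "none"}',
]
print(summarize_trace(sample))
```

Counting failure_type values separately from the mean helps distinguish a model that solves the task poorly from one whose outputs fail to parse.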
CI integration
Drop a verifiable run into any CI pipeline to gate model changes on verified capability checks, not memorised benchmarks. Uploading the trace artifact takes one extra step:
    # .github/workflows/verifiable.yml
    name: verifiable-eval
    on: [push, pull_request]
    jobs:
      benchmark:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with: { python-version: "3.11" }
          - run: pip install verifiable-labs
          - name: Run benchmark
            env:
              OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
            run: |
              verifiable run \
                --env sparse-fourier-recovery \
                --model openai/gpt-4o-mini \
                --episodes 5 --seed 42
          - uses: actions/upload-artifact@v4
            with:
              name: traces
              path: ~/.verifiable/runs/
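The workflow above runs the evaluation and uploads the traces, but it does not by itself fail the build on a poor score. One way to add a gate, sketched here as an assumption and not a built-in CLI feature, is a small script that reads the reward field from the traces and returns a non-zero exit code below a threshold. The names mean_reward and gate, and the threshold value, are hypothetical.

```python
import json
from pathlib import Path


def mean_reward(run_dir: Path) -> float:
    """Average the reward field over every episode in every trace in run_dir."""
    rewards = []
    for trace in sorted(run_dir.glob("*.jsonl")):
        with open(trace) as f:
            rewards.extend(json.loads(line)["reward"] for line in f if line.strip())
    if not rewards:
        raise ValueError(f"no episodes found under {run_dir}")
    return sum(rewards) / len(rewards)


def gate(run_dir: Path, min_reward: float) -> int:
    """Return a process exit code: 0 if the mean reward meets the threshold, else 1."""
    score = mean_reward(run_dir)
    print(f"mean reward {score:.3f} (threshold {min_reward:.3f})")
    return 0 if score >= min_reward else 1


# In a CI script, after `verifiable run`, one might call:
#   import sys
#   sys.exit(gate(Path.home() / ".verifiable" / "runs", min_reward=0.30))
# The 0.30 threshold is a placeholder; pick one suited to your model and env.
```

With only a handful of episodes the mean is noisy, so a threshold set close to the expected score will produce intermittent failures; a larger --episodes value or a looser threshold reduces that.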
Troubleshooting
    ERROR: Could not find a version that satisfies the requirement verifiable-labs

Your Python is older than 3.11. Check with python3 --version. Install 3.11 (brew install python@3.11 on macOS) and create a venv with python3.11 -m venv .venv.

    ❌ ANTHROPIC_API_KEY not set

The model prefix routes to a provider whose key isn’t exported. Either set the key (export ANTHROPIC_API_KEY=…), run verifiable login, or switch to a model from a provider you have a key for.

    JSON parse error: Expecting ‘,’ delimiter at position …

The model emitted malformed JSON. Image envs are now lenient (pad / truncate by ≤4 entries). For other envs, raise --episodes so a single bad response doesn’t dominate, or switch to a stronger model.

    Coverage: 0.7XX (target 0.90) ✗

Your sample is too small for the conformal calibrator. Use --episodes 50 or higher and re-run.

    Don’t know how to route model ‘…’

The model name doesn’t match a known prefix. Use claude-*, gpt-*, o1/o3/o4, gemini-*, or a provider/model form for OpenRouter.
Citation & license
Apache-2.0 for the SDK and CLI. CC-BY-4.0 for the paper. Cite the frozen Zenodo record:
    @misc{zacharioudakis2026verifiable,
      title  = {Conformal-Calibrated Rewards for Scientific RL},
      author = {Zacharioudakis, Stelios},
      year   = {2026},
      doi    = {10.5281/zenodo.19786415},
      url    = {https://zenodo.org/records/19786415},
    }
Next steps
- → Browse all 25 environments with baselines and verified scores.
- → Read the paper to understand the conformal-calibration protocol.
- → Star the GitHub repo and follow releases.