## Install

Sync project dependencies.

```shell
uv sync --group dev
```
## Benchmark Overview

A table of historical live benchmark results, plus the minimal setup needed to run replay benchmarks locally.
| Model | Status | Avg. score / 10 | Notes |
|---|---|---|---|
| gpt-oss-120b | done | 4.3 | Historical live runs |
| glm-4.7 | done | 3.3 | Historical live runs |
| qwen3-30b-a3b-instruct-2507 | done | 2.9 | Historical live runs |
| qwen3-30b-a3b-thinking-2507 | done | 3.8 | Historical live runs |
| mistral-7b-instruct | done | 1.2 | Historical live runs |
| gemini-2.5-flash | done | 10.0 | Historical live runs |
| gpt-4.1-mini | in progress | pending | OpenAI provider wired, benchmark run pending |
| gpt-4.1 | in progress | pending | OpenAI-compatible path available |
| gemini-2.5-pro | in progress | pending | Gemini provider path ready |
| claude-3.7-sonnet | todo | pending | Add Anthropic provider integration |
| llama-3.3-70b-instruct | todo | pending | Add hosted endpoint and benchmark run |
| deepseek-r1 | todo | pending | Add provider path and cost tracking |
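For a quick sanity check on the completed rows, the scores can be averaged with a short pipeline (a sketch; the values are copied from the table above, and `awk` is assumed to be available):

```shell
# Average the six completed scores from the benchmark table.
printf '%s\n' 4.3 3.3 2.9 3.8 1.2 10.0 |
  awk '{ sum += $1 } END { printf "avg over %d models: %.2f\n", NR, sum / NR }'
# → avg over 6 models: 4.25
```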
## Quickstart

Sync project dependencies.

```shell
uv sync --group dev
```
Set API keys in your shell environment.

```shell
export DUEL_API_KEY=replace-me
export GEMINI_API_KEY=replace-me
```
Run a sample offline (replay) benchmark.

```shell
uv run duel benchmark \
  --source replay \
  --dataset examples/replay_sample.json \
  --provider oracle \
  --runs 2
```
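When comparing run counts, a dry-run loop that prints each full command before anything executes can be useful. A hypothetical sketch (the flags are copied from the command above; replace `echo` with the real invocation once the printed commands look right):

```shell
# Print the benchmark command for several run counts without executing it.
for runs in 1 2 5; do
  echo uv run duel benchmark \
    --source replay \
    --dataset examples/replay_sample.json \
    --provider oracle \
    --runs "$runs"
done
```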
Source of truth for this page: README benchmark tracker and quickstart snippets.