## Install

Sync project dependencies.

```shell
uv sync --group dev
```
## Benchmark Overview

A table of historical live benchmark results, plus the minimal setup needed to run replay benchmarks locally.
| Model | Status | Avg. score / 10 | Notes |
|---|---|---|---|
| gpt-oss-120b | done | 4.3 | Historical live runs |
| glm-4.7 | done | 3.3 | Historical live runs |
| qwen3-30b-a3b-instruct-2507 | done | 2.9 | Historical live runs |
| qwen3-30b-a3b-thinking-2507 | done | 3.8 | Historical live runs |
| mistral-7b-instruct | done | 1.2 | Historical live runs |
| gemini-2.5-flash | done | 10.0 | Historical live runs |
| gpt-4.1-mini | in progress | pending | OpenAI provider wired, benchmark run pending |
| gpt-4.1 | in progress | pending | OpenAI-compatible path available |
| gemini-2.5-pro | in progress | pending | Gemini provider path ready |
| claude-3.7-sonnet | todo | pending | Add Anthropic provider integration |
| llama-3.3-70b-instruct | todo | pending | Add hosted endpoint and benchmark run |
| deepseek-r1 | todo | pending | Add provider path and cost tracking |
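For a quick sanity check on the completed rows, the scores can be averaged with a short pipeline (a sketch; the values are copied from the table above, and `awk` is assumed to be available):

```shell
# Average the six completed scores from the benchmark table.
printf '%s\n' 4.3 3.3 2.9 3.8 1.2 10.0 |
  awk '{ sum += $1 } END { printf "avg over %d models: %.2f\n", NR, sum / NR }'
# → avg over 6 models: 4.25
```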
## Quickstart

Sync project dependencies.

```shell
uv sync --group dev
```
Set API keys in your shell environment.

```shell
export DUEL_API_KEY=replace-me
export GEMINI_API_KEY=replace-me
```
Run a sample offline (replay) benchmark.

```shell
uv run duel benchmark \
  --source replay \
  --dataset examples/replay_sample.json \
  --provider oracle \
  --runs 2
```
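When comparing run counts, a dry-run loop that prints each full command before anything executes can be useful. A hypothetical sketch (the flags are copied from the command above; replace `echo` with the real invocation once the printed commands look right):

```shell
# Print the benchmark command for several run counts without executing it.
for runs in 1 2 5; do
  echo uv run duel benchmark \
    --source replay \
    --dataset examples/replay_sample.json \
    --provider oracle \
    --runs "$runs"
done
```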
Source of truth for this page: README benchmark tracker and quickstart snippets.