# Benchmark Overview

Duel model tracker: the historical live benchmark table and a minimal setup path for running replay benchmarks locally.

| Model | Status | Avg. score / 10 | Notes |
| --- | --- | --- | --- |
| gpt-oss-120b | done | 4.3 | Historical live runs |
| glm-4.7 | done | 3.3 | Historical live runs |
| qwen3-30b-a3b-instruct-2507 | done | 2.9 | Historical live runs |
| qwen3-30b-a3b-thinking-2507 | done | 3.8 | Historical live runs |
| mistral-7b-instruct | done | 1.2 | Historical live runs |
| gemini-2.5-flash | done | 10.0 | Historical live runs |
| gpt-4.1-mini | in progress | pending | OpenAI provider wired, benchmark run pending |
| gpt-4.1 | in progress | pending | OpenAI-compatible path available |
| gemini-2.5-pro | in progress | pending | Gemini provider path ready |
| claude-3.7-sonnet | todo | pending | Add Anthropic provider integration |
| llama-3.3-70b-instruct | todo | pending | Add hosted endpoint and benchmark run |
| deepseek-r1 | todo | pending | Add provider path and cost tracking |

## Setup

### Install

Sync project dependencies.

```shell
uv sync --group dev
```
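Before syncing, it can help to confirm that `uv` is actually on your `PATH`. A minimal sanity check, using only the POSIX `command -v` builtin (the installer URL is uv's official documentation site, not part of this project):

```shell
# Check whether uv is installed; print its version if so,
# otherwise point at the installer without failing the script.
if command -v uv >/dev/null 2>&1; then
  uv --version
else
  echo "uv not found; install it from https://docs.astral.sh/uv/" >&2
fi
```

This exits successfully either way, so it is safe to prepend to any setup script.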

### Configure providers

Set API keys in your shell environment.

```shell
export DUEL_API_KEY=replace-me
export GEMINI_API_KEY=replace-me
```
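A hypothetical pre-flight helper (not part of the `duel` CLI) can fail fast when a key is unset, rather than midway through a benchmark run. This sketch uses bash indirect expansion (`${!name}`), so it assumes bash rather than plain `sh`:

```shell
# Return non-zero and name the first missing variable, if any.
require_env() {
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then   # bash indirect expansion
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
  return 0
}

# Placeholder values, mirroring the export lines above.
export DUEL_API_KEY="${DUEL_API_KEY:-replace-me}"
export GEMINI_API_KEY="${GEMINI_API_KEY:-replace-me}"

require_env DUEL_API_KEY GEMINI_API_KEY && echo "providers configured"
```

Swap the placeholder values for real keys before a live run; the check only verifies that the variables are non-empty, not that the keys are valid.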

### Run replay benchmark

Execute a sample offline run.

```shell
uv run duel benchmark \
  --source replay \
  --dataset examples/replay_sample.json \
  --provider oracle \
  --runs 2
```
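To compare variance across repeat counts, the invocation above can be wrapped in a small sweep. This is a dry-run sketch that only `echo`s each command, reusing exclusively the flags shown in the quickstart (`--source`, `--dataset`, `--provider`, `--runs`); drop the `echo` to execute for real:

```shell
# Print one benchmark invocation per repeat count; the 1/2/4 sweep
# values are illustrative, not a duel CLI convention.
for runs in 1 2 4; do
  echo uv run duel benchmark \
    --source replay \
    --dataset examples/replay_sample.json \
    --provider oracle \
    --runs "$runs"
done
```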

Source of truth for this page: README benchmark tracker and quickstart snippets.