benchmark-results — claude-vs-codex
$ run claude-vs-codex --diff

Claude vs Codex Benchmark

2 projects · 4 stages · 2 independent judges
Evaluated Models
claude = claude-opus-4.6
codex = gpt-5.4
Judge Models
J1 = gpt-5.4-mini
J2 = claude-sonnet-4.6
Both judges agree on the overall result: codex wins. Two independent models evaluated the same outputs and reached the same per-pair winner in 7 of 8 project/stage pairs. The one split decision was Project 1 / refine, where J1 picked codex and J2 picked claude (see the derivation sketch after the raw-score table below).
Normalized scores (0–1) by judge and stage:

Stage     J1 Claude  J1 Codex  J2 Claude  J2 Codex
overall   0.647      0.989     0.754      0.976
refine    0.80       1.00      0.90       0.95
explore   0.37       1.00      0.50       1.00
plan      0.64       0.95      0.73       0.95
do        0.78       1.00      0.89       1.00
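
The overall numbers are consistent with an unweighted mean of the four stage scores. Below is a minimal sketch of that aggregation, assuming equal stage weights; the benchmark's actual weighting is not documented here, and the dictionary layout is illustrative, not the harness's real schema.

# Sketch: overall score per judge/model as the unweighted mean of the
# four stage scores above. Equal weighting is an assumption; the results
# match the headline numbers to within display rounding.

stage_scores = {
    ("J1", "claude"): [0.80, 0.37, 0.64, 0.78],  # refine, explore, plan, do
    ("J1", "codex"):  [1.00, 1.00, 0.95, 1.00],
    ("J2", "claude"): [0.90, 0.50, 0.73, 0.89],
    ("J2", "codex"):  [0.95, 1.00, 0.95, 1.00],
}

for (judge, model), scores in stage_scores.items():
    overall = sum(scores) / len(scores)
    print(f"{judge} {model}: {overall:.3f}")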
Raw scores by project and stage:

Project    Stage    J1 Claude  J1 Codex  J1 Winner  J2 Claude  J2 Codex  J2 Winner
Project 1  refine   3          5         codex      5          4         claude (split)
Project 1  explore  2          5         codex      3          5         codex
Project 1  plan     2          5         codex      3          5         codex
Project 1  do       2          5         codex      3          4         codex
Project 2  refine   4          5         codex      4          5         codex
Project 2  explore  1          4         codex      3          5         codex
Project 2  plan     2          5         codex      3          5         codex
Project 2  do       1          2         codex      3          4         codex
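
The winner columns and the split flag follow mechanically from the raw scores. Here is a minimal sketch of that derivation; the rows are transcribed from the table above, and the tuple layout and names are illustrative rather than the benchmark's actual data format.

# Sketch: derive per-judge winners and inter-judge splits from raw scores.
# Rows transcribed from the results table; the structure is illustrative.

rows = [
    # (project, stage, (J1 claude, J1 codex), (J2 claude, J2 codex))
    ("Project 1", "refine",  (3, 5), (5, 4)),
    ("Project 1", "explore", (2, 5), (3, 5)),
    ("Project 1", "plan",    (2, 5), (3, 5)),
    ("Project 1", "do",      (2, 5), (3, 4)),
    ("Project 2", "refine",  (4, 5), (4, 5)),
    ("Project 2", "explore", (1, 4), (3, 5)),
    ("Project 2", "plan",    (2, 5), (3, 5)),
    ("Project 2", "do",      (1, 2), (3, 4)),
]

def winner(claude, codex):
    if claude == codex:
        return "tie"
    return "claude" if claude > codex else "codex"

agreements = 0
for project, stage, j1, j2 in rows:
    w1, w2 = winner(*j1), winner(*j2)
    if w1 == w2:
        agreements += 1
    else:
        print(f"SPLIT: {project} / {stage}: J1={w1}, J2={w2}")

print(f"judges agree on {agreements}/{len(rows)} project/stage pairs")  # 7/8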