$ run claude-vs-codex --diff
Claude vs Codex Benchmark
2 projects · 4 stages · 2 independent judges
Evaluated Models
- claude = claude-opus-4.6
- codex = gpt-5.4

Judge Models
- J1 = gpt-5.4-mini
- J2 = claude-sonnet-4.6
Both judges reach the same overall verdict: codex wins. Two independent models evaluated the same outputs and agreed on the winner in 7 of 8 project/stage pairs.

1 split decision:
Project 1 / refine: J1 picked codex; J2 picked claude (recomputed in the sketch after the score table below).
Pass Rates

| Judge | Claude | Codex |
|---|---|---|
| J1 | 0.647 | 0.989 |
| J2 | 0.754 | 0.976 |
Pass Rate by Stage

[chart: per-stage pass rates (refine, explore, plan, do) for claude vs codex, one panel per judge (J1, J2)]
Judge Scores (1–5)
| Project | Stage | J1 Claude | J1 Codex | J1 Winner | J2 Claude | J2 Codex | J2 Winner |
|---|---|---|---|---|---|---|---|
| Project 1 | refine | 3 | 5 | codex | 5 | 4 | claude (split) |
| Project 1 | explore | 2 | 5 | codex | 3 | 5 | codex |
| Project 1 | plan | 2 | 5 | codex | 3 | 5 | codex |
| Project 1 | do | 2 | 5 | codex | 3 | 4 | codex |
| Project 2 | refine | 4 | 5 | codex | 4 | 5 | codex |
| Project 2 | explore | 1 | 4 | codex | 3 | 5 | codex |
| Project 2 | plan | 2 | 5 | codex | 3 | 5 | codex |
| Project 2 | do | 1 | 2 | codex | 3 | 4 | codex |
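As a sanity check, the per-pair winners and the single split decision can be recomputed directly from the score table. A minimal sketch in Python; the scores are copied verbatim from the table, while the tie handling is an assumption (no ties occur in this data):

```python
# Each row: (project, stage, J1 claude, J1 codex, J2 claude, J2 codex)
ROWS = [
    ("Project 1", "refine",  3, 5, 5, 4),
    ("Project 1", "explore", 2, 5, 3, 5),
    ("Project 1", "plan",    2, 5, 3, 5),
    ("Project 1", "do",      2, 5, 3, 4),
    ("Project 2", "refine",  4, 5, 4, 5),
    ("Project 2", "explore", 1, 4, 3, 5),
    ("Project 2", "plan",    2, 5, 3, 5),
    ("Project 2", "do",      1, 2, 3, 4),
]

def winner(claude: int, codex: int) -> str:
    """Higher score wins; equal scores are a tie (assumed behavior)."""
    if claude == codex:
        return "tie"
    return "claude" if claude > codex else "codex"

splits = 0
for project, stage, j1_c, j1_x, j2_c, j2_x in ROWS:
    w1, w2 = winner(j1_c, j1_x), winner(j2_c, j2_x)
    if w1 != w2:
        splits += 1
        print(f"SPLIT  {project} / {stage}: J1={w1}, J2={w2}")

print(f"agreement: {len(ROWS) - splits} of {len(ROWS)} pairs")
# Output:
#   SPLIT  Project 1 / refine: J1=codex, J2=claude
#   agreement: 7 of 8 pairs
```

This reproduces the headline claim: the judges agree on 7 of 8 project/stage pairs, disagreeing only on Project 1 / refine.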