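Preview leaderboard (placeholder data): the rubric is finalized, and these scores will be replaced with C-level measured scores after the official dual-judge scoring run. Do not cite these rankings externally.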

Math Reasoning Scenario Leaderboard

Sample size N=12 · 2026-05-20 · Methodology · docs/benchmarks/cn-math-reasoning.md

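| Rank | Model | Vendor | Total | Answer | Steps | Presentation | Eval date |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek-R1 | DeepSeek | 4.62 | 4.59 | 4.68 | 4.56 | 2026-05-20 |
| 2 | o3-mini | OpenAI | 4.58 | 4.65 | 4.54 | 4.62 | 2026-05-20 |
| 3 | Claude 3.7 Sonnet | Anthropic | 4.55 | 4.52 | 4.61 | 4.49 | 2026-05-20 |
| 4 | GPT-4.1 | OpenAI | 4.52 | 4.59 | 4.48 | 4.56 | 2026-05-20 |
| 5 | Qwen3-Max | Alibaba Cloud | 4.45 | 4.37 | 4.46 | 4.55 | 2026-05-20 |
| 6 | DeepSeek-V3 | DeepSeek | 4.40 | 4.37 | 4.46 | 4.34 | 2026-05-20 |
| 7 | GPT-4o | OpenAI | 4.35 | 4.32 | 4.41 | 4.29 | 2026-05-20 |
| 8 | GLM-4-Plus | Zhipu AI | 4.28 | 4.30 | 4.18 | 4.27 | 2026-05-20 |
| 9 | Kimi (latest tier) | Moonshot AI | 4.22 | 4.30 | 4.17 | 4.27 | 2026-05-20 |
| 10 | QwQ-32B | Alibaba Cloud | 4.15 | 4.23 | 4.10 | 4.20 | 2026-05-20 |
| 11 | GPT-4o-mini | OpenAI | 3.88 | 3.95 | 3.84 | 3.92 | 2026-05-20 |
| 12 | Gemini 2.0 Flash | Google | 3.82 | 3.74 | 3.83 | 3.92 | 2026-05-20 |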

Placeholder scores · Chinese word problems · Do not cite before the official dual-judge run