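Preview leaderboard (placeholder data): the rubric is finalized, and these scores will be replaced with C-level measured scores after the official dual-judge run. Do not cite these rankings externally.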

Code Generation Leaderboard

Sample size N=15 · 2026-05-20 · Methodology · docs/benchmarks/cn-code-generation.md

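| Rank | Model | Vendor | Total | Correctness | Readability | Edge-case handling | Efficiency | Evaluation date |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude 3.7 Sonnet | Anthropic | 4.55 | 4.52 | 4.61 | 4.49 | 4.58 | 2026-05-20 |
| 2 | GPT-4.1 | OpenAI | 4.52 | 4.59 | 4.48 | 4.56 | 4.45 | 2026-05-20 |
| 3 | DeepSeek-V3 | DeepSeek | 4.48 | 4.45 | 4.54 | 4.42 | 4.51 | 2026-05-20 |
| 4 | DeepSeek-R1 | DeepSeek | 4.45 | 4.37 | 4.46 | 4.55 | 4.43 | 2026-05-20 |
| 5 | Qwen2.5-Max | Alibaba Cloud | 4.42 | 4.44 | 4.32 | 4.41 | 4.50 | 2026-05-20 |
| 6 | GPT-4o | OpenAI | 4.38 | 4.30 | 4.39 | 4.48 | 4.36 | 2026-05-20 |
| 7 | GLM-4-Plus | Zhipu AI | 4.35 | 4.32 | 4.41 | 4.29 | 4.38 | 2026-05-20 |
| 8 | Claude 3.5 Sonnet | Anthropic | 4.32 | 4.29 | 4.38 | 4.26 | 4.35 | 2026-05-20 |
| 9 | Codestral | Mistral | 4.28 | 4.30 | 4.18 | 4.27 | 4.36 | 2026-05-20 |
| 10 | DeepSeek-Coder-V2 | DeepSeek | 4.25 | 4.22 | 4.31 | 4.19 | 4.28 | 2026-05-20 |
| 11 | Qwen3-Max | Alibaba Cloud | 4.20 | 4.17 | 4.26 | 4.14 | 4.23 | 2026-05-20 |
| 12 | GPT-4o-mini | OpenAI | 3.95 | 4.02 | 3.91 | 3.99 | 3.88 | 2026-05-20 |
| 13 | Llama 3.3 70B Instruct | Meta | 3.88 | 3.95 | 3.84 | 3.92 | 3.81 | 2026-05-20 |
| 14 | Mistral Large | Mistral | 3.82 | 3.74 | 3.83 | 3.92 | 3.80 | 2026-05-20 |
| 15 | InternLM2.5-20B | Shanghai AI Laboratory | 3.75 | 3.67 | 3.76 | 3.85 | 3.73 | 2026-05-20 |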

Placeholder scores · formula-simulated · do not cite before the official run
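
The page does not state how the total score is derived from the four dimension scores. In the placeholder rows it looks consistent with an unweighted mean shown to two decimals, but that is an assumption, not the documented rubric. A minimal Python sketch of that check, using a few rows copied from the table (the names `ROWS` and `total_matches_mean` are hypothetical):

```python
from statistics import mean

# Rows copied from the placeholder table:
# (model, total, correctness, readability, edge-case handling, efficiency)
ROWS = [
    ("Claude 3.7 Sonnet", 4.55, 4.52, 4.61, 4.49, 4.58),
    ("GPT-4.1", 4.52, 4.59, 4.48, 4.56, 4.45),
    ("DeepSeek-R1", 4.45, 4.37, 4.46, 4.55, 4.43),
    ("InternLM2.5-20B", 3.75, 3.67, 3.76, 3.85, 3.73),
]


def total_matches_mean(row, tol=0.005):
    """Assumed rule (not stated on the page): total is the unweighted
    mean of the four dimension scores, displayed to two decimals."""
    _, total, *dims = row
    # A two-decimal display can differ from the exact mean by up to 0.005;
    # the small epsilon absorbs floating-point noise.
    return abs(mean(dims) - total) <= tol + 1e-9


# Collect the models whose total does not fit the assumed rule.
mismatches = [row[0] for row in ROWS if not total_matches_mean(row)]
print(mismatches)
```

If the official rubric weights the dimensions differently, this check would need the real weights; it only tests the assumed rule against the placeholder numbers.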