From 33744d95444371d712635019daee581ddb7974a2 Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Mon, 20 Oct 2025 10:15:42 +0300
Subject: [PATCH] docs: update Claude Sonnet 4.5 thinking benchmark results and reorder model rankings

---
 docs/docs/pr_benchmark/index.md | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index aa226291..4190d7c7 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -77,10 +77,10 @@ A list of the models used for generating the baseline suggestions, and example r
 44.3
-Grok-4
-2025-07-09
-unknown
-41.7
+Claude-sonnet-4.5
+2025-09-29
+4096
+44.2
 Claude-sonnet-4.5
@@ -112,6 +112,12 @@ A list of the models used for generating the baseline suggestions, and example r
 33.5
+
+Grok-4
+2025-07-09
+unknown
+32.8
+
 Claude-4-opus-20250514
 2025-05-14
@@ -188,6 +194,22 @@ weaknesses:
 - **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
 - **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.
 
+### Claude-sonnet-4.5 (4096 thinking tokens)
+
+Final score: **44.2**
+
+strengths:
+
+- **High precision / low noise:** When the model does offer fixes they are usually correct, concise and confined to the new '+' lines, rarely introducing spurious or off-scope changes.
+- **Clear, actionable patches:** Suggestions come with well-explained reasoning and minimal but valid code snippets, making them easy for a reviewer to apply.
+- **Good rule compliance:** It almost always respects the 1-3 suggestion limit, avoids touching unchanged code and seldom violates formatting or other task guidelines.
+
+weaknesses:
+
+- **Low recall / frequent omissions:** In a large share of cases the model returns an empty list or only one minor tip while overlooking obvious, higher-impact regressions found by peers.
+- **Narrow coverage when it does respond:** Even in non-empty outputs it typically fixes a single issue and ignores related defects in the same diff, indicating shallow analysis.
+- **Occasional harmful or incomplete fixes:** A few suggestions introduce new errors (e.g., wrong logic, missing imports, malformed snippets) or mark non-critical style nits as "critical", reducing trust.
+
 ### Claude-sonnet-4.5
 
 Final score: **40.7**