diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index aa226291..4190d7c7 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -77,10 +77,10 @@ A list of the models used for generating the baseline suggestions, and example r
44.3 |
- | Grok-4 |
- 2025-07-09 |
- unknown |
- 41.7 |
+ Claude-sonnet-4.5 |
+ 2025-09-29 |
+ 4096 |
+ 44.2 |
| Claude-sonnet-4.5 |
@@ -112,6 +112,12 @@ A list of the models used for generating the baseline suggestions, and example r
|
33.5 |
+
+ | Grok-4 |
+ 2025-07-09 |
+ unknown |
+ 32.8 |
+
| Claude-4-opus-20250514 |
2025-05-14 |
@@ -188,6 +194,22 @@ weaknesses:
- **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
- **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.
+### Claude-sonnet-4.5 (4096 thinking tokens)
+
+Final score: **44.2**
+
+strengths:
+
+- **High precision / low noise:** When the model does offer fixes, they are usually correct, concise, and confined to the new '+' lines, rarely introducing spurious or off-scope changes.
+- **Clear, actionable patches:** Suggestions come with well-explained reasoning and minimal but valid code snippets, making them easy for a reviewer to apply.
+- **Good rule compliance:** It almost always respects the 1-3 suggestion limit, avoids touching unchanged code and seldom violates formatting or other task guidelines.
+
+weaknesses:
+
+- **Low recall / frequent omissions:** In a large share of cases, the model returns an empty list or only one minor tip while overlooking obvious, higher-impact regressions found by peers.
+- **Narrow coverage when it does respond:** Even in non-empty outputs it typically fixes a single issue and ignores related defects in the same diff, indicating shallow analysis.
+- **Occasional harmful or incomplete fixes:** A few suggestions introduce new errors (e.g., wrong logic, missing imports, malformed snippets) or mark non-critical style nits as "critical", reducing trust.
+
### Claude-sonnet-4.5
Final score: **40.7**