diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 090f3871..653f725e 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -70,12 +70,24 @@ A list of the models used for generating the baseline suggestions, and example r
'medium' (8000) |
57.7 |
+
+ | Gemini-3-pro-review |
+ 2025-11-18 |
+ high |
+ 57.3 |
+
| Gemini-2.5-pro |
2025-06-05 |
4096 |
56.3 |
+
+ | Gemini-3-pro-review |
+ 2025-11-18 |
+ low |
+ 55.6 |
+
| Claude-haiku-4.5 |
2025-10-01 |
@@ -218,6 +230,23 @@ Weaknesses:
- **False or harmful fixes:** Several answers introduce new compilation errors, propose out-of-scope changes, or violate explicit rules (e.g., adding imports, version bumps, touching untouched lines), reducing trustworthiness.
- **Shallow coverage:** Even when it identifies one real issue it often stops there, missing additional critical problems found by stronger peers; breadth and depth are inconsistent.
+### Gemini-3-pro-review (high thinking budget)
+
+Final score: **57.3**
+
+Strengths:
+
+- **Good schema & format discipline:** Consistently returns well-formed YAML with correct fields and respects the 3-suggestion limit; rarely breaks the required output structure.
+- **Reasonable guideline awareness:** Often recognises when a diff contains only data / translations and properly emits an empty list, avoiding over-reporting.
+- **Clear, actionable patches when correct:** When it does find a bug it usually supplies minimal-diff, compilable code snippets with concise explanations, and occasionally surfaces issues no other model spotted.
+
+Weaknesses:
+
+- **Spot-coverage gaps on critical defects:** In a large share of cases it overlooks the principal regression the tests were written for, while fixating on minor style or performance nits.
+- **False or speculative fixes:** A noticeable number of answers invent non-existent problems or propose changes that would not compile or would re-introduce removed behaviour.
+- **Guideline violations creep in:** Sometimes touches unchanged lines, adds forbidden imports / labels, or supplies more than "critical" advice, showing imperfect rule adherence.
+- **High variance / inconsistency:** Quality swings from best-in-class to harmful within consecutive examples, indicating unstable defect-prioritisation and review depth.
+
### Gemini-2.5 Pro (4096 thinking tokens)
Final score: **56.3**
@@ -236,6 +265,23 @@ Weaknesses:
- **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
- **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.
+### Gemini-3-pro-review (low thinking budget)
+
+Final score: **55.6**
+
+Strengths:
+
+- **Concise, well-structured patches:** Suggestions are usually expressed in short, self-contained YAML items with clear before/after code blocks and just enough rationale, making them easy for reviewers to apply.
+- **Good eye for crash-level defects:** When the model does spot a problem it often focuses on high-impact issues such as compile-time errors, NPEs, nil-pointer races, buffer overflows, etc., and supplies a minimal, correct fix.
+- **High guideline compliance (format & scope):** In most cases it respects the 1-3-item limit and the "new lines only" rule, avoids changing imports, and keeps snippets syntactically valid.
+
+Weaknesses:
+
+- **Coverage inconsistency:** Many answers miss other obvious or even more critical regressions spotted by peers; breadth fluctuates from excellent to empty, leaving reviewers with partial insight.
+- **False positives & speculative advice:** A noticeable share of suggestions target stylistic or non-critical tweaks, or even introduce wrong changes, betraying occasional mis-reading of the diff and hurting trust.
+- **Rule violations still occur:** There are repeated instances of touching unchanged code, recommending version bumps/imports, mis-labelling severities, or outputting malformed snippets—showing lapses in instruction adherence.
+- **Quality variance / empty outputs:** Some responses provide no suggestions despite real bugs, while others supply harmful fixes; this volatility lowers overall reliability.
+
### Claude-haiku-4.5 (4096 thinking tokens)
Final score: **48.8**