From edd9ef9d4fab760d8a36242ff1457d6498f5e0fb Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Thu, 20 Nov 2025 11:32:09 +0200
Subject: [PATCH] docs: add Gemini-3-pro-review benchmark results to PR benchmark documentation

---
 docs/docs/pr_benchmark/index.md | 46 +++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 090f3871..653f725e 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -70,12 +70,24 @@ A list of the models used for generating the baseline suggestions, and example r
 'medium' (8000)
 57.7
+
+ Gemini-3-pro-review
+ 2025-11-18
+ high
+ 57.3
+
 Gemini-2.5-pro
 2025-06-05
 4096
 56.3
+
+ Gemini-3-pro-review
+ 2025-11-18
+ low
+ 55.6
+
 Claude-haiku-4.5
 2025-10-01
@@ -218,6 +230,23 @@ Weaknesses:
 - **False or harmful fixes:** Several answers introduce new compilation errors, propose out-of-scope changes, or violate explicit rules (e.g., adding imports, version bumps, touching untouched lines), reducing trustworthiness.
 - **Shallow coverage:** Even when it identifies one real issue it often stops there, missing additional critical problems found by stronger peers; breadth and depth are inconsistent.
 
+### Gemini-3-pro-review (high thinking budget)
+
+Final score: **57.3**
+
+Strengths:
+
+- **Good schema & format discipline:** Consistently returns well-formed YAML with correct fields and respects the 3-suggestion limit; rarely breaks the required output structure.
+- **Reasonable guideline awareness:** Often recognises when a diff contains only data / translations and properly emits an empty list, avoiding over-reporting.
+- **Clear, actionable patches when correct:** When it does find a bug it usually supplies minimal-diff, compilable code snippets with concise explanations, and occasionally surfaces issues no other model spotted.
+
+Weaknesses:
+
+- **Spot-coverage gaps on critical defects:** In a large share of cases it overlooks the principal regression the tests were written for, while fixating on minor style or performance nits.
+- **False or speculative fixes:** A noticeable number of answers invent non-existent problems or propose changes that would not compile or would re-introduce removed behaviour.
+- **Guideline violations creep in:** Sometimes touches unchanged lines, adds forbidden imports / labels, or supplies more than "critical" advice, showing imperfect rule adherence.
+- **High variance / inconsistency:** Quality swings from best-in-class to harmful within consecutive examples, indicating unstable defect-prioritisation and review depth.
+
 ### Gemini-2.5 Pro (4096 thinking tokens)
 
 Final score: **56.3**
@@ -236,6 +265,23 @@ Weaknesses:
 - **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
 - **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.
 
+### Gemini-3-pro-review (low thinking budget)
+
+Final score: **55.6**
+
+Strengths:
+
+- **Concise, well-structured patches:** Suggestions are usually expressed in short, self-contained YAML items with clear before/after code blocks and just enough rationale, making them easy for reviewers to apply.
+- **Good eye for crash-level defects:** When the model does spot a problem it often focuses on high-impact issues such as compile-time errors, NPEs, nil-pointer races, buffer overflows, etc., and supplies a minimal, correct fix.
+- **High guideline compliance (format & scope):** In most cases it respects the 1-3-item limit and the "new lines only" rule, avoids changing imports, and keeps snippets syntactically valid.
+
+Weaknesses:
+
+- **Coverage inconsistency:** Many answers miss other obvious or even more critical regressions spotted by peers; breadth fluctuates from excellent to empty, leaving reviewers with partial insight.
+- **False positives & speculative advice:** A noticeable share of suggestions target stylistic or non-critical tweaks, or even introduce wrong changes, betraying occasional mis-reading of the diff and hurting trust.
+- **Rule violations still occur:** There are repeated instances of touching unchanged code, recommending version bumps/imports, mis-labelling severities, or outputting malformed snippets, showing lapses in instruction adherence.
+- **Quality variance / empty outputs:** Some responses provide no suggestions despite real bugs, while others supply harmful fixes; this volatility lowers overall reliability.
+
 ### Claude-haiku-4.5 (4096 thinking tokens)
 
 Final score: **48.8**