From eebdeea9f9d10d70491beb7cfce95b647695c4ec Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Mon, 20 Oct 2025 10:56:29 +0300
Subject: [PATCH] docs: Restore Claude Haiku 4.5 benchmark results and analysis to PR benchmark documentation

---
 docs/docs/pr_benchmark/index.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index b8f9ef32..3c23f380 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -94,6 +94,12 @@ A list of the models used for generating the baseline suggestions, and example r
 
 40.7
 
+
+ Claude-haiku-4.5
+ 2025-10-01
+
+ 40.7
+
 
 Claude-4-sonnet
 2025-05-14
@@ -251,6 +257,23 @@ weaknesses:
 - **Priority mistakes:** The model often downgrades severe defects to “general” or upgrades cosmetic nits to “critical”, showing weak bug-severity judgment.
 - **Inconsistent quality:** Performance swings widely between excellent and poor; reviewers cannot predict whether a given answer will be thorough, partial, or incorrect.
 
+### Claude-haiku-4.5
+
+Final score: **40.7**
+
+Strengths:
+
+- **Good format & clarity:** Consistently produces valid YAML and readable, minimally-intrusive patches with clear before/after snippets, so its outputs are easy to apply.
+- **Basic bug-spotting ability:** Often detects the most obvious new-line defect (e.g., syntax error, missing guard, wrong constant) and supplies a correct, concise fix; rarely ranks last in the set.
+- **Rule compliance in many cases:** Usually stays within the 3-suggestion limit, touches only '+' lines, and avoids speculative refactors, returning an empty list when no code was added.
+
+Weaknesses:
+
+- **Shallow coverage:** Frequently fixes just one surface-level issue and misses additional, higher-impact bugs that stronger reviewers catch, leaving regressions in place.
+- **Occasional incorrect or no-op patches:** A noticeable share of suggestions either leave code unchanged, contain invalid code, or introduce new errors, lowering trust.
+- **Guideline slips:** In several examples it edits unchanged lines, adds forbidden imports/version bumps, mis-labels severities, or supplies non-critical stylistic advice.
+- **Inconsistent diligence:** Roughly a quarter of the cases return an empty list despite real problems, while others duplicate existing PR changes, indicating weak diff comprehension.
+
 ### Claude-4 Sonnet (4096 thinking tokens)
 
 Final score: **39.7**