From cb1c82073b705fe9b1017bef6d6332b000ce04d5 Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Mon, 20 Oct 2025 10:42:23 +0300
Subject: [PATCH] docs: add Claude Haiku 4.5 benchmark results to PR benchmark documentation

---
 docs/docs/pr_benchmark/index.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 216b3206..b8f9ef32 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -70,6 +70,12 @@ A list of the models used for generating the baseline suggestions, and example r
 4096
 56.3
+
+ Claude-haiku-4.5
+ 2025-10-01
+ 4096
+ 48.8
+
 Gemini-2.5-pro
 2025-06-05
@@ -194,6 +200,23 @@ weaknesses:
 - **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
 - **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.
 
+### Claude-haiku-4.5 (4096 thinking tokens)
+
+Final score: **48.8**
+
+strengths:
+
+- **High precision on detected issues:** When the model does flag a problem it is usually a real, high-impact bug; many answers are judged equal or better than strong baselines because the proposed fix is correct, minimal and easy to apply.
+- **Language- and domain-agnostic competence:** It successfully diagnoses defects across a wide range of languages (Python, Go, C/C++, Rust, JS/TS, CSS, SQL, Markdown, etc.) and domains (backend logic, build files, tests, docs).
+- **Clear, actionable patches:** Suggested code is typically concise, well-explained and scoped exactly to the added lines, making it practical for reviewers to adopt.
+
+weaknesses:
+
+- **Low recall / narrow coverage:** The model often stops after one or two findings, leaving other obvious critical bugs unmentioned; in many examples stronger answers simply covered more ground.
+- **Occasional faulty or speculative fixes:** A non-trivial number of responses either mis-diagnose the issue or introduce new errors (e.g., wrong logic, undeclared imports), dropping them below baseline quality.
+- **Inconsistent output robustness:** Several cases show truncated or malformed responses, reducing value despite correct analysis elsewhere.
+- **Frequent false negatives:** The model sometimes returns an empty list even when clear regressions exist, indicating conservative behaviour that misses mandatory fixes.
+
 ### Claude-sonnet-4.5 (4096 thinking tokens)
 
 Final score: **44.2**