From 76ed9156d30341ea4395cd0c1c187e2f384d209f Mon Sep 17 00:00:00 2001
From: tomoya-kawaguchi
Date: Tue, 14 Oct 2025 11:33:52 +0900
Subject: [PATCH 1/5] fix: safe attribute access in review prompt template

---
 pr_agent/settings/pr_reviewer_prompts.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pr_agent/settings/pr_reviewer_prompts.toml b/pr_agent/settings/pr_reviewer_prompts.toml
index a2f2d8a8..f13e1aa2 100644
--- a/pr_agent/settings/pr_reviewer_prompts.toml
+++ b/pr_agent/settings/pr_reviewer_prompts.toml
@@ -217,7 +217,7 @@ Ticket Description:
 #####
 {%- endif %}
 
-{%- if ticket.requirements %}
+{%- if ticket.requirements is defined %}
 Ticket Requirements:
 #####
 {{ ticket.requirements }}

From 8c7712bf30d9c608a25bdff2b3f5e23e655b5086 Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Thu, 16 Oct 2025 15:38:11 +0300
Subject: [PATCH 2/5] docs: add Claude Haiku 4.5 benchmark results to PR benchmark documentation

---
 docs/docs/pr_benchmark/index.md | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index aa226291..1789d834 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -88,6 +88,12 @@ A list of the models used for generating the baseline suggestions, and example r
 40.7
+
+ Claude-haiku-4.5
+ 2025-10-01
+
+ 40.7
+
 Claude-4-sonnet
 2025-05-14
@@ -188,7 +194,7 @@ weaknesses:
 - **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
 - **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.
 
-### Claude-sonnet-4.5
+### Claude-Sonnet-4.5
 
 Final score: **40.7**
 
@@ -196,16 +202,34 @@ strengths:
 
 - **Concise & well-formatted output:** Most replies strictly follow the schema, stay within the 3-suggestion limit, and include clear, copy-paste-ready patches, making them easy to apply.
 - **Can spot headline bugs:** When a single, obvious regression is present (e.g. duplicated regex block, missing null-check, wrong macro name) the model often detects it and proposes an accurate, minimal fix.
-- **Scope discipline (usually):** It frequently restricts changes to newly-added lines and avoids broad refactors, so many answers comply with the “new code only / critical bugs only” rule.
+- **Scope discipline (usually):** It frequently restricts changes to newly-added lines and avoids broad refactors, so many answers comply with the "new code only / critical bugs only" rule.
 - **Reasonable explanations:** The accompanying rationales are typically short but precise, helping reviewers understand why the change is needed.
 
 weaknesses:
 
 - **Low recall of critical issues:** In a large fraction of examples the model misses the primary bug or flags nothing at all while other reviewers find clear problems. Coverage is therefore unreliable.
 - **False or harmful fixes:** A notable number of suggestions mis-diagnose the code, touch unchanged lines, violate task rules, or would break compilation/runtime (wrong paths, bad types, guideline-forbidden advice).
-- **Priority mistakes:** The model often downgrades severe defects to “general” or upgrades cosmetic nits to “critical”, showing weak bug-severity judgment.
+- **Priority mistakes:** The model often downgrades severe defects to "general" or upgrades cosmetic nits to "critical", showing weak bug-severity judgment.
 - **Inconsistent quality:** Performance swings widely between excellent and poor; reviewers cannot predict whether a given answer will be thorough, partial, or incorrect.
 
+
+### Claude-Haiku-4.5
+
+normalized score: **40.7**
+
+Strengths:
+
+- **Good format & clarity:** Consistently produces valid YAML and readable, minimally-intrusive patches with clear before/after snippets, so its outputs are easy to apply.
+- **Basic bug-spotting ability:** Often detects the most obvious new-line defect (e.g., syntax error, missing guard, wrong constant) and supplies a correct, concise fix; rarely ranks last in the set.
+- **Rule compliance in many cases:** Usually stays within the 3-suggestion limit, touches only '+' lines, and avoids speculative refactors—returning an empty list when no code was added.
+
+Weaknesses:
+
+- **Shallow coverage:** Frequently fixes just one surface-level issue and misses additional, higher-impact bugs that stronger reviewers catch, leaving regressions in place.
+- **Occasional incorrect or no-op patches:** A noticeable share of suggestions either leave code unchanged, contain invalid code, or introduce new errors, lowering trust.
+- **Guideline slips:** In several examples it edits unchanged lines, adds forbidden imports/version bumps, mis-labels severities, or supplies non-critical stylistic advice.
+- **Inconsistent diligence:** Roughly a quarter of the cases return an empty list despite real problems, while others duplicate existing PR changes, indicating weak diff comprehension.
+
 ### Claude-4 Sonnet (4096 thinking tokens)
 
 Final score: **39.7**

From 604c17348d1ccbbf5cef1ef028b74516428e77ec Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Thu, 16 Oct 2025 15:39:32 +0300
Subject: [PATCH 3/5] docs: remove unnecessary whitespace in index documentation

---
 docs/docs/pr_benchmark/index.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 1789d834..97a235a0 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -398,5 +398,3 @@ The PR benchmark dataset includes pull requests containing code in the following
 Pull requests may also include non-code files such as `YAML`, `JSON`, `Markdown`, `Dockerfile` ,`Shell`, etc. The benchmarked models should also analyze these files, as they commonly appear in real-world pull requests.
-
-

From a6e11f60ce8ded171cff1d64842de98e8890d2e8 Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Thu, 16 Oct 2025 15:49:44 +0300
Subject: [PATCH 4/5] docs: update Claude Haiku 4.5 benchmark score label from normalized to final

---
 docs/docs/pr_benchmark/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 97a235a0..30d0f120 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -215,7 +215,7 @@ weaknesses:
 
 ### Claude-Haiku-4.5
 
-normalized score: **40.7**
+Final score: **40.7**
 
 Strengths:
 

From 3dd373a77e7427500c334ff9bfb37d17fb731d34 Mon Sep 17 00:00:00 2001
From: Hussam Lawen
Date: Thu, 16 Oct 2025 17:38:34 +0300
Subject: [PATCH 5/5] Update pr_agent/settings/pr_reviewer_prompts.toml

Co-authored-by: qodo-merge-for-open-source[bot] <189517486+qodo-merge-for-open-source[bot]@users.noreply.github.com>
---
 pr_agent/settings/pr_reviewer_prompts.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pr_agent/settings/pr_reviewer_prompts.toml b/pr_agent/settings/pr_reviewer_prompts.toml
index f13e1aa2..477f8553 100644
--- a/pr_agent/settings/pr_reviewer_prompts.toml
+++ b/pr_agent/settings/pr_reviewer_prompts.toml
@@ -217,7 +217,7 @@ Ticket Description:
 #####
 {%- endif %}
 
-{%- if ticket.requirements is defined %}
+{%- if ticket.requirements is defined and ticket.requirements %}
 Ticket Requirements:
 #####
 {{ ticket.requirements }}
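
Reviewer note on the template guard in patches 1/5 and 5/5: the sketch below compares the three forms of the condition on a ticket that lacks `requirements`, one where it is an empty string, and one where it is populated. It assumes the prompt is rendered by Jinja2 with `StrictUndefined`; the series does not show the renderer, so that setting is an assumption, as are the helper name `render`, the `GUARDS` table, and the sample ticket dicts.

```python
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import UndefinedError

# Assumption: a strict environment, where truth-testing an undefined value raises.
env = Environment(undefined=StrictUndefined)

# The three guard variants that appear in the patch series.
GUARDS = {
    "original": "{%- if ticket.requirements %}",
    "patch_1": "{%- if ticket.requirements is defined %}",
    "patch_5": "{%- if ticket.requirements is defined and ticket.requirements %}",
}

# Simplified stand-in for the guarded template section (hypothetical, not the full prompt).
BODY = "\nTicket Requirements:\n#####\n{{ ticket.requirements }}\n{%- endif %}"


def render(guard_name, ticket):
    """Render the guarded section, or return 'UndefinedError' if rendering raises it."""
    template = env.from_string(GUARDS[guard_name] + BODY)
    try:
        return template.render(ticket=ticket)
    except UndefinedError:
        return "UndefinedError"


if __name__ == "__main__":
    tickets = {
        "missing": {"title": "T"},              # no 'requirements' key at all
        "empty": {"requirements": ""},          # key present but empty
        "filled": {"requirements": "- must log"},
    }
    for ticket_name, ticket in tickets.items():
        for guard_name in GUARDS:
            print(ticket_name, guard_name, repr(render(guard_name, ticket)))
```

Under these assumptions, the original guard fails on the ticket without `requirements`, the patch 1/5 guard renders an empty "Ticket Requirements:" section for the empty string, and the patch 5/5 guard skips the section in both cases because `and` short-circuits before the undefined value is truth-tested.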