mirror of
https://github.com/qodo-ai/pr-agent.git
synced 2025-12-11 18:35:18 +00:00
Merge pull request #2118 from qodo-ai/of/opus-4.5-benchmark
Add Claude Opus 4.5 to PR Benchmark
commit 5ec92b3535
1 changed file with 29 additions and 6 deletions
@@ -166,6 +166,12 @@ A list of the models used for generating the baseline suggestions, and example r
<td style="text-align:left;"></td>
<td style="text-align:center;"><b>32.4</b></td>
</tr>
<tr>
<td style="text-align:left;">Claude-opus-4.5</td>
<td style="text-align:left;">2025-11-01</td>
<td style="text-align:left;">high</td>
<td style="text-align:center;"><b>30.3</b></td>
</tr>
<tr>
<td style="text-align:left;">GPT-4.1</td>
<td style="text-align:left;">2025-04-14</td>
@@ -434,6 +440,23 @@ Weaknesses:
- **Limited breadth:** Even when it finds a real defect it rarely reports additional related problems that peers catch, leading to partial reviews.
- **Occasional guideline slips:** A few replies modify unchanged lines, suggest new imports, or duplicate suggestions, showing imperfect compliance with instructions.

### Claude-Opus-4.5 (high thinking budget)

Final score: **30.3**

Strengths:

- **High rule compliance & formatting:** Consistently produces valid YAML, respects the ≤3-suggestion limit, and usually confines edits to added lines, avoiding many guideline violations seen in peers.
- **Low false-positive rate:** Tends to stay silent unless convinced of a real problem; when the diff is a pure version bump / docs tweak it often (correctly) returns an empty list, beating noisier baselines.
- **Clear, focused patches when it fires:** In the minority of cases where it does spot a bug, it explains the issue crisply and supplies concise, copy-paste-able code snippets.

Weaknesses:

- **Very low recall:** In the vast majority of examples it misses obvious critical issues or suggests only a subset, frequently returning an empty list; this places it below most baselines on overall usefulness.
- **Shallow coverage:** Even when it catches a defect it typically lists a single point and overlooks other high-impact problems present in the same diff.
- **Occasional incorrect or incomplete fixes:** A non-trivial number of suggestions are wrong, compile-breaking, duplicate unchanged code, or touch out-of-scope lines, reducing trust.
- **Inconsistent severity tagging & duplication:** Sometimes mis-labels critical vs general, repeats the same suggestion, or leaves `improved_code` blocks empty.

## Appendix - Example Results

Some examples of benchmarked PRs and their results: