mirror of https://github.com/qodo-ai/pr-agent.git (synced 2025-12-12 02:45:18 +00:00)
docs: add GPT-5-pro benchmark results to PR benchmark documentation
parent 40ff5db659
commit f88b7ffb4d
1 changed file with 25 additions and 1 deletion
@@ -34,6 +34,12 @@ A list of the models used for generating the baseline suggestions, and example r
 </tr>
 </thead>
 <tbody>
+<tr>
+<td style="text-align:left;">GPT-5-pro</td>
+<td style="text-align:left;">2025-10-06</td>
+<td style="text-align:left;"></td>
+<td style="text-align:center;"><b>73.4</b></td>
+</tr>
 <tr>
 <td style="text-align:left;">GPT-5</td>
 <td style="text-align:left;">2025-08-07</td>
@@ -153,6 +159,24 @@ A list of the models used for generating the baseline suggestions, and example r
 
 ## Results Analysis
 
+### GPT-5-pro
+
+Final score: **73.4**
+
+Strengths:
+
+- **High bug-finding accuracy and depth:** In many cases the model uncovers the core compile-time or run-time regression that other answers miss and frequently combines several distinct critical issues into one reply.
+- **Actionable, minimal patches:** Suggestions almost always include clear before/after code blocks that touch only the added lines and respect the ≤3-suggestion limit, making them easy to apply.
+- **Good guideline compliance:** The model generally honours the task rules (no edits to unchanged code, no version bumps, no more than three items) and shows solid judgment about when an empty list is appropriate.
+- **Concise, impact-oriented reasoning:** Explanations focus on severity, crash potential and build breakage rather than style, helping reviewers prioritise fixes.
+
+Weaknesses:
+
+- **Coverage gaps:** In a noticeable minority of examples the model misses a higher-impact defect that several other answers catch, or returns an empty list despite clear bugs.
+- **Occasional incorrect or harmful fixes:** A few replies introduce new errors or rest on wrong assumptions about functionality or language-specific behavior.
+- **Formatting / guideline slips:** Sporadic duplication of suggestions, missing or empty `improved_code` blocks, or YAML mishaps undermine otherwise good answers.
+- **Uneven criticality judgement:** Some suggestions drift into low-impact territory while overlooking more severe problems, indicating inconsistent prioritisation.
+
 ### O3
 
 Final score: **62.5**
@@ -387,7 +411,7 @@ Strengths:
 
 - **Consistent format & guideline obedience:** Output is almost always valid YAML, within the 3-suggestion limit, and rarely touches lines not prefixed with "+".
 - **Low false-positive rate:** When no real defect exists, the model correctly returns an empty list instead of inventing speculative fixes, avoiding the "noise" many baseline answers add.
-- **Clear, concise patches when it does act:** In the minority of cases where it detects a bug (e.g., ex-13, 46, 212), the fix is usually correct, minimal, and easy to apply.
+- **Clear, concise patches when it does act:** In the minority of cases where it detects a bug, the fix is usually correct, minimal, and easy to apply.
 
 Weaknesses:
