# Qodo Merge Pull Request Benchmark

## Methodology

Qodo Merge PR Benchmark evaluates and compares the performance of two Large Language Models (LLMs) in analyzing pull request code and providing meaningful code suggestions.
Our diverse dataset comprises 400 pull requests from over 100 repositories, spanning various programming languages and frameworks to reflect real-world scenarios.

- For each pull request, two distinct LLMs process the same prompt using the Qodo Merge `improve` tool, each generating two sets of responses. The prompt for response generation can be found [here](https://github.com/qodo-ai/pr-agent/blob/main/pr_agent/settings/code_suggestions/pr_code_suggestions_prompts_not_decoupled.toml).

- Subsequently, a high-performing third model (an AI judge) evaluates the responses from the initial two models to determine the superior one. We utilize OpenAI's `o3` model as the judge, though other models have yielded consistent results. The prompt for this comparative judgment is available [here](https://github.com/Codium-ai/pr-agent-settings/tree/main/benchmark).

- We aggregate comparison outcomes across all the pull requests and calculate the win rate for each model. We also analyze the qualitative feedback (the "why" explanations from the judge) to identify each model's comparative strengths and weaknesses.
This approach yields both a quantitative score and a qualitative picture of how each model behaves.

- The final output is a "Model Card", comparing the evaluated model against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback.

Note that this benchmark focuses on quality: the ability of an LLM to process complex pull requests with multiple files and nuanced tasks and to produce high-quality code suggestions.
Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope.
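
The aggregation step described above can be illustrated with a minimal sketch. This is not the benchmark's actual code: the function name `win_rates`, the verdict labels, and the choice to exclude ties from the denominator are all assumptions made for illustration, since the handling of ties is not described here.

```python
from collections import Counter
from typing import Dict, Iterable


def win_rates(verdicts: Iterable[str], model_a: str, model_b: str) -> Dict[str, float]:
    """Return each model's share of decided comparisons, as a percentage.

    Assumption: any verdict that names neither model (e.g. "tie") is
    excluded from the denominator.
    """
    counts = Counter(verdicts)
    decided = counts[model_a] + counts[model_b]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return {
        model_a: round(100 * counts[model_a] / decided, 1),
        model_b: round(100 * counts[model_b] / decided, 1),
    }


# Hypothetical judge verdicts, one per pull request.
verdicts = ["model_a", "model_b", "model_a", "tie", "model_a"]
rates = win_rates(verdicts, "model_a", "model_b")
print(rates)
```

With this convention, the two win rates for a pair always sum to 100%, which matches the shape of the summary table in the next section.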

## TL;DR

Here's a summary of the win rates based on the benchmark:

[//]: # (| Model A                        | Model B                        | Model A Win Rate | Model B Win Rate |)

[//]: # (|:-------------------------------|:-------------------------------|:----------------:|:----------------:|)

[//]: # (| Gemini-2.5-pro-preview-05-06   | GPT-4.1                        |      70.4%       |      29.6%       |)

[//]: # (| Gemini-2.5-pro-preview-05-06   | Sonnet 3.7                     |      78.1%       |      21.9%       |)

[//]: # (| GPT-4.1                        | Sonnet 3.7                     |      61.0%       |      39.0%       |)

<table>
  <thead>
    <tr>
      <th style="text-align:left;">Model A</th>
      <th style="text-align:left;">Model B</th>
      <th style="text-align:center;">Model A Win Rate</th> <th style="text-align:center;">Model B Win Rate</th> </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align:left;">Gemini-2.5-pro-preview-05-06</td>
      <td style="text-align:left;">GPT-4.1</td>
      <td style="text-align:center; color: #1E8449;"><b>70.4%</b></td> <td style="text-align:center; color: #D8000C;"><b>29.6%</b></td> </tr>
    <tr>
      <td style="text-align:left;">Gemini-2.5-pro-preview-05-06</td>
      <td style="text-align:left;">Sonnet 3.7</td>
      <td style="text-align:center; color: #1E8449;"><b>78.1%</b></td> <td style="text-align:center; color: #D8000C;"><b>21.9%</b></td> </tr>
    <tr>
      <td style="text-align:left;">GPT-4.1</td>
      <td style="text-align:left;">Sonnet 3.7</td>
      <td style="text-align:center; color: #1E8449;"><b>61.0%</b></td> <td style="text-align:center; color: #D8000C;"><b>39.0%</b></td> </tr>
  </tbody>
</table>

## Gemini-2.5-pro-preview-05-06 - Model Card

### Comparison against GPT-4.1

![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

#### Analysis Summary

Model 'Gemini-2.5-pro-preview-05-06' is generally more useful thanks to wider and more accurate bug detection and concrete patches, but it sacrifices compliance discipline and sometimes oversteps the task rules. Model 'GPT-4.1' is safer and highly rule-abiding, yet often too timid, missing many genuine issues and providing limited insight. An ideal reviewer would combine the restraint of 'GPT-4.1' with the thoroughness of 'Gemini-2.5-pro-preview-05-06'.

#### Detailed Analysis

Gemini-2.5-pro-preview-05-06 strengths:  

- better_bug_coverage: Detects and explains more critical issues, winning in ~70 % of comparisons and achieving a higher average score.  
- actionable_fixes: Supplies clear code snippets, correct language labels, and often multiple coherent suggestions per diff.  
- deeper_reasoning: Shows stronger grasp of logic, edge cases, and cross-file implications, leading to broader, high-impact reviews.  

Gemini-2.5-pro-preview-05-06 weaknesses:  

- guideline_violations: More prone to over-eager advice—non-critical tweaks, touching unchanged code, suggesting new imports, or minor format errors.  
- occasional_overreach: Some fixes are speculative or risky, potentially introducing new bugs.  
- redundant_or_duplicate: At times repeats the same point or exceeds the required brevity.  


### Comparison against Sonnet 3.7

![Comparison](https://codium.ai/images/qodo_merge_benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

#### Analysis Summary

Model 'Gemini-2.5-pro-preview-05-06' is the stronger reviewer: it more frequently identifies genuine, high-impact bugs and provides well-formed, actionable fixes. Model 'Sonnet 3.7' is safer against false positives and tends to be concise, but it often misses important defects or offers low-value or incorrect suggestions.

See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md)


#### Detailed Analysis

Gemini-2.5-pro-preview-05-06 strengths:  

- higher_accuracy_and_coverage: finds real critical bugs and supplies actionable patches in most examples (better in 78 % of cases).  
- guideline_awareness: usually respects new-lines-only scope, ≤3 suggestions, proper YAML, and stays silent when no issues exist.  
- detailed_reasoning_and_patches: explanations tie directly to the diff and fixes are concrete, often catching multiple related defects that 'Sonnet 3.7' overlooks.

Gemini-2.5-pro-preview-05-06 weaknesses:  

- occasional_rule_violations: sometimes proposes new imports, package-version changes, or edits outside the added lines.  
- overzealous_suggestions: may add speculative or stylistic fixes that exceed the “critical” scope, or mis-label severity.  
- sporadic_technical_slips: a few patches contain minor coding errors, oversized snippets, or duplicate/contradicting advice.

## GPT-4.1 - Model Card

### Comparison against Sonnet 3.7

![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.png){width=768}

#### Analysis Summary

Model 'GPT-4.1' is safer and more compliant, preferring silence over speculation, which yields fewer rule breaches and false positives but misses some real bugs.  
Model 'Sonnet 3.7' is more adventurous and often uncovers important issues that 'GPT-4.1' ignores, yet its aggressive style leads to frequent guideline violations and a higher proportion of incorrect or non-critical advice. 

See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.md)


#### Detailed Analysis

GPT-4.1 strengths:

- Strong guideline adherence: usually stays strictly on `+` lines, avoids non-critical or stylistic advice, and rarely suggests forbidden imports; often outputs an empty list when no real bug exists.  
- Lower false-positive rate: suggestions are more accurate and seldom introduce new bugs; fixes compile more reliably.  
- Good schema discipline: YAML is almost always well-formed and fields are populated correctly.  

GPT-4.1 weaknesses:

- Misses bugs: often returns an empty list even when a clear critical issue is present, so coverage is narrower.  
- Sparse feedback: when it does comment, it tends to give fewer suggestions and sometimes lacks depth or completeness.  
- Occasional metadata slip-ups (wrong language tags, overly broad code spans), though less harmful than Sonnet 3.7 errors.  

### Comparison against Gemini-2.5-pro-preview-05-06

![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

#### Analysis Summary

Model 'Gemini-2.5-pro-preview-05-06' is generally more useful thanks to wider and more accurate bug detection and concrete patches, but it sacrifices compliance discipline and sometimes oversteps the task rules. Model 'GPT-4.1' is safer and highly rule-abiding, yet often too timid, missing many genuine issues and providing limited insight. An ideal reviewer would combine the restraint of 'GPT-4.1' with the thoroughness of 'Gemini-2.5-pro-preview-05-06'.

#### Detailed Analysis

GPT-4.1 strengths:

- strict_compliance: Usually sticks to the “critical bugs only / new ‘+’ lines only” rule, so outputs rarely violate task constraints.  
- low_risk: Conservative behaviour avoids harmful or speculative fixes; safer when no obvious issue exists.  
- concise_formatting: Tends to produce minimal, correctly-structured YAML without extra noise.  

GPT-4.1 weaknesses:

- under_detection: Frequently returns an empty list even when real bugs are present, losing ~70 % of comparisons.  
- shallow_analysis: When it does suggest fixes, coverage is narrow and technical depth is limited, sometimes with wrong language tags or minor format slips.  
- occasional_inaccuracy: A few suggestions are unfounded or duplicate, and rare guideline breaches (e.g., import advice) still occur.  


## Sonnet 3.7 - Model Card

### Comparison against GPT-4.1

![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.png){width=768}

#### Analysis Summary

Model 'GPT-4.1' is safer and more compliant, preferring silence over speculation, which yields fewer rule breaches and false positives but misses some real bugs.  
Model 'Sonnet 3.7' is more adventurous and often uncovers important issues that 'GPT-4.1' ignores, yet its aggressive style leads to frequent guideline violations and a higher proportion of incorrect or non-critical advice. 

See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.md)

#### Detailed Analysis

'Sonnet 3.7' strengths:

- Better bug discovery breadth: more willing to dive into logic and spot critical problems that 'GPT-4.1' overlooks; often supplies multiple, detailed fixes.  
- Richer explanations & patches: gives fuller context and, when correct, proposes more functional or user-friendly solutions.  
- Generally correct language/context tagging and targeted code snippets.  

'Sonnet 3.7' weaknesses:

- Guideline violations: frequently flags non-critical issues, edits untouched code, or recommends adding imports, breaching task rules.  
- Higher error rate: suggestions are more speculative and sometimes introduce new defects or duplicate work already done.  
- Occasional schema or formatting mistakes (missing list value, duplicated suggestions), reducing reliability.  


### Comparison against Gemini-2.5-pro-preview-05-06

![Comparison](https://codium.ai/images/qodo_merge_benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

#### Analysis Summary

Model 'Gemini-2.5-pro-preview-05-06' is the stronger reviewer: it more frequently identifies genuine, high-impact bugs and provides well-formed, actionable fixes. Model 'Sonnet 3.7' is safer against false positives and tends to be concise, but it often misses important defects or offers low-value or incorrect suggestions.

See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md)
-												Rename code fine-tuning benchmark to pull request benchmark and update model references

											
										
										
											2025-04-15 16:40:36 +00:00
+								# Qodo Merge Pull Request Benchmark
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
+								## Methodology
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add benchmark methodology and improve model comparison formatting

											
										
										
											2025-05-13 05:39:19 +00:00
+								Qodo Merge PR Benchmark evaluates and compares the performance of two Large Language Models (LLMs) in analyzing pull request code and providing meaningful code suggestions.
 								Our diverse dataset comprises of 400 pull requests from over 100 repositories, spanning various programming languages and frameworks to reflect real-world scenarios.
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add benchmark methodology and improve model comparison formatting

											
										
										
											2025-05-13 05:39:19 +00:00
+								- For each pull request, two distinct LLMs process the same prompt using the Qodo Merge `improve` tool, each generating two sets of responses. The prompt for response generation can be found [here](https://github.com/qodo-ai/pr-agent/blob/main/pr_agent/settings/code_suggestions/pr_code_suggestions_prompts_not_decoupled.toml).
 								- Subsequently, a high-performing third model (an AI judge) evaluates the responses from the initial two models to determine the superior one. We utilize OpenAI's `o3` model as the judge, though other models have yielded consistent results. The prompt for this comparative judgment is available [here](https://github.com/Codium-ai/pr-agent-settings/tree/main/benchmark).
 								- We aggregate comparison outcomes across all the pull requests, calculating the win rate for each model. We also analyze the qualitative feedback (the "why" explanations from the judge) to identify each model's comparative strengths and weaknesses.
 								This approach provides not just a quantitative score but also a detailed analysis of each model's strengths and weaknesses.
 								- The final output is a "Model Card", comparing the evaluated model against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback.
 								Note that this benchmark focuses on quality: the ability of an LLM to process complex pull request with multiple files and nuanced task to produce high-quality code suggestions.
 								Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope.
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								## TL;DR
 								Here's a summary of the win rates based on the benchmark:
-												docs: enhance benchmark table with colored win rates and improve comparison headings

											
										
										
											2025-05-13 06:05:07 +00:00
+								[//]: # (| Model A                        | Model B                        | Model A Win Rate | Model B Win Rate |)
 								[//]: # (|:-------------------------------|:-------------------------------|:----------------:|:----------------:|)
 								[//]: # (| Gemini-2.5-pro-preview-05-06   | GPT-4.1                        |      70.4%       |      29.6%       |)
 								[//]: # (| Gemini-2.5-pro-preview-05-06   | Sonnet 3.7                     |      78.1%       |      21.9%       |)
 								[//]: # (| GPT-4.1                        | Sonnet 3.7                     |      61.0%       |      39.0%       |)
 								<table>
 								  <thead>
 								    <tr>
 								      <th style="text-align:left;">Model A</th>
 								      <th style="text-align:left;">Model B</th>
 								      <th style="text-align:center;">Model A Win Rate</th> <th style="text-align:center;">Model B Win Rate</th> </tr>
 								  </thead>
 								  <tbody>
 								    <tr>
 								      <td style="text-align:left;">Gemini-2.5-pro-preview-05-06</td>
 								      <td style="text-align:left;">GPT-4.1</td>
 								      <td style="text-align:center; color: #1E8449;"><b>70.4%</b></td> <td style="text-align:center; color: #D8000C;"><b>29.6%</b></td> </tr>
 								    <tr>
 								      <td style="text-align:left;">Gemini-2.5-pro-preview-05-06</td>
 								      <td style="text-align:left;">Sonnet 3.7</td>
 								      <td style="text-align:center; color: #1E8449;"><b>78.1%</b></td> <td style="text-align:center; color: #D8000C;"><b>21.9%</b></td> </tr>
 								    <tr>
 								      <td style="text-align:left;">GPT-4.1</td>
 								      <td style="text-align:left;">Sonnet 3.7</td>
 								      <td style="text-align:center; color: #1E8449;"><b>61.0%</b></td> <td style="text-align:center; color: #D8000C;"><b>39.0%</b></td> </tr>
 								  </tbody>
 								</table>
-												docs: add benchmark methodology and improve model comparison formatting

											
										
										
											2025-05-13 05:39:19 +00:00
 								## Gemini-2.5-pro-preview-05-06 - Model Card
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: enhance benchmark table with colored win rates and improve comparison headings

											
										
										
											2025-05-13 06:05:07 +00:00
+								### Comparison against GPT-4.1
-												docs: add Gemini-2.5-pro-preview vs GPT-4.1 benchmark comparison

											
										
										
											2025-05-12 06:53:59 +00:00
 								![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}
 								#### Analysis Summary
 								Model 'Gemini-2.5-pro-preview-05-06' is generally more useful thanks to wider and more accurate bug detection and concrete patches, but it sacrifices compliance discipline and sometimes oversteps the task rules. Model 'GPT-4.1' is safer and highly rule-abiding, yet often too timid—missing many genuine issues and providing limited insight. An ideal reviewer would combine 'GPT-4.1’ restraint with 'Gemini-2.5-pro-preview-05-06' thoroughness.
-												docs: add benchmark methodology and improve model comparison formatting

											
										
										
											2025-05-13 05:39:19 +00:00
+								#### Detailed Analysis
-												docs: add Gemini-2.5-pro-preview vs GPT-4.1 benchmark comparison

											
										
										
											2025-05-12 06:53:59 +00:00
-												docs: improve model comparison headings in benchmark documentation

											
										
										
											2025-05-13 06:07:46 +00:00
+								Gemini-2.5-pro-preview-05-06 strengths:
-												docs: add Gemini-2.5-pro-preview vs GPT-4.1 benchmark comparison

											
										
										
											2025-05-12 06:53:59 +00:00
 								- better_bug_coverage: Detects and explains more critical issues, winning in ~70 % of comparisons and achieving a higher average score.
 								- actionable_fixes: Supplies clear code snippets, correct language labels, and often multiple coherent suggestions per diff.
 								- deeper_reasoning: Shows stronger grasp of logic, edge cases, and cross-file implications, leading to broader, high-impact reviews.
-												docs: improve model comparison headings in benchmark documentation

											
										
										
											2025-05-13 06:07:46 +00:00
+								Gemini-2.5-pro-preview-05-06 weaknesses:
-												docs: add Gemini-2.5-pro-preview vs GPT-4.1 benchmark comparison

											
										
										
											2025-05-12 06:53:59 +00:00
 								- guideline_violations: More prone to over-eager advice—non-critical tweaks, touching unchanged code, suggesting new imports, or minor format errors.
 								- occasional_overreach: Some fixes are speculative or risky, potentially introducing new bugs.
 								- redundant_or_duplicate: At times repeats the same point or exceeds the required brevity.
-												docs: enhance benchmark table with colored win rates and improve comparison headings

											
										
										
											2025-05-13 06:05:07 +00:00
+								### Comparison against Sonnet 3.7
-												Fix all markdownlint violations

											
										
										
											2024-06-01 01:09:41 +00:00
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
+								![Comparison](https://codium.ai/images/qodo_merge_benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
+								#### Analysis Summary
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
+								Model 'Gemini-2.5-pro-preview-05-06' is the stronger reviewer—more frequently identifies genuine, high-impact bugs and provides well-formed, actionable fixes. Model 'Sonnet 3.7' is safer against false positives and tends to be concise but often misses important defects or offers low-value or incorrect suggestions.
-												Optimize for fine-tuning impact table

											
										
										
											2024-06-01 01:11:34 +00:00
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
+								See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md)
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add benchmark methodology and improve model comparison formatting

											
										
										
											2025-05-13 05:39:19 +00:00
+								#### Detailed Analysis
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: improve model comparison headings in benchmark documentation

											
										
										
											2025-05-13 06:07:46 +00:00
+								Gemini-2.5-pro-preview-05-06 strengths:
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
+								- higher_accuracy_and_coverage: finds real critical bugs and supplies actionable patches in most examples (better in 78 % of cases).
 								- guideline_awareness: usually respects new-lines-only scope, ≤3 suggestions, proper YAML, and stays silent when no issues exist.
 								- detailed_reasoning_and_patches: explanations tie directly to the diff and fixes are concrete, often catching multiple related defects that 'Sonnet 3.7' overlooks.
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: improve model comparison headings in benchmark documentation

											
										
										
											2025-05-13 06:07:46 +00:00
+								Gemini-2.5-pro-preview-05-06 weaknesses:
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
+								- occasional_rule_violations: sometimes proposes new imports, package-version changes, or edits outside the added lines.
 								- overzealous_suggestions: may add speculative or stylistic fixes that exceed the “critical” scope, or mis-label severity.
 								- sporadic_technical_slips: a few patches contain minor coding errors, oversized snippets, or duplicate/contradicting advice.
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-06-02 08:28:48 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								## GPT-4.1 - Model Card
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-06-02 08:28:48 +00:00
-												docs: enhance benchmark table with colored win rates and improve comparison headings

											
										
										
											2025-05-13 06:05:07 +00:00
+								### Comparison against Sonnet 3.7
-												Add documentation for PR-Agent code fine-tuning benchmark and update mkdocs.yml

											
										
										
											2024-05-31 13:09:34 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.png){width=768}
-												Fix all markdownlint violations

											
										
										
											2024-06-01 01:09:41 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								#### Analysis Summary
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								Model 'GPT-4.1' is safer and more compliant, preferring silence over speculation, which yields fewer rule breaches and false positives but misses some real bugs.
 								Model 'Sonnet 3.7' is more adventurous and often uncovers important issues that 'GPT-4.1' ignores, yet its aggressive style leads to frequent guideline violations and a higher proportion of incorrect or non-critical advice.
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.md)
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								#### Detailed Analysis
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												docs: improve model comparison headings in benchmark documentation

											
										
										
											2025-05-13 06:07:46 +00:00
+								GPT-4.1 strengths:
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								- Strong guideline adherence: usually stays strictly on `+` lines, avoids non-critical or stylistic advice, and rarely suggests forbidden imports; often outputs an empty list when no real bug exists.
 								- Lower false-positive rate: suggestions are more accurate and seldom introduce new bugs; fixes compile more reliably.
 								- Good schema discipline: YAML is almost always well-formed and fields are populated correctly.
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												docs: improve model comparison headings in benchmark documentation

											
										
										
											2025-05-13 06:07:46 +00:00
+								GPT-4.1 weaknesses:
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								- Misses bugs: often returns an empty list even when a clear critical issue is present, so coverage is narrower.
 								- Sparse feedback: when it does comment, it tends to give fewer suggestions and sometimes lacks depth or completeness.
 								- Occasional metadata/slip-ups (wrong language tags, overly broad code spans), though less harmful than Sonnet 3.7 errors.
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												docs: enhance benchmark table with colored win rates and improve comparison headings

											
										
										
											2025-05-13 06:05:07 +00:00
+								### Comparison against Gemini-2.5-pro-preview-05-06
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								#### Analysis Summary
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								Model 'Gemini-2.5-pro-preview-05-06' is generally more useful thanks to wider and more accurate bug detection and concrete patches, but it sacrifices compliance discipline and sometimes oversteps the task rules. Model 'GPT-4.1' is safer and highly rule-abiding, yet often too timid—missing many genuine issues and providing limited insight. An ideal reviewer would combine 'GPT-4.1’ restraint with 'Gemini-2.5-pro-preview-05-06' thoroughness.
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
											2025-05-12 05:11:57 +00:00
-												s

											
										
										
											2025-05-13 05:53:03 +00:00
+								#### Detailed Analysis
-												docs: add Gemini-2.5-pro-preview model comparison to benchmark documentation

											
										
										
GPT-4.1 strengths:

- strict_compliance: Usually sticks to the “critical bugs only / new ‘+’ lines only” rule, so outputs rarely violate task constraints.
- low_risk: Conservative behaviour avoids harmful or speculative fixes; safer when no obvious issue exists.
- concise_formatting: Tends to produce minimal, correctly structured YAML without extra noise.

											
										
										
GPT-4.1 weaknesses:

- under_detection: Frequently returns an empty list even when real bugs are present, missing them ~70% of the time.
- shallow_analysis: When it does suggest fixes, coverage is narrow and technical depth is limited, sometimes with wrong language tags or minor format slips.
- occasional_inaccuracy: A few suggestions are unfounded or duplicated, and rare guideline breaches (e.g., import advice) still occur.

											
										
										
## Sonnet 3.7 - Model Card

											
										
										
### Comparison against GPT-4.1

											
										
										
![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.png){width=768}

											
										
										
#### Analysis Summary

											
										
										
Model 'GPT-4.1' is safer and more compliant, preferring silence over speculation, which yields fewer rule breaches and false positives but misses some real bugs.
Model 'Sonnet 3.7' is more adventurous and often uncovers important issues that 'GPT-4.1' ignores, yet its aggressive style leads to frequent guideline violations and a higher proportion of incorrect or non-critical advice.

											
										
										
See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.md)

											
										
										
#### Detailed Analysis

											
										
										
'Sonnet 3.7' strengths:

- Better bug discovery breadth: more willing to dive into logic and spot critical problems that 'GPT-4.1' overlooks; often supplies multiple, detailed fixes.
- Richer explanations & patches: gives fuller context and, when correct, proposes more functional or user-friendly solutions.
- Generally correct language/context tagging and targeted code snippets.

											
										
										
'Sonnet 3.7' weaknesses:

- Guideline violations: frequently flags non-critical issues, edits untouched code, or recommends adding imports, breaching task rules.
- Higher error rate: suggestions are more speculative and sometimes introduce new defects or duplicate work already done.
- Occasional schema or formatting mistakes (a missing list value, duplicated suggestions), which reduce reliability.

											
										
										
### Comparison against Gemini-2.5-pro-preview-05-06

											
										
										
![Comparison](https://codium.ai/images/qodo_merge_benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

											
										
										
#### Analysis Summary

											
										
										
Model 'Gemini-2.5-pro-preview-05-06' is the stronger reviewer: it more frequently identifies genuine, high-impact bugs and provides well-formed, actionable fixes. Model 'Sonnet 3.7' is safer against false positives and tends to be concise, but it often misses important defects or offers low-value or incorrect suggestions.

											
										
										
See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md)