diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 82266f81..136f17a2 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -34,6 +34,12 @@ A list of the models used for generating the baseline suggestions, and example r
+
+      GPT-5.2
+      2025-12-11
+      medium
+      80.8
+
       GPT-5-pro
       2025-10-06
@@ -183,6 +189,22 @@ A list of the models used for generating the baseline suggestions, and example r
 ## Results Analysis (Latest Additions)
 
+### GPT-5.2 ('medium' thinking budget)
+
+Final score: **80.8**
+
+Strengths:
+
+- **Broad, context-aware coverage:** Frequently identifies multiple high-impact faults in the added lines and proposes fixes that surpass or equal the best prior answer in many cases (≈60% of the 399 comparisons).
+- **Actionable, minimal patches:** Tends to supply concise before/after code snippets that compile and run, keep changes local, and respect the limits (≤3 suggestions, touched lines only), making the advice easy to apply.
+- **Clear reasoning & prioritisation:** Usually explains why an issue is critical, ranks it properly (e.g., crash > style), and avoids clutter, resulting in focused reviews that align with real test failures.
+
+Weaknesses:
+
+- **Critical omissions remain common:** In a sizeable minority of examples the model overlooks the single most blocking error (e.g., a compile-time break, nil-deref, or enum mismatch), causing it to trail a sharper peer answer.
+- **Occasional inaccurate or harmful fixes:** It sometimes introduces non-compiling code, speculative refactors, or misguided changes to unchanged lines, lowering reliability.
+- **Inconsistent guideline adherence:** A non-trivial set of replies adds off-scope edits, non-critical style nits, or empty suggestion lists when clear bugs exist, leading to avoidable downgrades.
+
 ### GPT-5-pro
 
 Final score: **73.4**