Mirror of https://github.com/qodo-ai/pr-agent.git
docs: update evaluation dataset size in finetuning benchmark documentation
Commit bb24a6f43d (parent d3f83f3069)
1 changed file with 1 addition and 1 deletion
```diff
@@ -68,7 +68,7 @@ Here are the prompts, and example outputs, used as input-output pairs to fine-tu
 
 ### Evaluation dataset
 
-- For each tool, we aggregated 100 additional examples to be used for evaluation. These examples were not used in the training dataset, and were manually selected to represent diverse real-world use-cases.
+- For each tool, we aggregated 200 additional examples to be used for evaluation. These examples were not used in the training dataset, and were manually selected to represent diverse real-world use-cases.
 - For each test example, we generated two responses: one from the fine-tuned model, and one from the best code model in the world, `gpt-4-turbo-2024-04-09`.
 
 - We used a third LLM to judge which response better answers the prompt, and will likely be perceived by a human as better response.
```
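For context, the changed documentation describes a pairwise LLM-as-judge evaluation: each test prompt gets one response from the fine-tuned model and one from the baseline, and a third model picks the better one. The sketch below is not part of this commit or of the pr-agent codebase; it is a minimal illustration of that comparison step, with a hypothetical `call_llm` hook standing in for whatever judge-model API is actually used.

```python
# Illustrative sketch of a pairwise LLM-as-judge comparison.
# `call_llm` is a hypothetical callable supplied by the reader that sends a
# prompt to the judge model and returns its text reply.
import random
from typing import Callable

JUDGE_PROMPT = """You are given a prompt and two candidate responses.
Decide which response better answers the prompt and would likely be
perceived by a human as the better response. Answer with "A" or "B".

Prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}
"""


def judge_pair(prompt: str,
               fine_tuned_response: str,
               baseline_response: str,
               call_llm: Callable[[str], str]) -> str:
    """Return 'fine_tuned' or 'baseline' according to the judge model.

    The two responses are shuffled into positions A/B so the judge's
    verdict is not systematically biased toward one position.
    """
    candidates = [("fine_tuned", fine_tuned_response),
                  ("baseline", baseline_response)]
    random.shuffle(candidates)

    judge_input = JUDGE_PROMPT.format(prompt=prompt,
                                      response_a=candidates[0][1],
                                      response_b=candidates[1][1])
    verdict = call_llm(judge_input).strip().upper()

    # Map the judge's "A"/"B" answer back to the shuffled candidates.
    return candidates[0][0] if verdict.startswith("A") else candidates[1][0]
```

Aggregating `judge_pair` over the 200 held-out examples per tool yields the win rate of the fine-tuned model against the baseline.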