- benchmark verbose path: pass the on_progress callback, as the
non-verbose path already does (it was not passed at all)
- _train_new_agents: replace per-case asyncio.run() with a single
event loop (new_event_loop / run_until_complete / close) to avoid
creating and destroying a loop on every case iteration
- format_results_table: use case_index + 1 so the '#' column is
1-based, matching the failed-case output of _test_new_agents
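
For reference, a minimal sketch of the single-loop pattern used in
_train_new_agents. The names train_case and train_all are hypothetical
stand-ins, not the real functions; only the asyncio calls are taken
from the change itself.

```python
import asyncio


async def train_case(case_id: int) -> str:
    # Hypothetical stand-in for the per-case coroutine.
    await asyncio.sleep(0)
    return f"case-{case_id}"


def train_all(case_ids):
    # Create one loop up front instead of calling asyncio.run() per case.
    loop = asyncio.new_event_loop()
    try:
        # Each case still runs to completion before the next one starts.
        return [loop.run_until_complete(train_case(c)) for c in case_ids]
    finally:
        # Close the loop once, even if a case raises.
        loop.close()
```

Cases still run sequentially; only the loop setup and teardown are
shared across iterations.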

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>