diginoron - digital transformation PRO
Recent Activity
replied to ibragim-bad's post 10 days ago
We tested Qwen3-Coder, GPT-5, and 30+ other models on new SWE-bench-like tasks from July 2025!
Hi all, I’m Ibragim from Nebius.
We ran a benchmark on 34 fresh GitHub PR tasks from July 2025 using the SWE-rebench leaderboard (https://swe-rebench.com/leaderboard). These are real, recent problems with no training-set contamination, and the evaluation covers both proprietary and open-source models.
Quick takeaways:
> GPT-5-Medium leads overall (29.4% resolved rate, 38.2% pass@5).
> Qwen3-Coder is the best open-source performer, matching GPT-5-High in pass@5 (32.4%) despite a lower resolved rate.
> Claude Sonnet 4.0 lags behind in pass@5 at 23.5%.
All tasks come from nebius/SWE-rebench-leaderboard, a continuously updated, decontaminated source of real-world SWE tasks.
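
To make the two metrics in the takeaways concrete, here is a minimal Python sketch. It assumes each task is attempted five times, that the resolved rate is the per-run success fraction averaged over tasks, and that pass@5 is the share of tasks solved in at least one of the five runs. The post does not spell out these definitions, so treat them as assumptions. The task names and outcomes in `toy_runs` are hypothetical and are not leaderboard data.

```python
from typing import Dict, List


def resolved_rate(runs: Dict[str, List[bool]]) -> float:
    """Per-run success fraction, averaged over tasks (assumed definition)."""
    return sum(sum(r) / len(r) for r in runs.values()) / len(runs)


def pass_at_k(runs: Dict[str, List[bool]], k: int = 5) -> float:
    """Share of tasks solved in at least one of the first k runs (assumed definition)."""
    return sum(any(r[:k]) for r in runs.values()) / len(runs)


def tasks_from_percent(percent: float, n_tasks: int = 34) -> int:
    """Convert a reported percentage back to a whole number of tasks."""
    return round(percent / 100 * n_tasks)


# Hypothetical toy outcomes: 4 tasks, 5 runs each (True = task resolved).
toy_runs = {
    "task-a": [True, True, False, True, True],
    "task-b": [False, False, False, False, False],
    "task-c": [False, True, False, False, False],
    "task-d": [False, False, False, False, False],
}

print(f"resolved rate: {resolved_rate(toy_runs):.2f}")
print(f"pass@5:        {pass_at_k(toy_runs):.2f}")

# With 34 tasks, the pass@5 figures from the post map to whole task counts.
for label, pct in [("GPT-5-Medium", 38.2), ("Qwen3-Coder", 32.4), ("Claude Sonnet 4.0", 23.5)]:
    print(f"{label}: {pct}% of 34 is about {tasks_from_percent(pct)} tasks")
```

Under these assumptions, a model can match another on pass@5 while trailing on resolved rate: it solves the same number of distinct tasks at least once, but succeeds less consistently across repeated runs.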