The podcast explores the creation and evolution of SWE-bench Verified, a coding benchmark developed by OpenAI to assess AI systems' ability to solve real-world coding problems drawn from GitHub issues. Building it was a major undertaking involving careful curation and expert review, yet over time the benchmark has run into several challenges: task saturation, contamination from overlap between its open-source repositories and model training data, and task-design flaws that introduce bias and undermine the fairness of evaluations. These limitations have prompted a move toward more advanced benchmarks such as SWE-bench Pro, which aim to assess deeper and more complex coding abilities.
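For listeners unfamiliar with how a SWE-bench-style task is scored, the rough idea is: check out the repository at a pinned commit, apply the model's proposed patch, and rerun the issue's tests. The sketch below is illustrative only, not the official harness; the field names (`fail_to_pass`, `pass_to_pass`) and the `pytest` invocation are assumptions about a typical Python-repo task.

```python
import subprocess

def evaluate_instance(repo_dir: str, model_patch: str,
                      fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply a model-generated patch to a checked-out repo and rerun the
    issue's tests. Returns True if the instance counts as resolved."""
    # Apply the candidate patch (assumes the repo is already checked out
    # at the task's pinned base commit).
    apply = subprocess.run(["git", "apply", "-"], input=model_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return False  # the patch does not even apply cleanly

    def tests_pass(test_ids: list[str]) -> bool:
        result = subprocess.run(["python", "-m", "pytest", *test_ids],
                                cwd=repo_dir, capture_output=True)
        return result.returncode == 0

    # Resolved = the tests that reproduced the issue now pass, and the
    # previously passing tests still pass (no regressions introduced).
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```

Scoring on pass/fail tests is what makes the benchmark automatable, but it is also the root of several problems discussed in the episode, since a patch can make tests pass without being good code.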
The discussion also highlights broader issues in evaluating AI coding capabilities, such as the difficulty of measuring code quality effectively and the need for more realistic, comprehensive benchmarks that reflect the challenges of actual software work. The hosts emphasize the importance of transparency and collaboration in developing shared evaluation standards. Other topics include training-data contamination, the tendency of static benchmarks to become outdated as models improve, and the need for future benchmarks to move beyond simple task completion toward better alignment with real-world coding demands.