More Latent Space episodes

The End of SWE-Bench Verified: Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Published 23 Feb 2026

Duration: 1572 seconds (about 26 minutes)

SWE-bench Verified, a coding benchmark, has run into task saturation, biased tasks, and overlap with training data, prompting a move to more advanced alternatives and a reevaluation of broader issues in AI coding evaluation.

Episode Description

Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog...

Overview

The podcast explores the creation and evolution of SWE-bench Verified, a coding benchmark developed by OpenAI to assess AI systems' ability to solve real-world coding problems derived from GitHub issues. Building it was a major undertaking involving careful curation and expert review, but the benchmark has since run into several problems: task saturation, contamination from training data that overlaps with the open-source repositories the tasks come from, and task design flaws that introduce bias and make evaluations less fair. These limitations have prompted a move toward more advanced benchmarks like SWE-bench Pro, which aim to assess deeper and more complex coding abilities.

The discussion also highlights broader issues in evaluating AI coding capabilities, such as the difficulty of measuring code quality and the need for more realistic, comprehensive benchmarks that reflect actual coding work. The podcast emphasizes the importance of transparency and collaboration in developing shared evaluation standards. Other topics include benchmark contamination in AI models, the tendency of established benchmarks to become outdated, and the need for future benchmarks to move beyond simple task completion and better align with real-world coding demands.

Recent Episodes of Latent Space

5 May 2026 Doing Vibe Physics Alex Lupsasca, OpenAI

AI is advancing theoretical physics by rapidly solving complex problems like quantum field theory calculations and simulating models such as SYK, though it still relies on human collaboration for original insights and contextual validation, reshaping research methodologies and education.

23 Apr 2026 AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)

The episode covers AI's evolving landscape: experimental agents potentially breaking containment by 2026, market disruptions from foundation models, infrastructure advancements like RAG, the debate between infrastructure and application firms, outsourcing strategies, the advantage of pre-2023 training data, competition among coding AI companies, and future trends in personalization and industry transformation amid scalability and quality challenges.
