The End of SWE-Bench Verified: Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Published 23 Feb 2026

Duration: 1572 seconds (26 min 12 s)

SWE-Bench Verified, a coding benchmark, has run into task saturation, biased tasks, and contamination from overlapping training data, prompting a move toward more advanced alternatives and a reevaluation of broader issues in AI coding evaluation.

Episode Description

Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog...

Overview

The podcast explores the creation and evolution of SWE-Bench Verified, a coding benchmark developed by OpenAI to assess AI systems' ability to solve real-world coding problems derived from GitHub issues. Building it was a major undertaking that involved careful curation and expert review, but the benchmark has since run into several problems: task saturation, contamination from training data that overlaps with the open-source repositories the tasks are drawn from, and task-design flaws that introduce bias and make evaluations less fair. These limitations have prompted a move toward more advanced benchmarks like SWE-Bench Pro, which aim to assess deeper and more complex coding abilities.

The discussion also highlights broader issues in evaluating AI coding capabilities, such as the difficulty of measuring code quality effectively and the need for more realistic and comprehensive benchmarks that reflect actual coding challenges. The podcast emphasizes the importance of transparency and collaboration in developing shared evaluation standards. Other topics include benchmark contamination in AI models, the tendency of established benchmarks to become outdated, and the need for future benchmarks to move beyond simple task completion and align more closely with real-world coding demands.
