More Latent Space episodes

The End of SWE-Bench Verified Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Published 23 Feb 2026

Duration: 26 min 12 s (1572 seconds)

SWE-Bench Verified, a coding benchmark, has faced challenges such as task saturation, biased tasks, and overlap with training data, prompting the need for more advanced alternatives and a reevaluation of broader issues in AI coding evaluation.

Episode Description

Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog...

Overview

The podcast explores the creation and evolution of SWE-Bench Verified, a coding benchmark developed by OpenAI to assess AI systems' ability to solve real-world coding problems derived from GitHub issues. Building it was a major undertaking involving careful curation and expert review, but the benchmark has since run into several problems: task saturation, contamination from training data that overlaps with the open-source repositories the tasks are drawn from, and task-design issues that introduce bias and undermine the fairness of evaluations. These limitations have prompted a move toward more advanced benchmarks like SWE-Bench Pro, which aim to assess deeper and more complex coding abilities.

The discussion also highlights broader issues in evaluating AI coding capabilities, such as the difficulty of measuring code quality effectively and the need for more realistic and comprehensive benchmarks that reflect actual coding challenges. The podcast emphasizes the importance of transparency and collaboration in developing shared evaluation standards. Other topics include contamination within AI models, the tendency of traditional benchmarks to become outdated, and the need for future benchmarks to move beyond simple task completion and better align with real-world coding demands.

Recent Episodes of Latent Space

22 Jun 2026 Red-Teaming after Mythos Zico Kolter & Matt Fredrikson, Gray Swan

AI security challenges in large language models, such as data leakage and prompt injection, require adversarial testing, red teaming, tools like *Shade* and *Signal*, and structured frameworks to address integration risks, robustness gaps, and enterprise-specific security demands.

3 Jun 2026 Scaling Past Informal AI - Carina Hong, Axiom Math

Formal verification is positioned as a critical tool for advancing AI by ensuring system correctness through mathematical rigor, exemplified by Axiom Math's achievements, tools like Lean, challenges in AI generalization, and the vision of AI as a "superhuman mathematician" through verified reasoning.

3 Jun 2026 Satya Nadella: No Priors x Latent Space Crossover Special at Microsoft Build

Strategic AI development shifts to ecosystem-driven frameworks prioritizing value creation, covering Microsoft's rigorous model training, agent-driven workflow management, real-world impact challenges, innovative business models, inclusive AI participation, and redefining work through agentic systems.

2 Jun 2026 GitHub's plan for Agents Kyle Daigle, GitHub

Advanced AI integration in developer workflows leverages tools like GitHub Copilot and agentic systems to automate tasks and boost productivity, while addressing challenges like skill bloat, security, open-source trust issues, and the shift to modular AI capabilities in enterprise and collaborative environments.

1 Jun 2026 Why Video Agent models are next Ethan He, xAI Grok Imagine

The episode covers advancements in AI research through community-driven knowledge sharing, challenges in scaling video models, technical innovations like vision transformers and diffusion models, and the integration of language models in generative media, alongside hurdles in training efficiency and sustainable development.
