The podcast emphasizes the critical need for AI safety, focusing on the difficulty of defining and recording AI-related incidents. It highlights the AI Incident Database, which compiles more than 5,000 annotated reports of AI failures so that similar issues can be recognized and prevented, an approach inspired by safety practices in other industries. The discussion addresses the shortcomings of current benchmarking methods, the benefits of third-party audits, and the risks that arise from improper AI system configurations.
The podcast also underscores the importance of distinguishing intentional from unintentional failures in AI systems, and the role of statistical validation in detecting broader systemic weaknesses. It calls for standardized reporting tools and procedures to strengthen AI safety. Finally, it draws on the Generative Red Team Challenge at DEF CON, where structured testing by hackers exposed significant security flaws in model design and integration processes.