

Reading model benchmarks like a pro, Mythos is looming, and Claude talk caveman, save big token

Published 10 Apr 2026

Duration: 00:30:45

Anthropic's Claude Mythos drives AI advancements amid cybersecurity concerns and an escalating arms race, with Project Glasswing using it to proactively detect software flaws. The discussion also covers evaluation challenges, open-source trends, edge deployment, user-friendly interfaces, and AI's role in real-world problem-solving.

Episode Description

Is the secret to slashing your token costs by 65% forcing your LLM to speak like a caveman? This week on the Friday Deploy, Andrew and Ben test out a...

Overview

The podcast discusses advancements in AI, focusing on Anthropic's Claude Mythos, a model raising cybersecurity concerns because it could be misused to identify software vulnerabilities, such as those in FFmpeg. The hosts describe an escalating arms race between AI-powered attackers and defenders, with Anthropic partnering on the Project Glasswing initiative to proactively detect software flaws using AI, backed by a $100 million investment in computational resources. Strategic motivations for the initiative include redressing the imbalance between attackers and defenders and expanding infrastructure for large models like Mythos, which demand significant compute. The episode weighs AI-driven security risks against AI's potential to fix existing flaws, and notes industry leaders' calls for collaborative efforts to address global challenges through AI.

The episode also explores benchmarking AI models, critiquing outdated metrics and advocating for benchmarks that cleanly separate AI from human performance, such as ARC AGI 3, on which even top models score under 1% of what humans achieve. Human intuition and creativity in areas like game design and humor are highlighted as uniquely complex traits that are hard to replicate in AI, prompting ideas such as puzzle-gated communities that leverage human intuition. Recent open-source models like Gemma 4 and Hollow 3 are discussed for their cost efficiency and accessibility, with trends pointing toward local AI deployment on affordable hardware. The hosts note the commoditization of AI capabilities and a growing shift toward self-hosted models, reducing reliance on expensive cloud-based solutions.

Innovations like the "Caveman Plugin" for simplifying AI outputs are reviewed, balancing reduced token costs against risks of degraded reasoning. The discussion also touches on practical applications of AI agents in hackathons and conferences, such as automating task management and fine-tuning models for specific challenges. Challenges include adapting operating systems to better integrate AI agents and refining workflows to use specialized models for different tasks. The episode emphasizes the need for frameworks like Apex to measure AI's impact on engineering productivity, alongside calls for user-centric design to make AI interactions more efficient without compromising clarity or functionality.

Recent Episodes of Dev Interrupted

24 Mar 2026 Why AI-assisted PRs merge at half the rate of human code | LinearB's 2026 Benchmarks

The 2026 Engineering Benchmark Report finds that while 88.3% of developers use AI regularly, AI-generated pull requests merge at low rates (32.7%), run larger, and undergo prolonged reviews. The report attributes this to systemic issues such as poor data quality, inadequate policies, and organizational gaps, and recommends governance, smaller focused PRs, and stronger foundational practices to realize AI's potential in engineering workflows.

More Dev Interrupted episodes