More Software Engineering Radio episodes

Sahaj Garg on Low Latency AI

Published 14 Jan 2026

Duration: 54:47

A key factor in AI applications is reducing latency, which requires a balance between model size, accuracy, and performance, along with ongoing optimization and monitoring.

Episode Description

In this episode, Sahaj Garg, CTO of wispr.ai, joins SE Radio host Robert Blumen to talk about the challenges of building low-latency AI applications....

Overview

The podcast emphasizes the critical role of latency in applications, especially within AI systems, and how it affects user perception and overall experience. It explains that users are sensitive to both average and tail latency, and that acceptable thresholds vary depending on the specific application. The discussion explores various techniques aimed at reducing latency while preserving accuracy, such as speculative decoding and model distillation. It also outlines the challenges involved in managing AI workloads, including network delays, deployment configurations, and the necessity for ongoing monitoring and optimization.
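The gap between average and tail latency mentioned above can be made concrete with a small calculation. The following is a minimal sketch, not anything taken from the episode: the function name `latency_summary`, the nearest-rank percentile method, and the sample values are all illustrative assumptions.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize latency samples (milliseconds) as mean, p50, and p99.

    Hypothetical helper for illustration; uses the nearest-rank
    percentile definition on the sorted samples.
    """
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest rank: smallest 1-based rank covering p percent of samples,
        # computed with integer arithmetic to avoid float rounding.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
    }


# Assumed sample: 98 fast requests and 2 slow ones.
samples = [100] * 98 + [2000] * 2
summary = latency_summary(samples)
print(summary)
```

In this made-up sample the mean and median look healthy while the 99th percentile is dominated by the two slow requests, which is why tail latency is usually tracked separately from the average.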

The conversation further examines the trade-offs between model size, accuracy, and latency, highlighting the use of quantization and efficient compute architectures as ways to enhance performance. It underscores the importance of aligning AI outputs with user expectations, considering factors like the balance between response size and usability. The podcast also notes the value of tailoring AI systems to specific application contexts, such as voice dictation, recommendation systems, and interactive interfaces, to ensure optimal performance and user satisfaction.

Recent Episodes of Software Engineering Radio

13 May 2026 SE Radio 720: Martin Dilger on Understanding Eventsourcing

Recommended: Useful Architectural Pattern.

Event sourcing is a system design approach that records changes as sequential events to ensure historical traceability. The episode covers event modeling for aligning systems with human workflows, the contrast with CRUD architectures, slice-based design, event streams, and practical applications such as legacy modernization and workflow simplification.

6 May 2026 Birol Yildiz on Building an Agentic AI SRE

AI agents in SRE leverage autonomous decision-making, agentic search, and lightweight architectures to replace static runbooks, balancing autonomy with reliability challenges, context management, and human oversight in dynamic environments.

29 Apr 2026 Will Sentance on JS Modernization

JavaScript has evolved from a 1995 scripting language into a performance-optimized modern tool, balancing innovation with backward compatibility through TC39's incremental updates, browser advancements, and community-driven libraries. The episode covers key features like async/await and symbols, engine optimizations, and a design philosophy that prioritizes flexibility and user-driven standardization for large-scale frameworks.

23 Apr 2026 Eric Tschetter on Decoupling Observability

Recommended: Telemetry is important; avoiding vendor lock-in is even more important.

Observability in microservices emphasizes decoupled architectures over traditional frameworks to address vendor lock-in, data interoperability, and scalability challenges, while balancing unstructured telemetry management, query language standardization, and cross-team collaboration.

15 Apr 2026 Martin Kleppmann on Local-First Software

Local-first software combines local data storage with cloud collaboration to enable offline access, real-time editing, and seamless syncing via Automerge and CRDTs, prioritizing user control, privacy, and decentralized workflows, with a future focus on open standards and AI integration.