More The TWIML AI Podcast episodes

How to Engineer AI Inference Systems with Philip Kiely

Published 30 Apr 2026

Duration: 00:54:47

AI inference deployment is accelerating, emphasizing inference engineering's critical role in optimizing generative models with advanced hardware and complex systems, while addressing challenges like latency, scalability, and modality-specific optimizations amid evolving industry trends and fragmented yet open-source-driven markets.

Episode Description

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore w...

Overview

The podcast delves into the rapid evolution of AI inference compared to traditional fields like medicine and physics, where model training typically takes weeks or months, while AI inference can occur within hours. It emphasizes the growing importance of inference engineering, which focuses on deploying and optimizing large generative models in real time, particularly as models scale to billions of parameters. This shift underscores the distinction between inference, which is central to AI-native companies, and earlier MLOps trends, as the complexity of inference increases with hardware demands, distributed systems, and strict latency requirements. Key challenges include managing technical limitations like insufficient compute resources, system orchestration, and the need for interdisciplinary expertise in GPU programming, quantization, and model parallelism. The field is also marked by a fast research-to-implementation cycle, rivaling industries like high-frequency trading in speed.

The discussion highlights how inference engineering has evolved from a niche concern to a critical industry standard, driven by the practical needs of deploying AI at scale. Companies across various sectors increasingly recognize the necessity of inference strategies to balance performance, cost, and reliability, impacting user experience and competitive advantage. The podcast outlines the spectrum of inference control, from limited user customization in closed systems to full flexibility in self-hosted deployments. It also addresses the transition from pay-per-token models to GPU-based infrastructure, influenced by cost, capacity, and scalability. Additionally, the role of specialized hardware, such as Hopper GPUs, and the fragmented yet advancing open-source ecosystem for inference tools (e.g., vLLM, TensorRT) are explored, alongside trends like compute disaggregation and modality-specific optimizations for tasks like vision or text-to-speech.

Finally, the episode addresses the growing demand for inference engineers, driven by the complexity of deploying and optimizing AI systems. It emphasizes the need for interdisciplinary expertise to integrate applied research with infrastructure, while also acknowledging the limits of full automation due to hardware-specific optimizations. The podcast further touches on the future of inference, including the specialization of systems for task-specific workloads, the rise of agent-based systems requiring real-time inference, and the challenges of multimodal models. As AI models become more integral to product development, the strategic role of inference engineering in enabling efficient, reliable, and scalable AI applications is underscored, with implications for businesses seeking competitive differentiation through model-level innovation.

Recent Episodes of The TWIML AI Podcast

9 Jun 2026 Is RAG Dead? Lessons from Building AI for Tax Law with Alex Bowcut

The podcast examines Retrieval-Augmented Generation's evolving role in AI-driven tax compliance, focusing on Spheres AI's TRAM model, challenges in processing fragmented legal data, and the need for accurate citations, taxonomy integration, and real-time compliance automation via a global tax legislation index.

21 May 2026 Relational Foundation Models for Enterprise Data with Jure Leskovec

Relational foundation models and graph-based machine learning, like GNNs, enable accurate predictions on structured data across biomedical research and industries by capturing complex relationships, integrating multi-scale data, and overcoming traditional limitations through automated feature extraction and hybrid modeling.

7 May 2026 How to Find the Agent Failures Your Evals Miss with Scott Clark

Distributional employs post-production analytics, unsupervised learning, and LLMs to analyze agent traces, detect patterns and anti-patterns like hallucinations, address distributional shifts, and generate actionable insights for AI system refinement in security and enterprise settings, emphasizing adaptive analytics and domain expertise.

16 Apr 2026 How Capital One Delivers Multi-Agent Systems with Rashmi Shetty

Capital One's Chat Concierge multi-agent AI system streamlines car-buying through self-reflection, real-time APIs, and LLM-driven workflows, addressing enterprise AI challenges like governance, scalability, and legacy system integration while prioritizing compliance, observability, and flexible platform adoption.

26 Mar 2026 The Race to Production-Grade Diffusion LLMs with Stefano Ermon

The episode traces generative models' evolution from early image generation to diffusion models' stability, highlights Mercury II's advancements in speed and efficiency, and addresses ongoing challenges in scalability, multimodal integration, and future research in controllability and cross-modal unification.
