How to Engineer AI Inference Systems with Philip Kiely

Published 30 Apr 2026

Duration: 00:54:47

AI inference deployment is accelerating, making inference engineering critical to optimizing generative models on advanced hardware and complex systems. The discipline must address latency, scalability, and modality-specific optimizations amid evolving industry trends and a fragmented but open-source-driven market.

Episode Description

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore w...

Overview

The podcast delves into the rapid evolution of AI inference compared to traditional fields like medicine and physics: whereas model training typically takes weeks or months, inference happens in real time, and new techniques can reach production within hours. It emphasizes the growing importance of inference engineering, which focuses on deploying and optimizing large generative models in real time, particularly as models scale to billions of parameters. This shift underscores the distinction between inference, which is central to AI-native companies, and earlier MLOps trends, as the complexity of inference grows with hardware demands, distributed systems, and strict latency requirements. Key challenges include managing technical constraints such as insufficient compute resources, system orchestration, and the need for interdisciplinary expertise in GPU programming, quantization, and model parallelism. The field is also marked by a fast research-to-implementation cycle, rivaling industries like high-frequency trading in speed.

The discussion highlights how inference engineering has evolved from a niche concern to an industry-critical discipline, driven by the practical demands of deploying AI at scale. Companies across sectors increasingly recognize that an inference strategy is necessary to balance performance, cost, and reliability, with direct impact on user experience and competitive advantage. The podcast outlines the spectrum of inference control, from limited user customization in closed systems to full flexibility in self-hosted deployments. It also addresses the transition from pay-per-token pricing to GPU-based infrastructure, influenced by cost, capacity, and scalability. Additionally, the role of specialized hardware, such as NVIDIA's Hopper GPUs, and the fragmented yet advancing open-source ecosystem of inference tools (e.g., vLLM, TensorRT) are explored, alongside trends like compute disaggregation and modality-specific optimizations for tasks such as vision or text-to-speech.
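The pay-per-token versus GPU-based trade-off mentioned above comes down to a break-even calculation on monthly token volume. The sketch below illustrates the arithmetic; all prices and rates are hypothetical assumptions for illustration, not figures from the episode.

```python
# Illustrative break-even sketch: pay-per-token API vs. a dedicated GPU.
# All prices below are hypothetical assumptions, not from the episode.

def monthly_token_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Cost of serving a workload through a pay-per-token API."""
    return tokens_per_month / 1_000_000 * price_per_million

def monthly_gpu_cost(gpu_hourly_rate: float, hours: float = 730) -> float:
    """Cost of renting a dedicated GPU for a month (~730 hours)."""
    return gpu_hourly_rate * hours

def breakeven_tokens(price_per_million: float, gpu_hourly_rate: float) -> float:
    """Monthly token volume at which a dedicated GPU becomes cheaper."""
    return monthly_gpu_cost(gpu_hourly_rate) / price_per_million * 1_000_000

# Example: $0.50 per million tokens vs. a $2.50/hour GPU.
volume = breakeven_tokens(price_per_million=0.50, gpu_hourly_rate=2.50)
print(f"Break-even at ~{volume / 1e9:.2f}B tokens/month")  # ~3.65B
```

In practice the break-even point also depends on achievable GPU utilization and throughput, which is exactly where the inference optimizations discussed in the episode change the economics.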

Finally, the episode addresses the growing demand for inference engineers, driven by the complexity of deploying and optimizing AI systems. It emphasizes the need for interdisciplinary expertise that bridges applied research and infrastructure, while acknowledging that hardware-specific optimizations limit full automation. The podcast further touches on the future of inference, including systems specialized for task-specific workloads, the rise of agent-based systems that require real-time inference, and the challenges of multimodal models. As AI models become more integral to product development, the strategic role of inference engineering in enabling efficient, reliable, and scalable AI applications is underscored, with implications for businesses seeking competitive differentiation through model-level innovation.

Recent Episodes of The TWIML AI Podcast

16 Apr 2026 How Capital One Delivers Multi-Agent Systems with Rashmi Shetty

Capital One's *Chat Concierge* multi-agentic AI system streamlines car-buying through self-reflection, real-time APIs, and LLM-driven workflows, addressing enterprise AI challenges like governance, scalability, and legacy system integration while prioritizing compliance, observability, and flexible platform adoption.

26 Mar 2026 The Race to Production-Grade Diffusion LLMs with Stefano Ermon

The episode traces generative models' evolution from early image generation to the stability of diffusion models, highlights Mercury II's advancements in speed and efficiency, and addresses ongoing challenges in scalability, multimodal integration, and future research on controllability and cross-modal unification.
