The conversation focuses on the difficulties of moving AI and machine learning models from experimental stages into production, emphasizing the importance of infrastructure planning and scalability. Teams often prioritize solving a specific problem or proving a concept without considering the complexities of long-term deployment. As AI adoption expands, there is a growing need to shift from experimentation to scaling, which requires robust MLOps practices. The discussion examines different deployment models, such as third-party APIs, managed GPU services, and self-hosting, each with different trade-offs in cost, performance, and control. Self-hosting provides the most control and flexibility but demands extensive infrastructure setup, including Kubernetes, GPU orchestration, and auto-scaling, which adds significant operational complexity.
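To make the self-hosting complexity concrete, a minimal sketch of what a Kubernetes deployment for a GPU-backed model server might look like is shown below. The image name, replica count, and port are hypothetical placeholders; the `nvidia.com/gpu` resource request is the standard way to schedule a pod onto a GPU node (it assumes the NVIDIA device plugin is installed on the cluster).

```yaml
# Hypothetical sketch: a GPU-backed model-serving Deployment.
# Image name, labels, and port are illustrative, not from the source.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2                      # scaled further by an autoscaler in practice
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: inference
          image: example.com/inference-server:latest   # placeholder image
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1    # request one GPU per replica
```

Even this minimal fragment hints at the surrounding work the discussion refers to: node pools with GPU drivers, an autoscaler tuned for expensive GPU capacity, and rollout strategies that avoid dropping in-flight requests.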
The choice of infrastructure is influenced by the type of workload, such as long-form generation, summarization, or interactive chat, each of which has distinct performance and cost requirements. The conversation highlights key performance metrics, such as time to first token, inter-token latency, and goodput, as critical for optimizing model serving. Techniques like model quantization, kernel optimizations, and separating the pre-fill and decode phases are discussed as ways to improve efficiency. Overall, the discussion stresses the need to align deployment strategies with specific use cases and user expectations to achieve effective and efficient AI model serving.
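The two streaming metrics named above are straightforward to measure client-side: time to first token (TTFT) is the delay from sending the request until the first token arrives, and inter-token latency (ITL) is the average gap between subsequent tokens. A minimal sketch, using a simulated token stream in place of a real model API (the `fake_stream` generator is a hypothetical stand-in):

```python
import time

def measure_streaming_latency(token_stream):
    """Compute TTFT and mean ITL from an iterable that yields tokens
    as they arrive (e.g. a streaming inference response)."""
    start = time.perf_counter()
    arrival_times = []
    for _ in token_stream:
        arrival_times.append(time.perf_counter())
    if not arrival_times:
        return None
    ttft = arrival_times[0] - start                      # time to first token
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0         # mean inter-token latency
    return {"ttft_s": ttft, "itl_s": itl, "tokens": len(arrival_times)}

def fake_stream(n=5, delay=0.01):
    """Hypothetical stand-in for a model's streaming response."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

if __name__ == "__main__":
    print(measure_streaming_latency(fake_stream()))
```

Chat-like workloads tend to optimize TTFT (the user sees a response begin quickly), while long-form generation cares more about ITL and total throughput, which is one reason the discussion ties metric choice to workload type.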