More MLOps.community episodes

Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable

Published 19 Feb 2026

Duration: 01:05:55

Deploying AI models requires careful planning of infrastructure and scalability to ensure a smooth transition from experimentation to production, weighing factors such as cost, performance, and control.

Episode Description

Roundtable CAST AI episode: Serving LLMs in Production: Performance, Cost & Scale. Join the Community: https://go.mlops.community/YTJoinIn Get the newsle...

Overview

The conversation focuses on the difficulties of moving AI and machine learning models from experimental stages into production, emphasizing the importance of infrastructure planning and scalability. Teams often prioritize solving specific problems or proving concepts without considering the complexities of long-term deployment. As AI adoption expands, there's a growing need to shift from experimentation to scaling, which requires robust MLOps practices. The discussion examines different deployment models, such as APIs, managed GPU services, and self-hosting, each with varying trade-offs in cost, performance, and control. Self-hosting provides the most control and flexibility but demands extensive infrastructure setup, including Kubernetes, GPU orchestration, and auto-scaling, presenting significant complexity.

The choice of infrastructure is influenced by the type of workload, such as generative, summarization, or chat-like tasks, each of which has distinct performance and cost requirements. The conversation highlights key performance metrics, such as time to first token, inter-token latency, and goodput, as critical for optimizing model serving. Techniques like model quantization, kernel optimizations, and separating the pre-fill and decode phases are discussed as ways to improve efficiency. Overall, the discussion stresses the need to align deployment strategies with specific use cases and user expectations to achieve effective and efficient AI model serving.
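To make the three metrics named above concrete, here is a minimal Python sketch of how they might be computed from per-request timestamps. It is not taken from the episode: the `RequestTrace` structure, the function names, and the SLO thresholds are all hypothetical, and goodput is taken under one common definition (requests per second that meet every latency target), which may differ from the definition the speakers have in mind.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed response (hypothetical structure)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, ascending


def time_to_first_token(trace: RequestTrace) -> float:
    # Delay between sending the request and receiving the first output token.
    return trace.token_times[0] - trace.sent_at


def mean_inter_token_latency(trace: RequestTrace) -> float:
    # Average gap between consecutive output tokens after the first one.
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def goodput(traces: List[RequestTrace], window_s: float,
            ttft_slo_s: float, itl_slo_s: float) -> float:
    # Assumed definition: requests per second that met BOTH latency targets.
    ok = sum(
        1 for t in traces
        if time_to_first_token(t) <= ttft_slo_s
        and mean_inter_token_latency(t) <= itl_slo_s
    )
    return ok / window_s


if __name__ == "__main__":
    traces = [
        RequestTrace(0.0, [0.25, 0.30, 0.35, 0.40]),  # fast first token
        RequestTrace(1.0, [2.00, 2.05, 2.10]),        # slow first token
    ]
    # Illustrative thresholds only: 0.5 s to first token, 0.1 s between tokens.
    print(goodput(traces, window_s=10.0, ttft_slo_s=0.5, itl_slo_s=0.1))
```

In this sketch the second request streams quickly once it starts but misses the first-token target, so it adds to raw throughput without adding to goodput. That gap is the reason goodput is tracked separately from requests per second.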

Recent Episodes of MLOps.community

12 May 2026 The Latency Goldilocks Zone Explained

iFood's ILO AI agent leverages a Learning Context Model to deliver hyper-personalized food recommendations by integrating diverse AI techniques, navigating cultural nuances, and balancing familiar and novel choices while addressing multi-channel design, latency, scalability, data alignment, and experimental innovation challenges.

8 May 2026 Building MCP Before MCP Existed: Inside Despegar's Sofia Agent

Sofia, an AI-powered travel concierge built on a multi-agent system with decentralized collaboration, aims to streamline bookings, in-trip services, and personalized experiences through AI-driven automation, chat and voice interfaces, and orchestration layers, while expanding its capabilities and reducing friction in travel processes.

1 May 2026 Voice Agent Use Cases

Designing voice-based AI systems involves balancing user control with automation, managing the trade-off between speech quality and latency, creating intuitive interfaces for non-technical users, overcoming transcription and turn-taking challenges in real-world environments, and integrating hybrid models with domain-specific tuning. It also means using feedback loops to maintain compliance, user trust, and ethical safeguards in applications such as customer support and in dynamic environments.

24 Apr 2026 The Creator of Superpowers: Why Real Agentic Engineering Beats Vibe Coding

The text discusses using the Greenfield toolset to convert legacy code into structured specifications and the Superpowers framework to enhance AI agents through psychological persuasion techniques, emphasizing task decomposition, subagent roles, challenges in consistency and security, and future trends in agentic problem-solving and ethical AI development.

21 Apr 2026 It's 2026, and We're Still Talking Evals

Evaluations in AI product development must be integrated early, address real-world complexities, use nuanced metrics beyond accuracy, employ user-centric and iterative testing, leverage post-deployment data, and adapt tailored strategies to balance quality, domain-specific metrics, and system reliability.
