More MLOps.community episodes


Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable

Published 19 Feb 2026

Duration: 01:05:55

AI model deployment requires careful infrastructure and scalability planning to ensure a smooth transition from experimentation to production, weighing factors such as cost, performance, and control.

Episode Description

Roundtable CAST AI episode: Serving LLMs in Production: Performance, Cost & Scale. Join the Community: https://go.mlops.community/YTJoinIn Get the newsle...

Overview

The conversation focuses on the difficulties of moving AI and machine learning models from experimental stages into production, emphasizing the importance of infrastructure planning and scalability. Teams often prioritize solving specific problems or proving concepts without considering the complexities of long-term deployment. As AI adoption expands, there's a growing need to shift from experimentation to scaling, which requires robust MLOps practices. The discussion examines different deployment models, such as APIs, managed GPU services, and self-hosting, each with varying trade-offs in cost, performance, and control. Self-hosting provides the most control and flexibility but demands extensive infrastructure setup, including Kubernetes, GPU orchestration, and auto-scaling, presenting significant complexity.
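To make the cost side of this trade-off concrete, here is a rough back-of-the-envelope sketch comparing a pay-per-token API with self-hosted GPUs. Every number and function name in it (`api_monthly_cost`, `self_hosted_monthly_cost`, the $2 per million tokens price, the $2.50/hour GPU, 1,000 tokens/s per GPU at 50% utilization) is a hypothetical assumption, not a figure from the episode. It also ignores the engineering effort of running Kubernetes, GPU orchestration, and autoscaling, and it assumes GPUs are always on (no scale-to-zero).

```python
import math

HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def api_monthly_cost(tokens: float, usd_per_million_tokens: float) -> float:
    """Pay-per-token API: cost grows linearly with usage (hypothetical pricing)."""
    return tokens / 1_000_000 * usd_per_million_tokens


def self_hosted_monthly_cost(tokens: float, gpu_usd_per_hour: float,
                             tokens_per_gpu_second: float,
                             utilization: float) -> float:
    """Self-hosting: pay for whole, always-on GPUs whether busy or idle.

    Excludes ops/engineering cost; at least one GPU is always provisioned.
    """
    # Tokens one GPU can serve per month at the assumed average utilization.
    capacity_per_gpu = tokens_per_gpu_second * utilization * HOURS_PER_MONTH * 3600
    gpus = max(1, math.ceil(tokens / capacity_per_gpu))
    return gpus * gpu_usd_per_hour * HOURS_PER_MONTH


# Hypothetical comparison at low and high monthly volume.
for monthly_tokens in (1e8, 5e9):
    api = api_monthly_cost(monthly_tokens, usd_per_million_tokens=2.0)
    hosted = self_hosted_monthly_cost(monthly_tokens, gpu_usd_per_hour=2.5,
                                      tokens_per_gpu_second=1000, utilization=0.5)
    print(f"{monthly_tokens:.0e} tokens/month: API ${api:,.0f} vs self-hosted ${hosted:,.0f}")
```

Under these made-up inputs, the API is far cheaper at 100M tokens a month ($200 vs $1,825 for one mostly idle GPU), while at 5B tokens a month four GPUs ($7,300) undercut the API ($10,000). The crossover point moves with pricing, throughput, and utilization, which is why the workload shape matters.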

The choice of infrastructure depends on the type of workload, such as generation, summarization, or chat, each of which has distinct performance and cost requirements. The conversation highlights key performance metrics, such as time to first token, inter-token latency, and goodput, as critical for optimizing model serving. Techniques like model quantization, kernel optimizations, and separating the prefill and decode phases are discussed as ways to improve efficiency. Overall, the discussion stresses the need to align deployment strategies with specific use cases and user expectations to achieve effective and efficient AI model serving.
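As a rough illustration of how these metrics relate, the sketch below computes them from per-request token timestamps. The `RequestTrace` class, the `goodput` helper, the SLO thresholds, and the sample timings are hypothetical. Goodput here follows one common definition: requests per second that meet both a time-to-first-token SLO and an inter-token-latency SLO. This is as opposed to raw throughput, which counts every completed request.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical timing record for one streamed request (seconds)."""
    sent_at: float
    token_times: list  # arrival time of each generated token, in order

    @property
    def ttft(self) -> float:
        # Time to first token: mostly queueing plus the prefill phase.
        return self.token_times[0] - self.sent_at

    @property
    def itl(self) -> float:
        # Mean inter-token latency: average gap between consecutive tokens,
        # mostly driven by the decode phase. 0.0 if only one token arrived.
        gaps = [b - a for a, b in zip(self.token_times, self.token_times[1:])]
        return mean(gaps) if gaps else 0.0


def goodput(traces, window_s: float, ttft_slo: float, itl_slo: float) -> float:
    """Requests per second that met BOTH latency SLOs within the window."""
    ok = sum(1 for t in traces if t.ttft <= ttft_slo and t.itl <= itl_slo)
    return ok / window_s


# Made-up traces over a 2-second window.
traces = [
    RequestTrace(0.0, [0.2, 0.25, 0.30, 0.35]),  # fast start, fast decode
    RequestTrace(0.5, [1.5, 1.55, 1.6]),         # TTFT of 1.0 s misses a 0.5 s SLO
    RequestTrace(1.0, [1.3, 1.5, 1.7]),          # ITL of ~0.2 s misses a 0.1 s SLO
    RequestTrace(1.2, [1.4, 1.45]),              # meets both SLOs
]
print("throughput:", len(traces) / 2.0, "req/s")                       # 2.0
print("goodput:", goodput(traces, 2.0, ttft_slo=0.5, itl_slo=0.1), "req/s")  # 1.0
```

In this example the server completes 2 requests per second but only 1 per second within both SLOs. That gap is what optimizations such as separating prefill from decode aim to close.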

Recent Episodes of MLOps.community

31 Mar 2026 This One Shift Makes Developers Obsolete

Processing live stream data involves transcription, AI-driven skill categorization, GitHub organization, multimedia-comment correlation, and knowledge graphs; the episode also covers redundancy, AI costs, MLOps trends, AI agent debates, adversarial workflows, security risks, and tooling like Open Claw and Agent Zero.

30 Mar 2026 Operationalizing AI Agents: From Experimentation to Production // Databricks Roundtable

Deploying AI agents in real-world systems demands robust safety protocols, human oversight, and structured testing to address risks like errors and vulnerabilities, while balancing innovation with responsibility through observability, governance, domain expertise, and tools like MLflow, across use cases from workflow automation to critical system reliability.

27 Mar 2026 arrowspace: Vector Spaces and Graph Wiring

Epiplexity introduces a framework redefining entropy and complexity with structural information, while topological search and graph-based methods enhance semantic accuracy in machine learning by preserving data through high-dimensional embeddings and hybrid geometric-topological analysis, outperforming traditional approaches in retrieval and reasoning tasks.

20 Mar 2026 Agentic Marketplace

AI-driven agent systems in OLX's classifieds marketplace aim to innovate user experiences by overcoming UI constraints through dynamic intent extraction, hybrid chat/UI models, and trust-building in real estate and motors, with future focus on logistics automation, secure transactions, and human-agent integration.

17 Mar 2026 Durable Execution and Modern Distributed Systems

Temporal enhances developer productivity by enabling crash-proof workflows through deterministic programming models, separating business logic from fault tolerance, and simplifying distributed systems with durable execution, workflows, activities, and persistence layers like Cassandra/Postgres.
