More MLOps.community episodes


How We Cut LLM Latency 70% With TensorRT in Production

Published 10 Apr 2026

Duration: 01:05:20

Optimizing AI systems with TensorRT-LLM, efficient GPU use, cold-start management with Amazon FSx, and model quantization. The episode also covers the challenges of in-house development, scaling strategies, hidden scaling complexities (the "AI iceberg"), and balancing technical efficiency with organizational alignment through frameworks like the flywheel and responsible AI practices.

Episode Description

Maher Hanafi is an engineering leader who went from zero AI experience to self-hosting LLMs at enterprise scale, managing GPU costs and optimizing inferen...

Overview

The episode explores strategies for optimizing AI systems, emphasizing efficiency, cost management, and scalability. Techniques such as TensorRT-LLM reduced latency by up to 70% through hardware-specific optimizations, while model quantization and GPU packing maximized throughput and minimized resource usage. Cold-start time was addressed via preloaded container images and faster storage such as Amazon FSx, alongside managing GPU initialization delays. Challenges in in-house AI development included balancing performance, latency, accuracy, and cost, with a focus on GPU selection, model fine-tuning, and architecture design. Scaling strategies such as scheduled, dynamic, and proactive GPU allocation were tailored to traffic patterns, particularly in low-usage domains like HR tech. The "AI iceberg" concept highlighted invisible complexities such as cost, latency, and response quality, which require tailored trade-offs for specific use cases.
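The scheduled/dynamic/proactive split described above can be sketched as a single policy function that combines all three signals into one GPU target. This is a minimal illustration, not the system discussed in the episode; the function name, thresholds, and per-GPU capacity figure are all assumptions.

```python
from datetime import datetime

def target_gpu_count(now: datetime, queue_depth: int, forecast_rps: float,
                     capacity_per_gpu: float = 2.0,
                     base: int = 1, max_gpus: int = 8) -> int:
    """Combine three scaling signals into one GPU target.

    scheduled: keep a larger floor during business hours,
    dynamic:   react to the current request queue,
    proactive: pre-provision for forecast traffic.
    All thresholds here are illustrative, not from the episode.
    """
    # Scheduled: HR-tech traffic concentrates in working hours,
    # so hold a higher floor between 09:00 and 18:00.
    floor = base + 1 if 9 <= now.hour < 18 else base
    # Dynamic: add one GPU per 10 queued requests.
    dynamic = queue_depth // 10
    # Proactive: provision for forecast requests/sec ahead of the spike.
    proactive = int(forecast_rps / capacity_per_gpu)
    return max(floor, min(floor + dynamic + proactive, max_gpus))
```

In this sketch the scheduled floor guarantees capacity even when the queue is empty, while the cap at `max_gpus` keeps a traffic burst from running up the GPU bill unbounded.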

Iterative optimization and collaboration across teams were critical, with an emphasis on learning alongside engineers and aligning AI initiatives with business goals. The "flywheel framework" guided planning, building, and refining AI projects to ensure high impact with manageable effort. Cost savings were prioritized through strategic GPU upgrades and dynamic scaling, while tools like an LLM proxy enabled efficient load balancing based on prefill/decode needs. Challenges included multilingual support, model hallucination, and ensuring transparency and compliance in AI outputs. The episode also underscored the need for responsible AI practices, human oversight, and iterative testing to refine systems and align with user expectations, balancing technical innovation with practical deployment constraints.
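Prefill/decode-aware load balancing can be sketched with a simple routing heuristic: prefill work scales with prompt length and is compute-bound, while decode work scales with generated tokens and is memory-bandwidth-bound, so requests with very different token profiles benefit from different backend pools. The pool names, the ratio heuristic, and the threshold below are assumptions for illustration, not the actual proxy's logic.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int       # work dominated by the prefill phase
    max_output_tokens: int   # work dominated by the decode phase

def route(req: Request, ratio_threshold: float = 4.0) -> str:
    """Pick a backend pool based on the request's compute profile.

    Long-prompt/short-answer requests (e.g. summarization, RAG) are
    prefill-heavy; long generations are decode-heavy. Sending each
    kind to a pool tuned for it keeps GPUs better utilized.
    Pool names and the threshold are illustrative.
    """
    ratio = req.prompt_tokens / max(req.max_output_tokens, 1)
    if ratio >= ratio_threshold:
        return "prefill-optimized-pool"
    return "decode-optimized-pool"

# Example: a RAG query with a large retrieved context
print(route(Request(prompt_tokens=4000, max_output_tokens=200)))
# -> prefill-optimized-pool
```

A real proxy would also weigh current pool load and KV-cache pressure, but the token-ratio split captures the core idea of balancing by phase rather than by raw request count.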

Recent Episodes of MLOps.community

7 Apr 2026 Getting Humans Out of the Way: How to Work with Teams of Agents

Recommended: An optimistic view of using Agentic AI with safeguards.

AI agents streamline software development through tools like pixel diff analysis, automated reporting, and annotated walkthroughs, addressing challenges in accuracy, code quality, and workflow adaptation while redefining human roles as validation overseers and collaborators in autonomous systems.

3 Apr 2026 Fixing GPU Starvation in Large-Scale Distributed Training

Optimizing ML workflows requires addressing data bottlenecks through caching, efficient structuring, and hardware-aware strategies to reduce remote data calls, minimize GPU-CPU overhead, and prioritize infrastructure over model tuning, while managing trade-offs between training efficiency and serving latency.

31 Mar 2026 This One Shift Makes Developers Obsolete

Processing live-stream data involves transcription, AI-driven skill categorization, GitHub organization, multimedia-comment correlation, and knowledge graphs, while addressing redundancy and AI costs alongside MLOps trends: AI agent debates, adversarial workflows, security risks, and tooling like Open Claw and Agent Zero.

30 Mar 2026 Operationalizing AI Agents: From Experimentation to Production // Databricks Roundtable

Deploying AI agents in real-world systems demands robust safety protocols, human oversight, and structured testing to address risks like errors and vulnerabilities, while balancing innovation with responsibility through observability, governance, domain expertise, and tools like MLflow, across use cases from workflow automation to critical system reliability.

27 Mar 2026 arrowspace: Vector Spaces and Graph Wiring

Epiplexity introduces a framework redefining entropy and complexity with structural information, while topological search and graph-based methods enhance semantic accuracy in machine learning by preserving data through high-dimensional embeddings and hybrid geometric-topological analysis, outperforming traditional approaches in retrieval and reasoning tasks.
