More MLOps.community episodes

How We Cut LLM Latency 70% With TensorRT in Production

Published 10 Apr 2026

Duration: 01:05:20

This episode covers optimizing AI systems with TensorRT-LLM, efficient GPU use, cold-start management with Amazon FSx, and model quantization. It also addresses the challenges of in-house development, scaling strategies, hidden scaling complexities (the "AI iceberg"), and balancing technical efficiency with organizational alignment through frameworks such as the flywheel and responsible AI practices.

Episode Description

Maher Hanafi is an engineering leader who went from zero AI experience to self-hosting LLMs at enterprise scale, managing GPU costs, optimizing inferen...

Overview

The episode explores strategies for optimizing AI systems, with an emphasis on efficiency, cost management, and scalability. TensorRT-LLM reduced latency by up to 70% through hardware-specific optimizations, while model quantization and GPU packing increased throughput and reduced resource usage. Cold-start time was reduced with preloaded container images and faster storage such as Amazon FSx, together with work on GPU initialization delays. The challenges of in-house AI development included balancing performance, latency, accuracy, and cost, with a focus on GPU selection, model fine-tuning, and architecture design. Scaling strategies (scheduled, dynamic, and proactive GPU allocation) were matched to traffic patterns, particularly in low-usage domains like HR tech. The "AI iceberg" concept describes the complexities that sit below the visible surface of an AI feature, such as cost, latency, and response quality, each of which requires trade-offs tailored to the specific use case.
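The effect of quantization on GPU packing can be illustrated with simple arithmetic. The sketch below is not from the episode: the 7B-parameter model, the 80 GB GPU, and the 20 GB reserve are hypothetical values, and the helper names are made up for illustration. It counts weight memory only and ignores the KV cache and activations beyond the fixed reserve, so real capacity will be lower.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def replicas_per_gpu(num_params: float, precision: str,
                     gpu_memory_gb: float, reserve_gb: float) -> int:
    """How many copies of the weights fit after setting aside reserve_gb.

    reserve_gb is an assumed allowance for KV cache, activations, and
    runtime overhead; it is a placeholder, not a measured value.
    """
    usable_gb = gpu_memory_gb - reserve_gb
    return math.floor(usable_gb / weight_memory_gb(num_params, precision))


if __name__ == "__main__":
    # Hypothetical example: 7B parameters on an 80 GB GPU with 20 GB reserved.
    for precision in ("fp16", "int8", "int4"):
        size = weight_memory_gb(7e9, precision)
        fit = replicas_per_gpu(7e9, precision, gpu_memory_gb=80, reserve_gb=20)
        print(f"{precision}: {size:.1f} GB per copy, {fit} copies per GPU")
```

Halving the bytes per parameter roughly doubles the number of copies that fit, which is the arithmetic behind packing more models onto each GPU; accuracy impact has to be evaluated separately.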

Iterative optimization and collaboration across teams were critical, with an emphasis on leaders learning alongside their engineers and aligning AI initiatives with business goals. The "flywheel framework" guided planning, building, and refining AI projects so that they delivered high impact for manageable effort. Cost savings came from strategic GPU upgrades and dynamic scaling, while an LLM proxy balanced load according to each request's prefill and decode needs. Challenges included multilingual support, model hallucination, and ensuring transparency and compliance in AI outputs. The episode also underscores the need for responsible AI practices, human oversight, and iterative testing to refine systems and align them with user expectations, balancing technical innovation against practical deployment constraints.
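The summary does not describe how the LLM proxy decides where to send a request, so the following is only a minimal sketch of one possible approach. It assumes two backend pools, one tuned for prompt-heavy (prefill-bound) requests and one for generation-heavy (decode-bound) requests, and picks the least-loaded backend in the matching pool. The `Backend` class, the `classify` and `route` functions, and the ratio threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    pool: str          # "prefill" or "decode" (hypothetical pool labels)
    in_flight: int = 0  # requests currently being served


def classify(prompt_tokens: int, max_new_tokens: int,
             ratio_threshold: float = 4.0) -> str:
    """Label a request prefill-heavy when the prompt dwarfs the output.

    ratio_threshold is an assumed tuning knob, not a recommended value.
    """
    if prompt_tokens >= ratio_threshold * max(max_new_tokens, 1):
        return "prefill"
    return "decode"


def route(backends: List[Backend], prompt_tokens: int,
          max_new_tokens: int) -> Backend:
    """Send the request to the least-loaded backend in the matching pool."""
    pool = classify(prompt_tokens, max_new_tokens)
    # Fall back to every backend if the matching pool is empty.
    candidates = [b for b in backends if b.pool == pool] or list(backends)
    chosen = min(candidates, key=lambda b: b.in_flight)
    chosen.in_flight += 1
    return chosen


if __name__ == "__main__":
    fleet = [
        Backend("gpu-a", "prefill"),
        Backend("gpu-b", "prefill", in_flight=2),
        Backend("gpu-c", "decode"),
    ]
    print(route(fleet, prompt_tokens=4000, max_new_tokens=100).name)
    print(route(fleet, prompt_tokens=50, max_new_tokens=500).name)
```

A production proxy would also decrement `in_flight` when a request completes and would likely use measured queue time rather than a simple request count.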

Recent Episodes of MLOps.community

26 May 2026 Inside Just Eat's AI Lab: Voice Agents & Agentic Commerce

Just Eat Takeaway evolves through AI-driven innovation, voice interfaces, and wearables, focusing on agentic commerce, super apps, and no-app models while addressing privacy, device continuity, and logistics challenges like autonomous delivery.

19 May 2026 Autonomous Agents at Work: From OpenClaw Hype to Enterprise Reality

AI agents evolve from question-answering systems to autonomous task execution, requiring risk management through governance frameworks, security measures, human oversight, and ethical integration to address operational, compliance, and safety challenges while balancing AI capabilities with accountability.

15 May 2026 Agents are Just While Loops

Managing long-running agents requires state checkpointing and rehydration for fault tolerance, balancing durability with scalability via modular architectures, orchestration frameworks like Temporal, open standards, and simplified agent designs that separate concerns and leverage existing infrastructure.

12 May 2026 The Latency Goldilocks Zone Explained

iFood's ILO AI agent leverages a Learning Context Model to deliver hyper-personalized food recommendations by integrating diverse AI techniques, navigating cultural nuances, and balancing familiar and novel choices while addressing multi-channel design, latency, scalability, data alignment, and experimental innovation challenges.

8 May 2026 Building MCP Before MCP Existed: Inside Despegar's Sofia Agent

Sofia, an AI-powered travel concierge built on a multi-agent system with decentralized collaboration, aims to streamline bookings, in-trip services, and personalized experiences through AI-driven automation, chat and voice interfaces, and orchestration layers, while expanding capabilities and reducing friction in travel processes.