
Fixing GPU Starvation in Large-Scale Distributed Training

Published 3 Apr 2026

Duration: 00:52:48

Optimizing ML workflows requires addressing data bottlenecks through caching, efficient structuring, and hardware-aware strategies to reduce remote data calls, minimize GPU-CPU overhead, and prioritize infrastructure over model tuning, while managing trade-offs between training efficiency and serving latency.

Episode Description

Kashish Mittal is a Staff Software Engineer at Uber, working on large-scale distributed systems and core backend infrastructure.

Overview

The podcast discusses critical challenges in data handling and infrastructure efficiency within machine learning (ML) workflows. Key focus areas include optimizing data caching by storing datasets locally (in GPU/CPU-accessible storage) during the first epoch to avoid redundant remote calls, which reduces latency and resource overhead. It highlights inefficiencies in reading Parquet files, such as filtering data only after a full read, as a major bottleneck in ML pipelines. Emphasis is placed on maximizing GPU utilization (above 80%) given the cost and scarcity of GPUs, stressing the importance of data pipeline design over model architecture for scalability. The discussion extends to software engineering shifts driven by hardware advancements, where infrastructure constraints, such as data pipelines, often hinder progress more than model optimizations. Balancing GPU efficiency with rapid iteration is framed as essential to avoid resource waste that slows development.

Industry-wide challenges include suboptimal data practices, such as inefficient GPU data transfers, and the risk of future economic trade-offs as cloud compute becomes more expensive. Case studies, such as Google's YouTube models underutilizing A100 GPUs, underscore the need for data restructuring (e.g., batch processing, RAM-based loading) and hardware-aware optimization strategies like flattening tensors. Misconceptions about low performance being attributed to models rather than infrastructure are addressed, with universal bottlenecks identified across CPUs, GPUs, and TPUs. Practical solutions, such as caching data in NumPy format to bypass translation overhead and using per-worker queues for deterministic training, are highlighted. The conversation also touches on broader trade-offs between training and serving, the role of hybrid CPU/GPU approaches, and the importance of reproducibility in parallel systems.
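The NumPy-format caching idea mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not the speaker's implementation: the function and file names are hypothetical, and `decode_source` stands in for whatever expensive parsing (Parquet, JSON, etc.) the real pipeline does. The first epoch pays the decode cost once and persists the result as a raw `.npy` file; later epochs memory-map it, skipping format translation entirely.

```python
import os
import tempfile
import numpy as np

def load_cached(cache_path, decode_fn):
    """Return training data as a NumPy array, decoding the source only
    the first time; later calls memory-map the cached .npy file."""
    if not os.path.exists(cache_path):
        arr = decode_fn()           # expensive: parse/convert the source format
        np.save(cache_path, arr)    # persist in raw NumPy layout
    # mmap_mode="r" pages data in lazily instead of copying it all into RAM.
    return np.load(cache_path, mmap_mode="r")

# Hypothetical decoder standing in for real format translation.
def decode_source():
    return np.arange(12, dtype=np.float32).reshape(4, 3)

cache = os.path.join(tempfile.mkdtemp(), "train.npy")
first = load_cached(cache, decode_source)   # epoch 1: decodes and caches
second = load_cached(cache, decode_source)  # epoch 2+: served from cache
```

Because `.npy` stores the array's exact memory layout, the cached read involves no per-element translation, which is the overhead the episode singles out.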

The podcast concludes with insights into emerging trends, such as the evolving use of AI agents for coding and workflow automation, and the necessity of clear documentation and critical thinking frameworks to improve human-AI collaboration. Challenges in debugging, documentation parsing, and the balance between speed and accuracy in AI responses are acknowledged. Overall, the discussion emphasizes that addressing data bottlenecks, aligning infrastructure with hardware capabilities, and fostering efficient practices are pivotal to advancing ML performance and scalability.

Recent Episodes of MLOps.community

7 Apr 2026 Getting Humans Out of the Way: How to Work with Teams of Agents

Recommended: An optimistic view of using Agentic AI with safeguards.

AI agents streamline software development through tools like pixel diff analysis, automated reporting, and annotated walkthroughs, addressing challenges in accuracy, code quality, and workflow adaptation while redefining human roles as validation overseers and collaborators in autonomous systems.

31 Mar 2026 This One Shift Makes Developers Obsolete

Processing live stream data involves transcription, AI-driven skill categorization, GitHub organization, multimedia-comment correlation, and knowledge graphs, while addressing redundancy, AI costs, MLOps trends, AI agent debates, adversarial workflows, security risks, and tooling like Open Claw and Agent Zero.

30 Mar 2026 Operationalizing AI Agents: From Experimentation to Production // Databricks Roundtable

Deploying AI agents in real-world systems demands robust safety protocols, human oversight, and structured testing to address risks like errors and vulnerabilities, while balancing innovation with responsibility through observability, governance, domain expertise, and tools like MLflow, across use cases from workflow automation to critical system reliability.

27 Mar 2026 arrowspace: Vector Spaces and Graph Wiring

Epiplexity introduces a framework redefining entropy and complexity with structural information, while topological search and graph-based methods enhance semantic accuracy in machine learning by preserving data through high-dimensional embeddings and hybrid geometric-topological analysis, outperforming traditional approaches in retrieval and reasoning tasks.

20 Mar 2026 Agentic Marketplace

AI-driven agent systems in OLX's classifieds marketplace aim to innovate user experiences by overcoming UI constraints through dynamic intent extraction, hybrid chat/UI models, and trust-building in real estate and motors, with future focus on logistics automation, secure transactions, and human-agent integration.
