More MLOps.community episodes


Agents are Just While Loops

Published 15 May 2026

Duration: 00:41:11

Managing long-running agents requires checkpointing and rehydrating state for fault tolerance. The episode covers how to balance durability with scalability through modular architectures, orchestration frameworks like Temporal, open standards, and simpler agent designs that separate concerns and reuse existing infrastructure.

Episode Description

Hamza Tahir, co-founder of ZenML, joins the show to cut through the hype around long-running agents, arguing that at the end of the day, an agent is ju...

Overview

The episode explores the challenges of managing long-running agents, emphasizing the need to checkpoint state inside extended while loops so that an agent can recover from a failure and resume from the exact point of interruption. It treats "long running" as context-dependent, ranging from seconds to years, and stresses planning infrastructure for scalability. A central theme is the evolution of harness architectures: a basic "inner harness" is a simple loop that interacts with models and tools, while a more complex "outer harness" decomposes the system into modular components such as a decision-making "brain" and sandboxed "hands" for tool execution, with Anthropic's scalable approach cited as an example. The discussion also covers durability mechanisms, such as file system snapshots and memory states, that provide fault tolerance and avoid redundant work, and it uses analogies like the Game of Life to illustrate how nested loops form hierarchical agent systems.
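
As a rough illustration of the brain/hands split described above, the following Python sketch separates a decision-making component from a tool-executing one inside a bounded while loop. All names (`Action`, `ScriptedBrain`, `Hands`, `run_agent`) are hypothetical and are not taken from the episode or from any vendor's implementation. The brain here replays a fixed script instead of calling a model, and the hands run plain functions rather than a real sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    tool: str           # name of the tool to run, or "done" to stop
    argument: str = ""


class ScriptedBrain:
    """Stand-in for a model call: replays a fixed list of actions."""

    def __init__(self, script: List[Action]) -> None:
        self._script = list(script)

    def decide(self, transcript: List[str]) -> Action:
        # A real brain would send the transcript to a model here.
        return self._script.pop(0) if self._script else Action("done")


@dataclass
class Hands:
    """Executes tools; a real system would run these inside a sandbox."""

    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def execute(self, action: Action) -> str:
        if action.tool not in self.tools:
            return f"error: unknown tool {action.tool!r}"
        return self.tools[action.tool](action.argument)


def run_agent(brain: ScriptedBrain, hands: Hands, max_steps: int = 20) -> List[str]:
    transcript: List[str] = []
    steps = 0
    while steps < max_steps:  # the inner harness: a bounded while loop
        action = brain.decide(transcript)
        if action.tool == "done":
            break
        result = hands.execute(action)
        transcript.append(f"{action.tool}({action.argument}) -> {result}")
        steps += 1
    return transcript
```

Because the loop only talks to `decide` and `execute`, either side could in principle be moved to a separate process or machine without changing the loop itself, which is the property the outer-harness idea relies on.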

The episode then turns to open standards and infrastructure, advocating for interoperable, open-run harnesses that prevent vendor lock-in and broaden accessibility, while acknowledging the trade-offs between managed platforms and self-hosted solutions. It critiques current tools for their interoperability gaps and calls for horizontally scalable, open-runtime environments that keep models separate from harnesses to avoid monopolization. Key concepts include durable execution frameworks like Temporal and ZenML, which manage long-running workflows through reliability and replayability, alongside the challenges of persisting external state (e.g., databases) and reconciling agent workloads (e.g., coding agents) with durability needs like sandboxing and artifact storage. The discussion also contrasts distributed systems, which focus on reliability for extended workflows, with bursty container-based workloads, emphasizing their differing philosophies on scalability and latency management.
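
To make "replayability" concrete, here is a minimal standard-library sketch of the general idea behind durable execution: each completed step's result is recorded in a journal, and a restarted run returns recorded results instead of re-executing the step. This is not Temporal's or ZenML's API. The `Journal` class, the `step` method, and the JSON-lines file layout are assumptions made for illustration only.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


class Journal:
    """Append-only record of completed step results, stored as JSON lines."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.results: Dict[str, Any] = {}
        if path.exists():
            # Rebuild the in-memory view from whatever a previous run recorded.
            for line in path.read_text().splitlines():
                entry = json.loads(line)
                self.results[entry["step"]] = entry["result"]

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # On replay, return the recorded result instead of re-running the step.
        if name in self.results:
            return self.results[name]
        result = fn()  # must be JSON-serializable in this sketch
        with self.path.open("a") as f:
            f.write(json.dumps({"step": name, "result": result}) + "\n")
        self.results[name] = result
        return result
```

A workflow written as a sequence of `journal.step("name", fn)` calls can be rerun from the top after a crash, and only the steps that never finished will execute again. The sketch does not handle a line that was only partially written during a crash, which a production system would need to address.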

Finally, the podcast examines deployment paradigms and developer experience, critiquing overly complex abstractions in SDKs and advocating for simplicity in agent architecture design. It highlights the importance of state management via artifact stores and dynamic workflows over static DAGs, while addressing challenges in integrating with existing tools and balancing durability with usability. The need for replayability, error recovery, and human-in-the-loop scenarios is underscored, alongside the role of community-driven open-source projects like Kitaru in advancing resilient, durable execution systems. Philosophical reflections on reducing complexity and avoiding over-engineering agent systems are interwoven with practical critiques of existing tools and the limitations of current orchestration approaches.

What If

  • What if you implemented a state checkpointing system using file system dumps every 10 iterations in your long-running agent?

    • Concrete move: Modify your while loop to write a JSON snapshot of the agent's state (e.g., progress, variables, tool calls) to local or cloud storage every 10 iterations.
    • Why now: Long-running agents are prone to failure (e.g., crashes, timeouts), and checkpointing ensures you avoid redundant work by resuming from the last saved state.
    • Expected upside: Reduced downtime and faster recovery, enabling your agent to handle hours-long tasks without losing progress.
  • What if you designed your harness as an open, modular component separate from your core business logic?

    • Concrete move: Create a standalone Python module (e.g., agent_harness.py) that handles tool calling, state management, and logging, while keeping your business logic in a separate core_business.py file.
    • Why now: Separating harness infrastructure from core logic improves maintainability and avoids vendor lock-in, making it easier to swap out tools or frameworks later.
    • Expected upside: Faster iteration on business logic and compatibility with open standards, reducing dependency on proprietary platforms.
  • What if you adopted a durable execution framework like Temporal to manage your agent's workflow orchestration?

    • Concrete move: Integrate Temporal's SDK into your agent's architecture to handle retries, state persistence, and distributed coordination automatically.
    • Why now: Temporal natively supports long-running workflows with guaranteed durability, eliminating the need to manually implement checkpointing or error recovery.
    • Expected upside: Reliable, scalable execution of complex agent workflows (e.g., multi-step coding tasks) with minimal custom infrastructure.
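
The first scenario above (a JSON snapshot every 10 iterations) could look roughly like the following Python sketch. The state shape (`iteration`, `total`), the function names, and the `crash_at` parameter used to simulate a failure are all assumptions made for illustration; the per-iteration "work" is a placeholder sum rather than a real model or tool call.

```python
import json
import os
from pathlib import Path
from typing import Optional

CHECKPOINT_EVERY = 10  # interval taken from the scenario above


def save_checkpoint(path: Path, state: dict) -> None:
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    # Rename over the old file so a crash cannot leave a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    return {"iteration": 0, "total": 0}


def run(path: Path, max_iterations: int, crash_at: Optional[int] = None) -> dict:
    state = load_checkpoint(path)  # rehydrate from the last snapshot, if any
    while state["iteration"] < max_iterations:
        if crash_at is not None and state["iteration"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["iteration"]  # placeholder for one unit of agent work
        state["iteration"] += 1
        if state["iteration"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With this interval, a crash loses at most nine iterations of work: a run that fails at iteration 25 resumes from the snapshot taken at iteration 20. Work that has external side effects (sent emails, database writes) would need extra care, since the repeated iterations would run those effects again.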

Takeaway

  • Implement state checkpointing in long-running loops by using file system dumps, memory snapshots, or transcript loading to ensure seamless recovery from interruptions, preventing redundant work and enabling exact resumption of processes.
  • Adopt an outer harness architecture by decomposing systems into separate components (e.g., "Brain" for decision-making and "Hands" for tool execution) to improve scalability and maintainability, even for solo developers.
  • Prioritize open-standard infrastructure to avoid vendor lock-in, ensuring interoperability and future-proofing your system, even if managed platforms are used for deployment.
  • Leverage durable execution frameworks like Temporal or ZenML to manage long-running workflows, ensuring reliability, replayability, and recovery through infrastructure-level retries and state preservation.
  • Use artifact stores for state management instead of in-place updates or queues, enabling dynamic workflows (e.g., hyperparameter sweeps) and simplifying error recovery in agent systems.
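
The last takeaway contrasts artifact stores with in-place updates. One possible reading, sketched below in Python, is a store that writes every saved state as a new numbered version and never overwrites an earlier one, so any earlier state can be reloaded for recovery or used as a branch point. The `ArtifactStore` class and its one-directory-per-artifact layout are hypothetical and do not reflect ZenML's or any other tool's actual interface.

```python
import json
from pathlib import Path
from typing import Any, List, Optional


class ArtifactStore:
    """Writes each saved value as a new numbered file; nothing is overwritten."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _versions(self, name: str) -> List[int]:
        directory = self.root / name
        if not directory.is_dir():
            return []
        return sorted(int(p.stem) for p in directory.glob("*.json"))

    def save(self, name: str, value: Any) -> int:
        versions = self._versions(name)
        version = versions[-1] + 1 if versions else 1
        target = self.root / name / f"{version}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(value))
        return version

    def load(self, name: str, version: Optional[int] = None) -> Any:
        if version is None:
            versions = self._versions(name)
            if not versions:
                raise KeyError(name)
            version = versions[-1]  # default to the most recent version
        return json.loads((self.root / name / f"{version}.json").read_text())
```

Because old versions stay on disk, several runs (for example, the members of a hyperparameter sweep) could each start from the same saved version without interfering with one another. The sketch assumes a single writer; concurrent writers would need a safer way to allocate version numbers.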

Recent Episodes of MLOps.community

19 May 2026 Autonomous Agents at Work: From OpenClaw Hype to Enterprise Reality

AI agents evolve from question-answering systems to autonomous task execution, requiring risk management through governance frameworks, security measures, human oversight, and ethical integration to address operational, compliance, and safety challenges while balancing AI capabilities with accountability.

12 May 2026 The Latency Goldilocks Zone Explained

iFood's ILO AI agent leverages a Learning Context Model to deliver hyper-personalized food recommendations by integrating diverse AI techniques, navigating cultural nuances, and balancing familiar and novel choices while addressing multi-channel design, latency, scalability, data alignment, and experimental innovation challenges.

8 May 2026 Building MCP Before MCP Existed: Inside Despegar's Sofia Agent

Sofia, an AI-powered travel concierge using a multi-agent system and decentralized collaboration, aims to streamline bookings, in-trip services, and personalized experiences through AI-driven automation, chat/voice interfaces, and orchestration layers, while expanding capabilities and reducing friction in travel processes.

1 May 2026 Voice Agent Use Cases

Designing voice-based AI systems involves balancing user control with automation, addressing speech quality-latency trade-offs, creating intuitive non-technical interfaces, overcoming transcription and turn-taking challenges in real-world environments, integrating hybrid models and domain-specific tuning, while ensuring compliance, user trust, and ethical considerations in applications like customer support and dynamic environments through feedback loops.

24 Apr 2026 The Creator of Superpowers: Why Real Agentic Engineering Beats Vibe Coding

The episode discusses using the Greenfield toolset to convert legacy code into structured specifications and the Superpowers framework to enhance AI agents through psychological persuasion techniques, emphasizing task decomposition, subagent roles, challenges in consistency and security, and future trends in agentic problem-solving and ethical AI development.
