Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

Published 31 Mar 2026

Show Notes: soundcloud.com/infoq-channel/failure-as-a-means-to-build

Duration: 00:51:41

This episode argues that learning from real-world failures, rather than relying on synthetic chaos experiments alone, is central to building resilient software architectures. It emphasizes resilience engineering as a systemic practice, adaptive strategies, and continuous improvement in complex systems.

Episode Description

In this podcast, Michael Stiefel spoke to Lorin Hochstein about how real-world failures provide insight into how software systems actually work. Our fi...

Overview

The discussion centers on software reliability engineering and on how architects and site reliability engineers (SREs) approach system resilience differently. SREs focus on managing real-world failures and incident response, while architects design systems to withstand disruptions in the first place. Key challenges include bridging the gap between AI proof-of-concept environments and reliable production deployments, and reproducing complex, simultaneous real-world failures in synthetic chaos experiments. Tools like Chaos Monkey exemplify proactive resilience testing by simulating instance failures, but they cannot fully mirror the unpredictable nature of actual incidents.

The conversation underscores the value of learning from real system failures through incident reviews and post-mortems, which reveal more about architectural shortcomings and human factors than controlled chaos experiments do on their own. Even so, knowledge sharing remains hindered by information overload and by architects who do not engage with incident lessons or apply them to system design.

The discussion also explores the trade-offs between system reliability and complexity, drawing parallels to biological systems such as the immune response, which can cause unintended harm when overactive. Resilience engineering focuses on preparing for unanticipated failures through adaptive strategies, such as robust fallback mechanisms and dynamic scaling, rather than on eliminating risk entirely. Feedback mechanisms in control systems, while critical for stability, can lead to cascading failures if mismanaged (e.g., resource saturation).

Human and organizational challenges, including communication barriers and the limits of individual accountability in large systems, further complicate reliability efforts. The conversation favors systemic analysis over blaming individuals and advocates cultural shifts toward blameless post-mortems and continuous improvement. It also examines the role of complexity in software reliability, noting that incremental changes and debt accumulated over a system's history can undermine resilience, so systems need ongoing adaptation to evolving environments and technologies. Finally, the dialogue addresses the inherent difficulty of software estimation and the need for systemic solutions to manage uncertainty, in contrast with the more predictable structures of fields like civil engineering.
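The overview's remark that feedback can turn into a cascading failure through resource saturation can be made concrete with a toy model. The sketch below is an illustration only and is not taken from the episode: the function name `simulate`, its parameters, and the assumption that each failed request returns as a fixed number of retries on the next tick are all hypothetical simplifications.

```python
def simulate(base_load, capacity, spike, ticks, retries_per_failure):
    """Toy model: return the number of failed requests at each tick.

    All parameters are hypothetical. Requests beyond `capacity` fail, and
    each failure comes back as `retries_per_failure` requests on the next tick.
    """
    backlog = 0
    failures = []
    for tick in range(ticks):
        # Offered load = steady traffic + retries from the previous tick
        # + a one-off spike at tick 0.
        offered = base_load + backlog + (spike if tick == 0 else 0)
        served = min(offered, capacity)
        failed = offered - served
        # Feedback step: failures re-enter the system as retries.
        backlog = failed * retries_per_failure
        failures.append(failed)
    return failures


if __name__ == "__main__":
    # Same spike, same capacity; only the retry behavior differs.
    print(simulate(90, 100, 50, 5, retries_per_failure=0))  # [40, 0, 0, 0, 0]
    print(simulate(90, 100, 50, 5, retries_per_failure=1))  # [40, 30, 20, 10, 0]
    print(simulate(90, 100, 50, 5, retries_per_failure=2))  # [40, 70, 130, 250, 490]
```

In this model the same transient spike is absorbed when failures are dropped or retried once, but amplified retries keep the service saturated long after the spike has passed. Real systems are far messier, which is part of why the overview stresses learning from actual incidents.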

Recent Episodes of Podcasts by InfoQ

15 Jun 2026 Increasing Users Data Agency: From BlueSky's AT Protocol to the Local-First Software Movement

Discusses challenges in AI integration, the shift to modular cloud-native systems using Apache Parquet, decentralized infrastructures like Bluesky's AT Protocol, the Local-First movement prioritizing local data storage, Automerge for collaborative non-text files, retrofitting hurdles, and open standards to combat vendor lock-in.

8 Jun 2026 From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year

Covers how AI adoption in software delivery has evolved across architecture and collaboration amid rapid advancements, including the shift in coding tools from autocomplete to agentic modes, context engineering challenges, hybrid tool use, local model limitations, privacy concerns, and the need for formal validation and industry-academia collaboration to enhance agent autonomy and address reliability gaps.

1 Jun 2026 Requirements Analysis for Architects: A Conversation with Sonya Natanzon

Architects must balance technical and business priorities, prioritize user satisfaction and organizational goals, navigate communication challenges, apply domain-driven design principles, address AI's impact on software development, and adapt to evolving technologies while emphasizing creativity and strategic alignment.

25 May 2026 Chasing Efficient Java Development: From 1BRC to Developing Hardwood AI Natively

Covers AI and architecture challenges, Java's evolving ecosystem with its performance optimizations and legacy practices, columnar data formats like Parquet, dependency management, and balancing AI adoption with developer skill retention.

18 May 2026 Context is the Key to the Agentic Architecture Revolution: A Conversation with Baruch Sadogursky

AI adoption in architectural decision-making emphasizes trade-offs between efficiency and complexity, challenges of ambiguous requirements, context-driven engineering, frameworks like the Intent Integrity Kit for iterative clarity, architect roles in managing systems and stakeholder dynamics, and the need to balance AI capabilities with human oversight amid ethical and technical limitations.
