More Podcasts by InfoQ episodes


Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

Published 31 Mar 2026

Duration: 00:51:41

This episode highlights the importance of learning from real-world failures, rather than from synthetic chaos experiments alone, when building resilient software architectures. It emphasizes resilience engineering, adaptive strategies, and continuous improvement in complex systems.

Episode Description

In this podcast, Michael Stiefel spoke to Lorin Hochstein about how real-world failures provide insight into how software systems actually work. Our fi...

Overview

The discussion centers on software reliability engineering and on how architects and site reliability engineers (SREs) approach system resilience differently. SREs focus on managing real-world failures and incident response, while architects design systems to withstand disruptions by construction. Key challenges include bridging the gap between AI proof-of-concept environments and reliable production deployments, and reproducing complex, simultaneous real-world failures in synthetic chaos experiments. Tools like Chaos Monkey exemplify proactive resilience testing by simulating instance failures, but they cannot fully mirror the unpredictable nature of actual incidents. The conversation underscores the value of learning from real system failures through incident reviews and post-mortems, which reveal more about architectural shortcomings and human factors than controlled chaos experiments alone. Even so, knowledge sharing remains hindered by information overload and by architects who rarely engage with these lessons when designing systems.

The discussion also explores the trade-offs between system reliability and complexity, drawing parallels to biological systems such as the immune response, which can cause unintended harm when overactive. Resilience engineering focuses on preparing for unanticipated failures through adaptive strategies, such as robust fallback mechanisms and dynamic scaling, rather than on eliminating risk entirely. Feedback mechanisms in control systems, while critical for stability, can lead to cascading failures if mismanaged (for example, when a resource saturates). Human and organizational challenges, including communication barriers and the limits of individual accountability in large systems, further complicate reliability efforts. The conversation favors systemic analysis over blaming individuals and advocates a cultural shift toward blameless post-mortems and continuous improvement. The role of complexity in software reliability is also examined: incremental changes and historical debt can undermine resilience, so systems need ongoing adaptation to evolving environments and technologies. Finally, the dialogue addresses the inherent difficulty of software estimation and the need for systemic solutions to manage uncertainty, in contrast to the more predictable structures of fields like civil engineering.

Recent Episodes of Podcasts by InfoQ

16 Mar 2026 Andres Almiray on How to Release Any Software to Any OS with JReleaser

Discusses challenges in AI deployment from proof-of-concept to production, introduces JReleaser's multi-language release automation with digital signatures and cross-platform integrations, highlights the Common House Foundation's open-source support and regulatory adaptations, and explores automation, cloud integration, and community-driven maintenance strategies for project sustainability.

9 Mar 2026 Mindful Leadership in the Age of AI

Scaling technology initiatives from MVP to production requires a shift from project-based approaches to sustainable growth, overcoming legacy systems, AI integration, and cultural barriers.

2 Mar 2026 Frictionless DevEx with Nicole Forsgren

Software development workflows are being transformed by AI integration, requiring a reevaluation of traditional processes to balance speed, stability, and human-AI collaboration.
