More Podcasts by InfoQ episodes

Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

Published 31 Mar 2026

Duration: 00:51:41

This episode argues that learning from real-world failures, rather than relying solely on synthetic chaos experiments, is key to building resilient software architectures. It emphasizes systemic resilience engineering, adaptive strategies, and continuous improvement in complex systems.

Episode Description

In this podcast, Michael Stiefel spoke to Lorin Hochstein about how real-world failures provide insight into how software systems actually work. Our fi...

Overview

The discussion centers on software reliability engineering, emphasizing how architects and site reliability engineers (SREs) approach system resilience differently. SREs focus on managing real-world failures and incident response, while architects design systems to inherently withstand disruptions. Key challenges include bridging the gap between AI proof-of-concept environments and reliable production deployments, as well as replicating complex, simultaneous real-world failures in synthetic chaos experiments. Tools like Chaos Monkey exemplify proactive resilience testing by simulating instance failures, but their limitations lie in their inability to fully mirror the unpredictable nature of actual incidents. The conversation underscores the value of learning from real system failures through incident reviews and post-mortems, which provide deeper insights into architectural shortcomings and human factors, rather than relying solely on controlled chaos experiments. Despite these efforts, knowledge sharing remains hindered by information overload and a lack of engagement from architects in applying lessons to system design.

The discussion also explores the trade-offs between system reliability and complexity, drawing parallels to biological systems like the immune response, which can cause unintended harm when overactive. Resilience engineering focuses on preparing for unanticipated failures through adaptive strategies, such as robust fallback mechanisms and dynamic scaling, rather than eliminating risks entirely. Feedback mechanisms in control systems, while critical for stability, can lead to cascading failures if mismanaged (e.g., resource saturation). Human and organizational challenges, including communication barriers and the limitations of individual accountability in large systems, further complicate reliability efforts. The conversation highlights the importance of systemic analysis over blaming individuals, advocating for cultural shifts toward blameless post-mortems and continuous improvement. Additionally, the role of complexity in software reliability is examined, noting that incremental changes and historical debt can undermine resilience, requiring ongoing adaptation to evolving environments and technologies. Finally, the dialogue addresses the inherent difficulty of software estimation and the need for systemic solutions to manage uncertainty, contrasting with the more predictable structures of fields like civil engineering.

Recent Episodes of Podcasts by InfoQ

4 May 2026 Roq: Leveraging Quarkus to Build Static Sites at the Speed of Go

Java's resurgence is fueled by performance gains, modern frameworks like Quarkus, and native compilation. It is exemplified by Roq, a lightweight static site generator that leverages Quarkus for dynamic rendering, Markdown content, and streamlined workflows, with future AI integration and open-source advancements.

20 Apr 2026 Engineering Stable, Secure and Scalable Platforms: A Conversation with Matthew Liste

Systems engineering and software development's evolution emphasizes hands-on learning, mentorship, and intuitive experience, while addressing AI's impact on apprenticeships, balancing abstraction with deep system understanding, managing risks in high-stakes sectors, navigating innovation-stability trade-offs, scaling complex systems, evolving engineer roles, customer feedback loops, fostering continuous learning and collaboration, and prioritizing craftsmanship, systemic thinking, and the synergy between technical precision and practicality.

13 Apr 2026 How SBOMs and Engineering Discipline Can Help You Avoid Trivy's Compromise

Strengthening mobile app security beyond minimal standards, leveraging Software Bill of Materials (SBOM) to address supply chain risks under legislative mandates like U.S. Executive Orders and the EU's Cyber Resilience Act, and utilizing tools such as CycloneDX and SPDX for dependency tracking, compliance, and mitigating supply chain attacks through improved tooling and practices like OIDC authentication.