More The Kyle Rowland Podcast episodes


Delivering Safely At Scale

Published 31 Mar 2026

Recommended: A useful overview of 'what to think about' as you scale, and a good intro to OpenTelemetry.

Duration: 54:29

Scaling software delivery demands automation, rigorous testing, safe deployment practices, cross-team coordination, pre-deployment quality checks, staged rollouts, centralized telemetry for monitoring, infrastructure risk management, chaos engineering, and balancing innovation with reliability through metrics-driven, decentralized processes.

Episode Description

Delivering at scale is a challenging problem. Kevin Fleming joins us to give us some tips and tricks that will make our software delivery safer and mo...

Overview

The podcast explores critical aspects of scaling software delivery, emphasizing automation, rigorous testing, and safe deployment practices to prevent outages in mission-critical systems like healthcare and aviation. Key challenges include coordinating cross-team workflows, managing dependencies, and ensuring quality through pre-deployment checks for changes such as configuration deployments and feature enablements, which account for over 90% of outages. Staged rollouts, such as canary and pilot regions, are highlighted as essential, paired with automation that halts deployments based on health signals to mitigate risks. Centralized telemetry systems like Microsoft's Geneva and open-source tools like OpenTelemetry are stressed for monitoring metrics, logs, and traces, enabling teams to define and emit service-specific data for visibility and safety.

The discussion underscores the complexity of fragmented tooling across deployment layers (e.g., Azure Resource Manager vs. lower-level infrastructure) and the need for adaptability despite centralized systems. Avoiding single points of failure (SPOFs) through redundancy and careful configuration reviews is emphasized, particularly in global deployments. Automated testing, chaos engineering, and decentralized decision-making at the team level are presented as strategies to balance innovation with reliability. The podcast also highlights the shift toward prioritizing quality over customer commitments due to the high stakes of failures in cloud operations, with budget allocations and team autonomy playing a role in managing quality, security, and feature development.

Finally, telemetry is presented as critical for proactive monitoring, incident response, and AI-driven root cause analysis, helping teams identify systemic issues and prevent customer-impacting incidents. Practices like staged rollouts, real-world customer simulations, and postmortems underscore the importance of balancing scalability with proactive measures to ensure reliability in distributed systems. The conversation reflects a maturity in cloud operations focused on transparency, resilience, and engineering practices that integrate quality into daily workflows rather than relying on top-down mandates.

Final Notes

The episode offers several insights relevant to software delivery, distributed systems, and quality assurance. The main points are:

Challenges in Large Organizations

  • Leadership challenges in balancing coordination across teams, influencing workflows, and managing dependencies.
  • Critical priorities for ensuring quality and safe deployment to avoid outages in critical systems.

Safe Deployment Practices

  • Root cause of outages: 90%+ of outages due to changes (e.g., config deployments, feature enablements).
  • Pre-deployment processes: rigorous checks, rollback capabilities, and validation of blast radius.
  • Deployment workflow: staged rollouts, canary regions, and automated deployment.
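
One way to picture the staged-rollout and blast-radius points above is a small planner that orders regions before anything is deployed. This is a minimal sketch, not the process described in the episode: the `Region` type, the `kind` labels, the customer counts, and the 25% cap per stage are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    kind: str        # hypothetical labels: "canary", "pilot", or "production"
    customers: int   # rough customer count, used here as a proxy for blast radius


# Lower rank deploys earlier.
STAGE_RANK = {"canary": 0, "pilot": 1, "production": 2}


def plan_rollout(regions, max_stage_fraction=0.25):
    """Group regions into ordered stages.

    Canary and pilot regions each get their own stage. Production regions
    follow from smallest to largest and are batched so that a stage does not
    exceed max_stage_fraction of all customers. A single region larger than
    the cap still deploys, but alone in its own stage.
    """
    total = sum(r.customers for r in regions)
    ordered = sorted(regions, key=lambda r: (STAGE_RANK[r.kind], r.customers, r.name))

    stages, current, current_customers = [], [], 0
    for region in ordered:
        alone = region.kind != "production"
        over_cap = current_customers + region.customers > max_stage_fraction * total
        previous_was_early = bool(current) and current[-1].kind != "production"
        if current and (alone or over_cap or previous_was_early):
            stages.append(current)
            current, current_customers = [], 0
        current.append(region)
        current_customers += region.customers
    if current:
        stages.append(current)
    return stages


regions = [
    Region("big", "production", 720),
    Region("small-a", "production", 100),
    Region("canary-1", "canary", 10),
    Region("small-b", "production", 120),
    Region("pilot-1", "pilot", 50),
]
for number, stage in enumerate(plan_rollout(regions), start=1):
    print(number, [r.name for r in stage])
```

With these made-up numbers the heaviest region lands in the last stage, after the change has already been exercised in every smaller one.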

Centralized Telemetry and Metrics

  • Use of centralized systems like Geneva for logs, metrics, and operational data.
  • External tools: Prometheus, log analytics (e.g., for external cloud providers).
  • Responsibility for metrics: teams define and emit metrics for their services.
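
To make "teams define and emit metrics for their services" concrete, here is a minimal, standard-library-only sketch of a counter emitter. It deliberately does not use Geneva, Prometheus, or the OpenTelemetry SDK; `MetricEmitter`, the record fields, and the JSON-lines output are hypothetical stand-ins for whatever a central telemetry pipeline actually accepts.

```python
import json
import time
from collections import defaultdict


class MetricEmitter:
    """Hypothetical in-process emitter: counts events per metric name and
    dimension set, then serializes them as JSON lines for a collector."""

    def __init__(self, service):
        self.service = service
        self._counters = defaultdict(int)

    def increment(self, name, value=1, **dimensions):
        # Dimension values are stored as strings so keys sort predictably.
        key = (name, tuple(sorted((k, str(v)) for k, v in dimensions.items())))
        self._counters[key] += value

    def flush(self, now=None):
        """Return one JSON line per counter and reset all counters."""
        timestamp = time.time() if now is None else now
        lines = []
        for (name, dims), value in sorted(self._counters.items(), key=lambda item: item[0]):
            record = {
                "service": self.service,
                "metric": name,
                "dimensions": dict(dims),
                "value": value,
                "timestamp": timestamp,
            }
            lines.append(json.dumps(record, sort_keys=True))
        self._counters.clear()
        return lines


emitter = MetricEmitter("checkout")          # service name is made up
emitter.increment("requests", region="westus", status=200)
emitter.increment("requests", region="westus", status=200)
emitter.increment("requests", region="westus", status=500)
for line in emitter.flush():
    print(line)
```

The point of the sketch is the ownership split: the service team decides which names and dimensions matter, while the central system only needs a consistent record shape.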

Complexity of Tooling and Infrastructure

  • Different tools for deploying at different layers (e.g., Azure Resource Manager (ARM) vs. lower-level infrastructure tools).
  • Chaotic ecosystem due to fragmented tools and platforms.

Scaling and System Stability

  • Challenges in maintaining system stability while introducing new features and scaling to large customer bases.
  • Emphasis on balancing scalability with reliability and proactive issue detection.

AI in Monitoring and Root Cause Analysis

  • AI is being explored to analyze logs and correlate failures across systems, enabling faster mitigation of issues.
  • Proactive use of AI to identify root causes in distributed systems.

Distributed System Challenges

  • Difficulty in pinpointing whether outages stem from own code or external dependencies.
  • Need for tools to correlate failures, identify affected customers, and prioritize communication with impacted users.
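
The correlation need described above can be sketched as a simple grouping step: bucket failure events by the component they implicate, then list the customers affected by each. The `FailureEvent` fields and the convention that an internal failure names the service itself as its dependency are assumptions for illustration, not a description of any tool mentioned in the episode.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    customer_id: str
    service: str      # service that reported the failure
    dependency: str   # component the failing call targeted; equals `service`
                      # when the failure came from the service's own code


def correlate_failures(events):
    """Return (component, affected customers, event count) tuples, with the
    component affecting the most customers first."""
    customers = defaultdict(set)
    counts = defaultdict(int)
    for event in events:
        customers[event.dependency].add(event.customer_id)
        counts[event.dependency] += 1
    ranked = sorted(customers, key=lambda c: (-len(customers[c]), -counts[c], c))
    return [(c, sorted(customers[c]), counts[c]) for c in ranked]


events = [
    FailureEvent("c1", "api", "storage"),
    FailureEvent("c2", "api", "storage"),
    FailureEvent("c2", "billing", "storage"),
    FailureEvent("c3", "api", "api"),
]
for component, affected, count in correlate_failures(events):
    print(component, affected, count)
```

Here the shared `storage` dependency surfaces first because it touches the most customers, which also gives a ready-made list of whom to contact.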

Telemetry, Metrics, and Logging

  • Telemetry, metrics, and logging are foundational for visibility, applicable to both startups and large organizations.
  • Intelligent monitoring systems are critical for aggregating and analyzing this data effectively.

Intelligent Monitoring and Deployment

  • Linking deployment systems to health signals (e.g., metrics) enables safer, validated deployments.
  • Tools to detect failures early and halt problematic deployments to prevent cascading issues.
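
Linking a deployment system to health signals can be sketched as a loop that checks a metric after each stage and stops at the first regression. This is an illustration under stated assumptions: the `deploy`, `error_rate`, and `rollback` callables are hypothetical hooks, the error-rate threshold is arbitrary, and rolling back every stage that received the change is one possible policy among several. A real system would also wait for a bake period before reading the signal.

```python
def run_gated_rollout(stages, deploy, error_rate, rollback, baseline, tolerance=0.01):
    """Deploy stage by stage and halt on the first unhealthy stage.

    A stage is unhealthy when its observed error rate exceeds
    baseline + tolerance. On halt, every stage that received the change is
    rolled back, newest first, and later stages are never touched.
    """
    deployed = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if error_rate(stage) > baseline + tolerance:
            rolled_back = list(reversed(deployed))
            for done in rolled_back:
                rollback(done)
            return {"deployed": deployed, "halted_at": stage, "rolled_back": rolled_back}
    return {"deployed": deployed, "halted_at": None, "rolled_back": []}


# Made-up error rates standing in for a real health signal.
observed = {"canary": 0.002, "pilot": 0.05, "region-a": 0.001}
log = []
result = run_gated_rollout(
    stages=["canary", "pilot", "region-a"],
    deploy=lambda stage: log.append(("deploy", stage)),
    error_rate=lambda stage: observed[stage],
    rollback=lambda stage: log.append(("rollback", stage)),
    baseline=0.005,
)
print(result)
print(log)
```

In this run the pilot stage trips the gate, so `region-a` never receives the change.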

Customer-Level Testing and Simulations

  • Beyond synthetic testing, real-world customer scenarios are emphasized (e.g., simulating user workloads via VMs).
  • Aiming to replicate actual customer use cases for accurate validation of production systems.

Scaling Beyond Infrastructure

  • Scaling is not just about handling more users or requests but maintaining quality and reliability as systems grow.
  • Focus on proactive measures, including correlation tools, customer outreach, and adaptive AI solutions for complex environments.

Key Takeaways

  • Gradual deployment: avoid deploying to the most heavily used regions first; use canary strategies for testing.
  • Redundancy and SPOF awareness: even redundant systems require scrutiny to avoid SPOFs caused by configuration errors.
  • Telemetry importance: logs and metrics are essential for diagnosing and resolving deployment issues in cloud environments.
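
The SPOF point above lends itself to a small configuration review check: given the dependencies each replica is configured to use, anything that every replica shares is a candidate single point of failure, however many replicas exist. The replica and dependency names below are invented for illustration, and the input shape (replica name mapped to a set of dependency names) is an assumption.

```python
def find_shared_dependencies(replica_dependencies):
    """Return the dependencies that every replica relies on, sorted by name.

    Each returned dependency is a candidate SPOF: if it fails, all replicas
    fail together despite the redundancy.
    """
    dependency_sets = [set(deps) for deps in replica_dependencies.values()]
    if not dependency_sets:
        return []
    return sorted(set.intersection(*dependency_sets))


replicas = {
    "web-east": {"dns-east", "db-east", "config-store-global"},
    "web-west": {"dns-west", "db-west", "config-store-global"},
}
print(find_shared_dependencies(replicas))
```

A check like this only sees what the configuration declares, so it complements rather than replaces a human review of global deployments.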

These points are most relevant to readers working on scaling software delivery, safe deployment practices, centralized telemetry and metrics, and complex infrastructure management.

