More Software Engineering Radio episodes

Birol Yildiz on Building an Agentic AI SRE

Published 6 May 2026

Duration: 53:57

AI agents in SRE leverage autonomous decision-making, agentic search, and lightweight architectures to replace static runbooks, balancing autonomy with reliability challenges, context management, and human oversight in dynamic environments.

Episode Description

Birol Yildiz, CEO and co-founder of iLert, joins host Kanchan Shringi to explore how iLert built an AI SRE, an autonomous agent for handling production...

Overview

The episode explores the development and application of AI agents, which use large language models (LLMs) in reasoning loops to make autonomous decisions, diverging from rigid automated workflows. AI SRE (Site Reliability Engineering) systems, designed as agentic tools from inception, replace traditional runbooks with dynamic decision-making and have evolved alongside the Model Context Protocol (MCP), which enables agents to interface with external systems. Early implementations relied on prescriptive prompts but shifted toward model-driven approaches as LLM capabilities advanced. Challenges include balancing autonomy with prescriptive rules, adapting to rapidly evolving models, and ensuring reliability in dynamic environments like incident response. Agentic search, using command-line tools like grep and Zet, has emerged as a preferred method over vector databases for tasks like root cause analysis (RCA), though the choice between managing context dynamically and providing it upfront remains a challenge. Architectural layers such as knowledge storage (markdown/structured data), orchestration, and constraint management are critical, with agents leveraging sub-agents or forks to handle complex tasks without overloading the main reasoning loop.
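The agentic search idea described above can be illustrated with a minimal Python sketch. It is not iLert's implementation: the log contents, the grep helper, and the scripted_policy function (a deterministic stand-in for an LLM call) are all hypothetical, chosen only to show a reasoning loop that picks its next search from what earlier searches returned.

```python
import re

# Hypothetical log corpus standing in for files an agent could search.
LOGS = {
    "api.log": [
        "12:00:01 INFO request ok",
        "12:00:05 ERROR upstream timeout calling payments",
    ],
    "payments.log": [
        "12:00:04 WARN connection refused by network policy",
        "12:00:06 INFO retrying",
    ],
}


def grep(pattern: str, corpus: dict[str, list[str]]) -> list[str]:
    """Return 'file:line' for every line matching the regex, similar to grep -r."""
    rx = re.compile(pattern)
    return [
        f"{name}:{line}"
        for name, lines in corpus.items()
        for line in lines
        if rx.search(line)
    ]


def scripted_policy(history: list[str]) -> dict:
    """Stand-in for an LLM call: choose the next action from the findings so far."""
    if not history:
        return {"action": "grep", "pattern": "ERROR"}
    saw_payments = any("payments" in h for h in history)
    saw_policy = any("network policy" in h for h in history)
    if saw_payments and not saw_policy:
        # The error pointed at payments, so look for warnings there next.
        return {"action": "grep", "pattern": "WARN|refused"}
    return {"action": "finish", "summary": history[-1]}


def run_agent(policy, corpus, max_steps: int = 5) -> str:
    """Reasoning loop: ask the policy for an action, run the tool, feed results back."""
    history: list[str] = []
    for _ in range(max_steps):
        step = policy(history)
        if step["action"] == "finish":
            return step["summary"]
        history.extend(grep(step["pattern"], corpus))
    return "no conclusion within step budget"


if __name__ == "__main__":
    print(run_agent(scripted_policy, LOGS))
```

The step budget (max_steps) is one simple way to bound autonomy in such a loop. In a real system the policy would be a model call and the tool would run against actual files or log stores.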

Key applications include accelerating incident response, with AI SRE systems aiming to complete RCA in under four minutes, far faster than manual processes. However, validating the accuracy of automated RCA is complex because human benchmarks are inconsistent. The architecture emphasizes simplicity, avoiding over-engineering by letting LLMs handle reasoning directly, while prioritizing modular, low-cost models for tasks like alert triage. Testing relies on real-world data, semantic comparisons via LLMs, and BERT scores for evaluating output quality, though scalability and environment duplication remain hurdles. Challenges also include handling novel incidents without predefined runbooks, as seen in a case involving a Kubernetes network policy misconfiguration. Guardrails like data confidence checks, access restrictions, and verification sub-agents mitigate risks, while ethical and compliance considerations, such as GDPR adherence, shape data handling. The episode underscores the tension between rapid innovation and ensuring reliability, emphasizing the need for adaptive, context-aware systems that balance autonomy with safeguards.
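As a rough illustration of the guardrails mentioned above (confidence checks and access restrictions), the following Python sketch gates an agent's tool calls and findings. Every name here is an assumption made for the example: the tool allow-list, the 0.7 threshold, and the Finding type are hypothetical and are not taken from the episode.

```python
from dataclasses import dataclass

# Hypothetical allow-list: the agent may only call read-only tools.
READ_ONLY_TOOLS = {"get_logs", "get_metrics", "describe_resource"}

# Hypothetical threshold below which a finding is handed to a human.
MIN_CONFIDENCE = 0.7


@dataclass
class Finding:
    summary: str
    confidence: float    # self-reported or verifier-assigned score in [0, 1]
    evidence: list[str]  # log lines or metrics that support the summary


def authorize(tool_name: str) -> bool:
    """Access restriction: reject any tool that is not on the allow-list."""
    return tool_name in READ_ONLY_TOOLS


def review(finding: Finding) -> str:
    """Confidence check: report only findings that are evidenced and confident."""
    if not finding.evidence:
        return "escalate_to_human"
    if finding.confidence < MIN_CONFIDENCE:
        return "escalate_to_human"
    return "report"
```

For example, review(Finding("network policy blocks payments", 0.9, ["WARN connection refused"])) returns "report", while the same finding with an empty evidence list is escalated. A verification sub-agent could sit behind the same interface by assigning the confidence value.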

Recent Episodes of Software Engineering Radio

10 Jun 2026 Jure Leskovec on Relational Graph and Foundational Models

Predictive modeling faces challenges with AI's limitations in structured data, prompting solutions like graph databases and relational deep learning with attention mechanisms to enhance accuracy, scalability, and real-time updates for enterprise applications.

3 Jun 2026 Dave Airlie on Linux Kernel Maintenance

The Linux kernel, the largest global software project, uses a hierarchical maintainer system with 80,150 contributors managing subsystems like DRM through public review, structured development cycles, and evolving practices to address scalability, quality, and integration challenges.

27 May 2026 Dwayne McDaniel on the Engineering Challenges of Secrets Management

Managing secrets like credentials and API keys in software development risks leaks that cause supply chain attacks (e.g., PyPI, Clot, Cisco) due to secrets sprawl, plaintext storage, and misuse, prompting solutions like time-bound credentials, decentralized systems, vault tools (e.g., HashiCorp Vault), and strategies such as credential rotation and encrypted storage amid over 28.65 million hard-coded secrets in GitHub in 2025.

20 May 2026 Rob Moffat on Risk-First Software Development

Recommended: Risk identification and management is a forgotten art

Software development prioritizes risk management through frameworks like test-driven development and agile, addressing hidden risks, AI deployment challenges, open-source dependencies, and organizational prioritization to balance innovation with safeguards.

13 May 2026 SE Radio 720: Martin Dilger on Understanding Eventsourcing

Recommended: Useful Architectural Pattern.

Event sourcing is a system design approach that records changes as sequential events to ensure historical traceability, uses event modeling for aligning systems with human workflows, contrasts with CRUD architectures, and emphasizes slice-based design, event streams, and practical applications like legacy modernization and workflow simplification.