More Scaling DevTools episodes

Lawrence Jones from Incident.io @ AIE Europe: building an AI SRE

Published 14 Apr 2026

Duration: 00:09:26

This episode covers advances in AI-driven SRE: automated root cause analysis for complex production systems, the difficulty of scaling AI investigations when logging and data quality vary, the need for human expertise in contextual reasoning, and the handling of vast volumes of log telemetry. It also gives examples of AI uncovering undocumented system behaviors, and touches on centralized incident management, ambient analysis tools, and the upcoming AI Incident System (AIS).

Episode Description

Recorded at AI Engineers Europe. Lawrence Jones, an AI engineer at Incident.io, shares his experiences building an AI SRE. Links: Incident.io ht...

Overview

The podcast discusses the application of AI in Site Reliability Engineering (SRE), focusing on tools that manage complexity in production systems through automated root cause analysis and incident investigation. Key challenges include scaling investigations across large numbers of customer accounts and monitoring the AI's own performance, such as its accuracy and any degradation over time. AI's effectiveness is highlighted with a reported 85-90% accuracy in root cause analysis for well-configured systems, though results vary when logging fails, systems are misconfigured, or external data is inconsistent. The discussion emphasizes the need for human expertise to complement AI, particularly through runbooks and analysis playbooks that supply contextual reasoning and structure data for meaningful insights, which addresses AI's limitations in mathematical analysis and context-based interpretation.

Organizational context is critical: AI tools need knowledge of internal infrastructure and historical data to avoid spurious conclusions, which sets them apart from generic tools that lack domain-specific knowledge. Log telemetry challenges are also addressed, with an emphasis on structured summarization so that insights are not lost in overwhelming volumes of data. Product integration focuses on direct access to AI-driven investigations via desktop apps linked to tools like Court or Codex, with an emphasis on confidence scoring and alignment with incident management workflows. Examples include AI detecting undocumented system behaviors, such as a 750ms timeout in a telecom provider's documentation, and resolving issues faster than manual teams. The potential of ambient analysis tools to identify unpredictable patterns and anomalies is highlighted, alongside plans for the upcoming AI Incident System (AIS), which aims to detect issues that previously went unnoticed and to streamline collaboration through centralized data sharing during incidents.

Recent Episodes of Scaling DevTools

29 Mar 2026 Finding your first 10 customers, with Andy Lee from DeepTrace

Recommended: outreach strategies using LinkedIn, because building is easy and finding customers is hard.

Technical founders can secure early customers through targeted, high-volume outreach using personalized LinkedIn messaging, cold emails, and mentorship-driven engagement, leveraging their product expertise to overcome engineers' skepticism, with examples like DeepTrace showcasing engineering-focused solutions and response rate benchmarks.

22 Mar 2026 DatoCMS: bootstrapping to 6.5M ARR

DatoCMS prioritizes agile innovation and developer-focused simplicity, building a specialized headless CMS with a small remote team and reaching $6.5M in ARR through strategic partnerships, cost-effective pricing, and alignment with Jamstack trends, while emphasizing sustainable growth over rapid scaling.
