More How I AI episodes


How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

Published 15 Jun 2026

Duration: 00:40:11

AI agents in software engineering can take on complex tasks through benchmarking and optimization and shift engineers toward higher-level work, but they also raise challenges around reliability, data parsing, and balancing automation with human expertise. The episode emphasizes outcome-focused systems over procedural methods.

Episode Description

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe...

Overview

The podcast discusses the evolving role of AI in software engineering, focusing on its capacity to handle complex tasks through rigorous benchmarking and algorithm experimentation. Central themes include the advancement of AI agents in automating technical responsibilities, the development of "hard evals" to test AI performance in code generation, and the use of coding agents to optimize database queries and improve indexing techniques. Engineers are increasingly delegating repetitive or intricate tasks to AI, allowing them to focus on higher-level challenges, though concerns persist about AI potentially replacing human expertise. The discussion highlights the need for accurate knowledge bases and the challenges of parsing large datasets, as well as the limitations of existing column store implementations in databases.

Key debates revolve around balancing AI augmentation with human oversight, the practical versus theoretical quality of AI outputs, and the risks of over-reliance on automated systems. The importance of rigorous problem-solving is emphasized, from addressing technical debt to streamlining workflows with tools like cloud-based environments and port management solutions. AI's role in long-term experimentation, such as scaling infrastructure tests or optimizing column store performance, is contrasted with human limitations in sustained focus and attention. The podcast also critiques traditional benchmarking practices, advocating for continuous experimentation and quantifiable success metrics to refine AI-driven solutions.

Technical workflows for agents, including the use of scoring functions and safety measures in AI environments, are explored, alongside conceptual shifts in programming that prioritize outcome definition over procedural logic. Discussions on agentic commerce, where autonomous systems handle commercial tasks, and the integration of AI into CI/CD pipelines for faster development cycles, underscore the transformative potential of AI. However, the episode also acknowledges challenges in managing AI's role within engineering teams, ensuring human feedback remains integral to refining outputs, and maintaining a balance between innovation and practical constraints. Product development philosophies stress simplicity and user-driven refinement, while emphasizing the need for rigorous evaluation pipelines to identify and address pain points in AI systems.
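
One way to picture evals inside a CI/CD pipeline is a gate that fails the build when the average eval score regresses. The sketch below is an illustration only, not Braintrust's API or the workflow described in the episode: the function names, the baseline, and the tolerance value are all hypothetical.

```python
from statistics import mean


def eval_gate(scores: list[float], baseline: float, tolerance: float = 0.02) -> bool:
    """Return True when the mean eval score has not regressed past the tolerance.

    `baseline` and `tolerance` are hypothetical knobs chosen for this sketch.
    """
    if not scores:
        # An eval run that produced no results is treated as a failure, not a pass.
        return False
    return mean(scores) >= baseline - tolerance


def ci_exit_code(scores: list[float], baseline: float) -> int:
    """Map the gate result to a process exit code a CI job could return."""
    return 0 if eval_gate(scores, baseline) else 1
```

A CI step could pass the value of `ci_exit_code(...)` to `sys.exit` so that a regression blocks the merge, which keeps the "outcome definition" (the score threshold) separate from how the code was produced.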

What If

  • What if you built a custom AI agent to automate query optimization in your database?

    • Move: Create a coding agent using tools like Codex or GPT models to reproduce slow user queries, test column store formats, and benchmark indexing strategies.
    • Why Now?: The episode highlights the need for rigorous experimentation with open-source column stores and the limitations of existing solutions, both of which align with your current workload of optimizing slow queries.
    • Expected Upside: Reduced manual testing time, faster identification of optimal database configurations, and improved system performance without invasive infrastructure changes.
  • What if you designed a scoring function to evaluate AI-generated code quality in real time?

    • Move: Define success criteria (e.g., code conciseness, adherence to language constraints) and build an eval pipeline using tools like MCP servers to score AI outputs against these metrics.
    • Why Now?: The debate around AI's reliability in technical tasks and the need for quantifiable success metrics (discussed in both AI evaluations and product development practices) make this a pressing need.
    • Expected Upside: Higher confidence in AI outputs, faster iteration on code generation tasks, and reduced risk of deploying subpar solutions due to automated validation.
  • What if you used foreground agent concurrency limits to prioritize high-impact tasks?

    • Move: Isolate agent workflows with tools like tmux, and delegate low-value tasks (e.g., port management, CI/CD pipeline monitoring) to custom background agents.
    • Why Now?: The episode emphasizes the "four-task limit" of human concurrency and the growing trend of building custom agents to handle repetitive work, freeing up time for deeper technical challenges.
    • Expected Upside: Improved focus on complex engineering decisions, reduced context-switching overhead, and alignment with workflows that prioritize "maker time" and flow state.
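
The scoring-function idea in the second "What if" can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the eval pipeline from the episode: the three checks (parses, stays under a line budget, imports only allowed modules) and the names `Score`, `score_code`, `max_lines`, and `allowed_modules` are hypothetical stand-ins for "code conciseness" and "adherence to language constraints".

```python
import ast
from dataclasses import dataclass


@dataclass
class Score:
    parses: bool
    concise: bool
    allowed_imports: bool

    @property
    def total(self) -> float:
        """Fraction of checks passed, between 0.0 and 1.0."""
        checks = [self.parses, self.concise, self.allowed_imports]
        return sum(checks) / len(checks)


def score_code(
    source: str,
    max_lines: int = 30,
    allowed_modules: frozenset = frozenset({"math", "json"}),
) -> Score:
    """Score a piece of generated Python source against three simple criteria."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Code that does not parse fails every check.
        return Score(False, False, False)

    # Conciseness: count non-blank lines against a hypothetical budget.
    line_count = sum(1 for line in source.splitlines() if line.strip())

    # Constraint adherence: collect the top-level module of every import.
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    return Score(True, line_count <= max_lines, imported <= allowed_modules)
```

Static checks like these only cover part of "quality". An eval that also runs the generated code against test cases would be a natural next criterion.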

Takeaway

  • Delegate repetitive technical tasks to AI agents: Use AI-powered coding agents (e.g., Codex, GPT models) to handle routine engineering work like query optimization, database indexing, and testing edge cases, freeing time for strategic decision-making and innovation.
  • Implement rigorous "hard evals" for AI-generated code: Define clear success criteria and quantitative metrics for evaluating AI outputs (e.g., latency improvements, code correctness) to ensure reliability and performance alignment with technical goals.
  • Automate query optimization using AI experimentation: Replicate slow user queries, test database optimization strategies (e.g., column store formats, indexing techniques), and validate results with production-like data to iteratively improve system performance.
  • Invest in cloud-based development environments: Prioritize cloud infrastructure for data-intensive tasks (e.g., large-scale testing, EC2/S3 latency analysis) to avoid local machine limitations, and use tools like Portless to simplify local port management.
  • Adopt "maker time" prioritization: Block time for deep focus on engineering challenges by limiting meetings (e.g., no meetings after 12:00 PM) and delegating routine tasks to AI agents to maintain productivity and flow state.
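
The query-optimization loop in the takeaways (reproduce a slow query, try a strategy, validate the result) might look something like the sketch below. It uses Python's built-in `sqlite3` with an in-memory table purely as a stand-in. The episode discusses column stores, which SQLite is not, and the table, query, and index names here are invented for illustration.

```python
import sqlite3
import time

# Hypothetical "slow user query" reproduced for the experiment.
SLOW_QUERY = "SELECT COUNT(*), SUM(amount) FROM events WHERE user_id = ?"


def build_db(rows: int = 50_000) -> sqlite3.Connection:
    """Create an in-memory table with synthetic, production-like rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO events (user_id, amount) VALUES (?, ?)",
        ((i % 1000, float(i)) for i in range(rows)),
    )
    conn.commit()
    return conn


def run_timed(conn: sqlite3.Connection, param: int, repeats: int = 20):
    """Run the query several times; return the last result and mean seconds per run."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = conn.execute(SLOW_QUERY, (param,)).fetchone()
    elapsed = (time.perf_counter() - start) / repeats
    return result, elapsed


def query_plan(conn: sqlite3.Connection, param: int) -> str:
    """Return SQLite's plan description for the query (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + SLOW_QUERY, (param,)).fetchall()
    return " ".join(row[-1] for row in rows)


def experiment(param: int = 42) -> dict:
    """Measure the query before and after adding an index, and check correctness."""
    conn = build_db()
    before_result, before_s = run_timed(conn, param)
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    after_result, after_s = run_timed(conn, param)
    return {
        "same_result": before_result == after_result,  # correctness check
        "before_s": before_s,
        "after_s": after_s,
        "plan_after": query_plan(conn, param),
    }
```

Checking `same_result` alongside the timings matters: a faster query that returns different rows is a regression, not an optimization. An agent running many such experiments would need both signals recorded for every candidate strategy.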

Recent Episodes of How I AI

9 Jun 2026 Claude Fable 5 review: what the new Mythos model gets right (and very wrong)

Anthropic's Claude Fable 5 excels in long-term technical tasks with strong coding, vision, and async workflow capabilities but faces high token costs, design limitations, and restricted use in cybersecurity/biology, making it suitable for precise, extended projects rather than creative or agile workflows.

28 May 2026 Claude Opus 4.8 is here. Is it as good as they say?

Anthropic's Opus 4.8 model improves honesty and efficiency with reduced hallucinations but struggles with contextual coding, complex strategic analysis, and depth in agentic tasks, excelling in simple prototypes yet falling short in nuanced, long-horizon applications.

27 May 2026 The Codex feature that works while you sleep

Goals in Codex leverages AI to autonomously execute complex tasks through goal-based workflows, emphasizing clarity and validation for improved code quality and efficiency, though it struggles with simple edits and may shift developers toward oversight roles.
