How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

Published 15 Jun 2026

Show Notes: podcasters.spotify.com/pod/show/pen-name/episodes/How-Braintrust-uses-AI-agents--evals--and-CI-to-ship-better-software--Ankur-Goyal-e3kh936

Duration: 00:40:11

AI integration in software engineering enables agents to handle complex tasks through benchmarking and optimization, shifts engineers toward higher-level work, and addresses challenges like reliability, data parsing, and balancing automation with human expertise while emphasizing outcome-focused systems over procedural methods.

Episode Description

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe...

Overview

The podcast discusses the evolving role of AI in software engineering, focusing on its capacity to handle complex tasks through rigorous benchmarking and algorithm experimentation. Central themes include the advancement of AI agents in automating technical responsibilities, the development of "hard evals" to test AI performance in code generation, and the use of coding agents to optimize database queries and improve indexing techniques. Engineers are increasingly delegating repetitive or intricate tasks to AI, allowing them to focus on higher-level challenges, though concerns persist about AI potentially replacing human expertise. The discussion highlights the need for accurate knowledge bases and the challenges of parsing large datasets, as well as the limitations of existing column store implementations in databases.

Key debates revolve around balancing AI augmentation with human oversight, the practical versus theoretical quality of AI outputs, and the risks of over-reliance on automated systems. The importance of rigorous problem-solving is emphasized, from addressing technical debt to streamlining workflows with tools like cloud-based environments and port management solutions. AI's role in long-term experimentation, such as scaling infrastructure tests or optimizing column store performance, is contrasted with human limitations in sustained focus and attention. The podcast also critiques traditional benchmarking practices, advocating for continuous experimentation and quantifiable success metrics to refine AI-driven solutions.

Technical workflows for agents, including the use of scoring functions and safety measures in AI environments, are explored, alongside conceptual shifts in programming that prioritize outcome definition over procedural logic. Discussions on agentic commerce, where autonomous systems handle commercial tasks, and the integration of AI into CI/CD pipelines for faster development cycles, underscore the transformative potential of AI. However, the content also acknowledges challenges in managing AI's role within engineering teams, ensuring human feedback remains integral to refining outputs, and maintaining a balance between innovation and practical constraints. Product development philosophies stress simplicity and user-driven refinement, while emphasizing the need for rigorous evaluation pipelines to identify and address pain points in AI systems.
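
To make "outcome definition over procedural logic" concrete, here is a minimal Python sketch of an eval gate that a CI job could run: the desired outcomes are written as cases, a scorer grades each output, and the build passes only if the mean score clears a threshold. Everything named here (`EvalCase`, `run_evals`, `gate`, `toy_task`, the 0.6 threshold) is a hypothetical illustration and not Braintrust's API or the workflow described in the episode.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One desired outcome: an input and the output we expect for it."""
    name: str
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on an exact (whitespace-trimmed) match."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_evals(
    task: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float],
) -> dict[str, float]:
    """Run the task on every case and return a score per case name."""
    return {case.name: scorer(task(case.input), case.expected) for case in cases}


def gate(scores: dict[str, float], threshold: float) -> bool:
    """Return True when the mean score meets the threshold."""
    if not scores:
        raise ValueError("no eval cases were scored")
    return sum(scores.values()) / len(scores) >= threshold


def toy_task(prompt: str) -> str:
    """Hypothetical stand-in for a call to an agent or model."""
    return prompt.upper()


CASES = [
    EvalCase("greeting", "hello", "HELLO"),
    EvalCase("mixed_case", "Abc", "ABC"),
    EvalCase("known_failure", "x", "y"),
]


def main() -> int:
    """Exit code for CI: 0 when the gate passes, 1 otherwise."""
    scores = run_evals(toy_task, CASES, exact_match)
    return 0 if gate(scores, threshold=0.6) else 1
```

In a pipeline, a step would call `main()` and use its return value as the process exit code, so a regression in eval scores blocks the merge the same way a failing unit test would.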

What If

What if you built a custom AI agent to automate query optimization in your database?
- Move: Create a coding agent using tools like Codex or GPT models to reproduce slow user queries, test column store formats, and benchmark indexing strategies.
- Why Now?: The text highlights the need for rigorous experimentation with open-source column stores and the limitations of existing solutions, which align with your current workload of optimizing slow queries.
- Expected Upside: Reduced manual testing time, faster identification of optimal database configurations, and improved system performance without invasive infrastructure changes.

What if you designed a scoring function to evaluate AI-generated code quality in real time?
- Move: Define success criteria (e.g., code conciseness, adherence to language constraints) and build an eval pipeline using tools like MCP servers to score AI outputs against these metrics.
- Why Now?: The debate around AI's reliability in technical tasks and the need for quantifiable success metrics (discussed in both AI evaluations and product development practices) make this a pressing need.
- Expected Upside: Higher confidence in AI outputs, faster iteration on code generation tasks, and reduced risk of deploying subpar solutions due to automated validation.

What if you used foreground agent concurrency limits to prioritize high-impact tasks?
- Move: Isolate agent workflows with tools like tmux, and delegate low-value tasks (e.g., port management, CI/CD pipeline monitoring) to custom background agents.
- Why Now?: The text emphasizes the "four-task limit" of human concurrency and the growing trend of building custom agents to handle repetitive work, freeing up time for deeper technical challenges.
- Expected Upside: Improved focus on complex engineering decisions, reduced context-switching overhead, and alignment with workflows that prioritize "maker time" and flow state.
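
The scoring-function idea above can be sketched in a few lines of Python. This is an illustrative sketch only: the three criteria (parses, stays under a line budget, avoids banned imports), the equal weighting, and the names `score_code`, `max_lines`, and `banned_modules` are assumptions made for the example, not criteria taken from the episode.

```python
import ast


def score_code(
    source: str,
    max_lines: int = 20,
    banned_modules: frozenset[str] = frozenset({"os", "subprocess"}),
) -> dict[str, float]:
    """Score generated Python source on three hypothetical criteria.

    Each criterion is in [0, 1]; "total" is their unweighted mean.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Code that does not parse fails every criterion.
        return {"parses": 0.0, "concise": 0.0, "allowed_imports": 0.0, "total": 0.0}

    # Conciseness: count non-blank, non-comment lines against the budget.
    code_lines = [
        line for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    if len(code_lines) <= max_lines:
        concise = 1.0
    else:
        concise = max_lines / len(code_lines)

    # Language constraints: collect top-level module names from import statements.
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    allowed = 0.0 if imported & banned_modules else 1.0

    total = (1.0 + concise + allowed) / 3
    return {"parses": 1.0, "concise": concise, "allowed_imports": allowed, "total": total}
```

A static check like this is cheap enough to run on every generation. Criteria that need execution, such as correctness against test cases or latency, would sit alongside it as additional scorers.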

Takeaway

- Delegate repetitive technical tasks to AI agents: Use AI-powered coding agents (e.g., Codex, GPT models) to handle routine engineering work like query optimization, database indexing, and testing edge cases, freeing time for strategic decision-making and innovation.
- Implement rigorous "hard evals" for AI-generated code: Define clear success criteria and quantitative metrics for evaluating AI outputs (e.g., latency improvements, code correctness) to ensure reliability and performance alignment with technical goals.
- Automate query optimization using AI experimentation: Replicate slow user queries, test database optimization strategies (e.g., column store formats, indexing techniques), and validate results with production-like data to iteratively improve system performance.
- Invest in cloud-based development environments: Prioritize cloud infrastructure for data-intensive tasks (e.g., large-scale testing, EC2/S3 latency analysis) to avoid local machine limitations, and use tools like Portless to simplify local port management.
- Adopt "maker time" prioritization: Block time for deep focus on engineering challenges by limiting meetings (e.g., no meetings after 12:00 PM) and delegating routine tasks to AI agents to maintain productivity and flow state.
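
As a rough illustration of the "replicate a slow query, test a strategy, validate the result" loop, the Python sketch below times one query with and without an index and records the query plan each time. It uses SQLite from the standard library purely as a stand-in: SQLite is a row store, not one of the column stores discussed in the episode, and the table, query, and function names are all hypothetical.

```python
import sqlite3
import time

# Hypothetical "slow user query" to reproduce.
QUERY = "SELECT COUNT(*) FROM events WHERE user_id = ?"


def build_db(n_rows: int = 50_000) -> sqlite3.Connection:
    """Create an in-memory table with synthetic, production-like rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO events (user_id, payload) VALUES (?, ?)",
        ((i % 1000, f"event-{i}") for i in range(n_rows)),
    )
    conn.commit()
    return conn


def time_query(conn: sqlite3.Connection, user_id: int, repeats: int = 20) -> tuple[int, float]:
    """Return the query result and the mean wall-clock seconds per run."""
    start = time.perf_counter()
    for _ in range(repeats):
        (count,) = conn.execute(QUERY, (user_id,)).fetchone()
    return count, (time.perf_counter() - start) / repeats


def query_plan(conn: sqlite3.Connection, user_id: int) -> str:
    """Return SQLite's plan description (the last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (user_id,)).fetchall()
    return " | ".join(row[-1] for row in rows)


def compare_strategies(n_rows: int = 50_000, user_id: int = 42) -> dict[str, dict]:
    """Measure the same query before and after adding an index."""
    conn = build_db(n_rows)
    results: dict[str, dict] = {}

    count, seconds = time_query(conn, user_id)
    results["no_index"] = {"count": count, "seconds": seconds, "plan": query_plan(conn, user_id)}

    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    count, seconds = time_query(conn, user_id)
    results["indexed"] = {"count": count, "seconds": seconds, "plan": query_plan(conn, user_id)}

    conn.close()
    return results
```

Checking that both strategies return the same count guards against an optimization that is fast but wrong, and the recorded plan shows why the timing changed. An agent running this kind of loop would swap in further strategies and keep whichever wins on the measured metric.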

Recent Episodes of How I AI

22 Jul 2026 Computer & browser use in Codex (5 real examples)

"AI boosts productivity by automating browser and computer tasks, reducing manual effort through commands like `@browser` or `@computer`, with applications in software testing, personal tasks, and workflow optimization, though human oversight and speed limitations may apply."

20 Jul 2026 How the founder of Morning Brew built a Claude content machine that never runs out of ideas and never sounds like slop | Alex Lieberman

"AI-driven content creation balances efficiency and quality through structured workflows, human oversight, and scalable systems like the 'content machine,' optimizing idea generation, research, and drafting while ensuring authenticity."

13 Jul 2026 This solo builder runs 24/7 local AI on his own hardware | Alex Finn

"Running AI locally offers cost savings, unlimited usage, and privacy, with high-performance hardware enabling automation for tasks like security scans and code reviews, while balancing cloud integration and future AI-driven workflows."

8 Jul 2026 What a harness is and how to build one with Claude Agent SDK

"Harnesses are structured frameworks that enhance AI agent effectiveness by integrating tailored tools, workflows, and constraints for specific tasks like debugging or support, improving efficiency and control over outcomes."

6 Jul 2026 How I run autonomous coding agents from my phone with OpenAI Symphony + Linear | Alessio Fanelli (Kernel Labs)

"AI automates small business tasks like inventory tracking and order management via tools such as 'magic glasses,' explores personal AI use cases (e.g., Codex for hobby tasks), delves into autonomous agent orchestration with cloud-based workflows and GitHub, addresses challenges like scalability and model behavior, and reflects on AI's potential to bridge physical-digital systems, reduce manual effort, and enhance productivity while highlighting underutilized automation opportunities."
