More MLOps.community episodes


It's 2026, and We're Still Talking Evals

Published 21 Apr 2026

Duration: 00:40:56

Evaluations in AI product development must be integrated early, address real-world complexities, use nuanced metrics beyond accuracy, employ user-centric and iterative testing, leverage post-deployment data, and follow tailored strategies that balance quality, domain-specific metrics, and system reliability.

Episode Description

Maggie Konstanty is an AI Product Manager at Prosus, one of the world's largest consumer internet companies, where she builds and evaluates AI agents...

Overview

The discussion emphasizes the critical role of evaluations (evals) in AI product development, advocating for their integration from the early stages of ideation to ensure quality before deployment. Teams often delay setting up evals, leading to inefficiencies and confusion. Pre-production evaluations typically rely on simulated tasks, but real-world scenarios, such as unexpected user queries, require adaptive methods. Post-deployment evaluations must shift from static tests to dynamic, user-generated data and metrics. Challenges include non-deterministic failure modes in large language models (LLMs), where systems may perform well repeatedly before failing unexpectedly, necessitating nuanced strategies. User-centric approaches involve simulating diverse personas to model real-world interactions, while traditional accuracy metrics are criticized for oversimplification. Metrics such as true negative rate (TNR) and true positive rate (TPR) are preferred for comparing AI responses to human-labeled outcomes. Current practices face limitations, such as LLMs losing coherence or task alignment over time, and the risks of evaluating LLMs with other LLMs. Balancing feature prioritization with iterative testing and failure mode analysis is highlighted as crucial for refining AI agents post-deployment.
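As a concrete illustration of comparing AI judgments against human-labeled outcomes with TPR and TNR, here is a minimal Python sketch; the record fields and sample data are invented for illustration, not taken from the episode.

```python
# Minimal sketch: TPR (sensitivity) and TNR (specificity) over an eval
# set where each AI judgment is compared to a human-assigned label.
# Field names and sample data below are illustrative assumptions.

def tpr_tnr(records):
    """records: dicts with boolean 'human_label' (ground truth) and
    'ai_label' (the system's judgment)."""
    tp = sum(r["human_label"] and r["ai_label"] for r in records)
    fn = sum(r["human_label"] and not r["ai_label"] for r in records)
    tn = sum(not r["human_label"] and not r["ai_label"] for r in records)
    fp = sum(not r["human_label"] and r["ai_label"] for r in records)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # how often real issues are caught
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # how often clean cases pass
    return tpr, tnr

evals = [
    {"human_label": True,  "ai_label": True},
    {"human_label": True,  "ai_label": False},
    {"human_label": False, "ai_label": False},
    {"human_label": False, "ai_label": False},
]
print(tpr_tnr(evals))  # (0.5, 1.0)
```

Reporting the two rates separately, rather than a single accuracy number, shows whether a system is missing real failures or flagging clean cases, which is exactly the nuance the discussion says accuracy hides.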

Key challenges include the ambiguity of accuracy metrics in real-world applications and the underutilization of error analysis for edge cases. Scenario-based testing, involving predefined interactions and personas, is stressed as a method to measure consistency and identify performance gaps. Pre-release evaluations focus on internal simulations, while post-release analyses rely on user data to uncover hidden failures. Iterative testing, such as A/B testing, allows for rapid adjustments based on feedback. Unconventional strategies, like stress-testing with extreme scenarios, help reveal hidden flaws. Continuous monitoring is essential, as real-world unpredictability demands ongoing adaptation. Evaluations are also framed as iterative processes, not one-time tasks, requiring consistent refinement. The discussion underscores the need for tailored evaluation frameworks that align with specific use cases, user needs, and domain-specific metrics (e.g., conversion rates for food ordering vs. satisfaction metrics in automotive services).
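To make the scenario-and-persona idea concrete, here is a minimal sketch that replays each predefined scenario several times and records a pass rate, since a single successful run proves little for a non-deterministic system. The personas, the `run_agent` stub, and the substring-based pass criterion are all placeholder assumptions, not an implementation from the episode.

```python
# Sketch: replay each persona/scenario N times and measure consistency.
# run_agent, the scenarios, and the pass check are placeholders.

SCENARIOS = [
    {"persona": "impatient first-time user",
     "prompt": "just order me food, fast",
     "must_contain": "order"},
    {"persona": "detail-oriented user",
     "prompt": "compare delivery fees for my two usual restaurants",
     "must_contain": "fee"},
]

def run_agent(prompt: str) -> str:
    # Placeholder: swap in the real call to your agent.
    return f"Sure, let me handle that: {prompt}"

def consistency(scenario: dict, trials: int = 10) -> float:
    passes = sum(
        scenario["must_contain"] in run_agent(scenario["prompt"]).lower()
        for _ in range(trials)
    )
    return passes / trials  # <1.0 flags flaky, non-deterministic behavior

for s in SCENARIOS:
    print(f'{s["persona"]}: {consistency(s):.0%}')
```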

The conversation also highlights the cultural and practical challenges of implementing evaluative rigor, such as perceiving error analysis as tedious or resource-intensive, and the tendency to skip it in favor of more immediately gratifying development tasks. Evaluators must be closely aligned with business goals to avoid redundancy, and existing tools often lack support for multi-turn conversations, large datasets, or efficient data export. Custom tools are emphasized for creating pipelines and evaluators that address specific failure modes and regressions. Internal development of evaluators is preferred to maintain control over data privacy and alignment with team-specific needs. Ultimately, the focus remains on defining clear success metrics, understanding user intent ambiguity, and fostering team alignment on evaluation priorities as the foundation for reliable AI systems.
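As one hedged sketch of what such internally built tooling can look like, the snippet below registers small, team-owned evaluators and runs them over an exported multi-turn conversation. The failure mode, data shape, and names are illustrative assumptions rather than the setup described in the episode.

```python
# Sketch of an in-house evaluator pipeline: a registry of small checks
# targeting known failure modes, run over exported conversations.
# Data shape and the evaluator itself are illustrative assumptions.

from typing import Callable

EVALUATORS: dict[str, Callable[[list[dict]], bool]] = {}

def evaluator(name: str):
    def register(fn):
        EVALUATORS[name] = fn
        return fn
    return register

@evaluator("stays_on_task")
def stays_on_task(turns: list[dict]) -> bool:
    # Crude proxy for losing task alignment over a long conversation:
    # do late assistant replies still mention the user's opening goal?
    goal_words = [w for w in turns[0]["content"].lower().split() if len(w) > 3][:3]
    late = [t["content"].lower() for t in turns[-3:] if t["role"] == "assistant"]
    return any(w in reply for reply in late for w in goal_words)

def run_pipeline(conversation: list[dict]) -> dict[str, bool]:
    return {name: fn(conversation) for name, fn in EVALUATORS.items()}

conversation = [
    {"role": "user", "content": "book me a table for two tonight"},
    {"role": "assistant", "content": "Searching for nearby restaurants."},
    {"role": "assistant", "content": "Found a table for two at 7pm tonight."},
]
print(run_pipeline(conversation))  # {'stays_on_task': True}
```

Because each check is a plain function owned by the team, new failure modes discovered in production can be pinned down as regression checks without exporting data to a third-party tool.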

Recent Episodes of MLOps.community

17 Apr 2026 Why Agents are Driving Software Development to the Cloud

The text promotes transitioning from isolated AI agents to cloud-native platforms that treat agents as autonomous team members with defined roles, emphasizing structured governance, transparency, and natural language interaction to streamline collaboration and workflows like code review and data analysis.

14 Apr 2026 The Modern Software Engineer

Recommended: A thoughtful overview of AI's impact, particularly on learning and skill acquisition.

AI transforms learning and workflows through tools like Claude, accelerating skill acquisition and bridging knowledge gaps, while raising concerns about job obsolescence, ethical dilemmas, and the need for human oversight, standardized practices, and collaborative approaches in an era of rapid tech advancement.

10 Apr 2026 How We Cut LLM Latency 70% With TensorRT in Production

The episode covers optimizing AI systems with TensorRT-LLM, efficient GPU use, cold-start management with AWS FSx, and model quantization, while addressing the challenges of in-house development, scaling strategies, hidden scaling complexities (the "AI iceberg"), and balancing technical efficiency with organizational alignment through frameworks like Flywheel and responsible AI practices.
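The episode's optimizations are built on TensorRT-LLM; as a library-agnostic illustration of the quantization idea alone, here is a PyTorch dynamic-quantization sketch, which is not the pipeline discussed in the episode.

```python
# Generic illustration of weight quantization (not TensorRT-LLM):
# PyTorch dynamic quantization stores Linear weights as int8, trading
# a little accuracy for lower memory use and CPU inference latency.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```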

7 Apr 2026 Getting Humans Out of the Way: How to Work with Teams of Agents

Recommended: An optimistic view of using Agentic AI with safeguards.

AI agents streamline software development through tools like pixel diff analysis, automated reporting, and annotated walkthroughs, addressing challenges in accuracy, code quality, and workflow adaptation while redefining human roles as validation overseers and collaborators in autonomous systems.
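For readers unfamiliar with pixel diff analysis, a minimal Pillow-based sketch of the idea follows; the file names and review threshold are placeholders, and the two images are assumed to share dimensions.

```python
# Minimal pixel-diff sketch: how many pixels changed between a baseline
# screenshot and the current one? Paths and threshold are placeholders.

from PIL import Image, ImageChops

def pixel_diff_ratio(baseline_path: str, current_path: str) -> float:
    a = Image.open(baseline_path).convert("RGB")
    b = Image.open(current_path).convert("RGB")  # assumed same size as a
    diff = ImageChops.difference(a, b)
    changed = sum(1 for px in diff.getdata() if px != (0, 0, 0))
    return changed / (diff.width * diff.height)

# e.g. route the change to human review if more than 1% of pixels moved:
# if pixel_diff_ratio("baseline.png", "current.png") > 0.01: ...
```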

3 Apr 2026 Fixing GPU Starvation in Large-Scale Distributed Training

Optimizing ML workflows requires addressing data bottlenecks through caching, efficient structuring, and hardware-aware strategies to reduce remote data calls, minimize GPU-CPU overhead, and prioritize infrastructure over model tuning, while managing trade-offs between training efficiency and serving latency.
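A small sketch of the caching idea in that summary, assuming a PyTorch-style training loop: fetch each remote sample once, serve it locally afterward, and let the DataLoader overlap host-side work with GPU compute. The names are illustrative.

```python
# Sketch: cache remote samples locally so each expensive fetch happens
# once. Names are illustrative; with num_workers > 0 each worker process
# keeps its own copy of this in-memory cache, so a shared disk cache is
# the usual choice at scale.

from torch.utils.data import Dataset, DataLoader

class CachedDataset(Dataset):
    def __init__(self, remote_dataset):
        self.remote = remote_dataset
        self.cache = {}  # index -> sample

    def __len__(self):
        return len(self.remote)

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.remote[idx]  # the expensive remote call
        return self.cache[idx]

# pin_memory + workers keep the GPU fed instead of waiting on I/O:
# loader = DataLoader(CachedDataset(remote_ds), batch_size=64,
#                     num_workers=4, pin_memory=True)
```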
