More CoRecursive: Coding Stories episodes

The Bitter Lesson: The history of reinforcement learning

Published 13 Jun 2026

Duration: 01:00:01

The discussion critiques reward-driven models of intelligence by contrasting behaviorist roots with modern AI advancements like neural networks and self-play, examining historical cases such as TD-Gammon and AlphaGo, while highlighting the limitations of reward frameworks in capturing autonomy and the shift toward data-driven learning over human-encoded rules.

Episode Description

I've been trying to understand how machine learning actually works. Not use it, understand it, down to the ifs and loops. How does a program built out...

Overview

The podcast explores the intersection of reinforcement learning and behaviorist principles, examining how intelligence might be framed through reward maximization. It critiques the reductionist view presented in the 2021 paper "Reward is Enough," which argues that intelligence can be reduced to maximizing rewards, akin to operant conditioning. This perspective is set against its historical influences, such as B.F. Skinner's behaviorism and its applications, including his wartime use of trained pigeons for guidance systems. The discussion questions whether such a reductionist approach overlooks the complexity of intelligence, citing examples like the failed use of pigeons in quality assurance at Eli Lilly and the limitations of reward-driven systems in real-world contexts.

The conversation delves into the evolution of reinforcement learning, from Richard Sutton's foundational 1989 work on temporal difference learning (demonstrated through tic-tac-toe) to its application in complex domains like backgammon, where neural networks and self-play enabled AI to outperform humans with unconventional strategies. Despite early successes, such as the 1992 TD-Gammon project, these innovations were initially overlooked, overshadowed by the prominence of systems like Deep Blue. The podcast highlights how modern AI, including AlphaGo and AlphaZero, shifts from rule-based approaches to self-play and reward-driven learning, critiquing traditional AI methods that relied on human-engineered rules. Sutton's "bitter lesson" emphasizes the superiority of computationally intensive, data-driven models over human-designed systems, while also raising philosophical questions about whether reward frameworks can fully capture the complexity of human behavior or creativity.
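
To make the temporal difference idea concrete, here is a minimal sketch of tabular TD(0) learning on tic-tac-toe through self-play. It is an illustration written for this summary, not code from the episode or from Sutton's work. The function names (`winner`, `td_update`, `after_move`, `train`), the default value of 0.5 for unseen positions, and the step size and exploration rate are all assumptions. Unlike some textbook treatments, this version also updates after exploratory moves.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
OUTCOME_VALUE = {"X": 1.0, "O": 0.0, "draw": 0.5}  # value from X's point of view


def winner(board):
    """Return 'X' or 'O' for a completed line, 'draw' for a full board, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return "draw" if " " not in board else None


def td_update(values, state, next_value, alpha):
    """TD(0): move V(state) a fraction alpha toward the value of the next state."""
    v = values.get(state, 0.5)
    values[state] = v + alpha * (next_value - v)


def after_move(board, move, player):
    """Return the board string that results from `player` taking `move`."""
    nxt = board[:]
    nxt[move] = player
    return "".join(nxt)


def train(episodes=20000, alpha=0.1, epsilon=0.1, seed=0):
    """Self-play: both sides share one value table (estimated chance that X wins)."""
    rng = random.Random(seed)
    values = {}
    for _ in range(episodes):
        board, player, prev_state = [" "] * 9, "X", None
        while True:
            moves = [i for i, cell in enumerate(board) if cell == " "]
            if rng.random() < epsilon:
                move = rng.choice(moves)  # occasional exploratory move
            else:
                # X seeks the highest-valued resulting position, O the lowest.
                pick = max if player == "X" else min
                move = pick(moves, key=lambda m: values.get(
                    after_move(board, m, player), 0.5))
            board[move] = player
            state = "".join(board)
            result = winner(board)
            if result is not None:
                values[state] = OUTCOME_VALUE[result]  # terminal value is known
            if prev_state is not None:
                # Back up the newer estimate into the previous position.
                td_update(values, prev_state, values.get(state, 0.5), alpha)
            if result is not None:
                break
            prev_state, player = state, ("O" if player == "X" else "X")
    return values
```

Calling `train()` returns a dictionary from board strings to value estimates. The table holds one entry per visited position, which is workable for tic-tac-toe but not for backgammon, and that gap is what motivates replacing the table with a neural network.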

Finally, the podcast addresses challenges in applying reinforcement learning to tasks beyond games, such as language models, and debates whether reward-based systems can truly replicate autonomy or remain limited to mimicking human knowledge. It underscores the tension between computational dominance in defined benchmarks and the need for human adaptation to maintain relevance in an AI-dominated future. The discussion also touches on historical parallels between Skinner's behaviorist experiments and modern AI, suggesting a shared reductionist undercurrent while questioning the long-term trajectory toward general artificial intelligence.

What If

  • What if you applied self-play learning to a niche business problem without rule-based constraints?

    • Move: Implement a self-play loop for a simplified version of your product or service, using reinforcement learning to optimize key metrics (e.g., user retention or conversion rates).
    • Why Now?: Modern tools like TensorFlow or PyTorch enable lightweight simulation environments, and your domain-specific data can serve as the "reward signal" for training.
    • Expected Upside: Discover unconventional strategies that outperform manual rule-based systems, similar to how TD-Gammon found novel backgammon tactics beyond human expertise.
  • What if you built a reward-driven AI agent to automate customer service without predefined scripts?

    • Move: Design a reward system where the agent maximizes user satisfaction scores (e.g., through sentiment analysis) instead of following rigid templates.
    • Why Now?: Platforms like Dialogflow or Hugging Face allow rapid prototyping, and user feedback loops (e.g., ratings or support tickets) provide real-time "rewards" for training.
    • Expected Upside: Create an adaptive support system that evolves with user needs, mirroring AlphaGo's ability to generate novel strategies through self-play.
  • What if you used temporal difference learning to predict business outcomes in real time?

    • Move: Train a neural network on historical data to approximate future states (e.g., sales forecasts or stock prices) by propagating observed outcomes back to earlier predictions, akin to TD-Gammon's value estimation.
    • Why Now?: Publicly available datasets and cloud-based GPU access (e.g., AWS or Google Colab) make high-dimensional pattern recognition scalable for small teams.
    • Expected Upside: Improve predictive accuracy by leveraging game-like reinforcement learning principles, reducing reliance on static models that fail in complex, dynamic markets.
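
The reward loop in the customer service scenario can be sketched in its simplest form as an epsilon-greedy multi-armed bandit. This is a deliberately reduced stand-in, not full reinforcement learning and not something described in the episode: there is no state and no self-play, only a choice among response styles and a rating that comes back. The style names, the simulated satisfaction probabilities, and the function `run_bandit` are hypothetical.

```python
import random


def run_bandit(true_satisfaction, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: learn which response style earns the best ratings.

    `true_satisfaction` maps a hypothetical response style to the hidden
    probability that a simulated user rates it positively.
    """
    rng = random.Random(seed)
    styles = list(true_satisfaction)
    estimates = {s: 0.0 for s in styles}  # running mean reward per style
    counts = {s: 0 for s in styles}       # how often each style was used
    for _ in range(steps):
        if rng.random() < epsilon:
            style = rng.choice(styles)               # explore
        else:
            style = max(styles, key=estimates.get)   # exploit best estimate
        # Simulated user feedback: 1.0 for a positive rating, else 0.0.
        reward = 1.0 if rng.random() < true_satisfaction[style] else 0.0
        counts[style] += 1
        # Incremental mean update, so no reward history has to be stored.
        estimates[style] += (reward - estimates[style]) / counts[style]
    return estimates, counts
```

For example, `run_bandit({"formal": 0.2, "friendly": 0.8, "brief": 0.5})` should end up using the "friendly" style most often. In a real deployment the simulated rating would be replaced by actual user feedback, which is noisier and slower to arrive than this sketch assumes.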

Takeaway

  • Leverage Neural Networks for State Approximation: Use neural networks to generalize game states instead of exhaustively mapping all possibilities, reducing memory and computational overhead (e.g., TD-Gammon's approach to backgammon with pattern recognition).
  • Implement Self-Play Learning Without Human Data: Train AI models via self-play (e.g., AlphaZero) to discover novel strategies autonomously, avoiding reliance on human-encoded rules or pre-existing datasets.
  • Apply Reward-Based Learning with Temporal Difference Methods: Use reward signals (e.g., win/loss outcomes) and temporal difference (TD) learning to update value estimates in environments like games or tasks, as demonstrated in Sutton's tic-tac-toe and backgammon examples.
  • Optimize Input Representation for Simplicity: Simplify high-dimensional data (e.g., pixel data in Atari games) by reducing resolution or using grayscale inputs, enabling efficient processing and faster learning through trial and error.
  • Prioritize Computational Resources for Iterative Training: Invest in scalable computing infrastructure (e.g., cloud GPUs) to run large-scale simulations and self-play iterations, essential for training complex models like AlphaGo or deep reinforcement learning agents.
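
The input simplification takeaway can be illustrated with a small preprocessing sketch: convert an RGB frame to grayscale, then shrink it by averaging blocks of pixels. This is an assumed, dependency-free illustration rather than the pipeline used by any system named above. The function names are hypothetical, the frame is a plain list of rows of `(r, g, b)` tuples, and the luminance weights are the common ITU-R BT.601 coefficients.

```python
def to_grayscale(frame):
    """Convert rows of (r, g, b) pixels to luminance (ITU-R BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in frame]


def downsample(gray, factor):
    """Average non-overlapping factor x factor blocks of a grayscale image."""
    height, width = len(gray), len(gray[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    return [
        [
            sum(gray[y + dy][x + dx]
                for dy in range(factor)
                for dx in range(factor)) / factor ** 2
            for x in range(0, width, factor)
        ]
        for y in range(0, height, factor)
    ]


def preprocess(frame, factor=2):
    """Grayscale, then downsample: 3 channels become 1, pixels drop by factor**2."""
    return downsample(to_grayscale(frame), factor)
```

With `factor=2`, a frame carries one twelfth of the original numbers (one channel instead of three, one pixel per four), so each learning step has far less input to process. A production system would more likely use an array library for speed; plain lists are used here only to keep the sketch self-contained.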

Recent Episodes of CoRecursive: Coding Stories

9 May 2026 The Pre-Training Wall and the Treadmill After It

The evolution of large language models from early tools to advanced systems like "Spud" is examined, critiquing computational scaling's sustainability, exploring open-source vs. corporate control, and addressing challenges in pre-training limitations, synthetic data reliance, and AI profitability in a rapidly advancing industry.

2 Apr 2026 Story: The Aging Programmer

Aging in software development faces stereotypes about relevance, physical/mental changes, workplace ageism, and legacy system reliance, but offers opportunities for growth, adaptability, and meaningful contributions through inclusive practices, assistive tech, documentation, and proactive engagement.

4 Feb 2026 Notes: The Universal Paperclip Clicker

Feeling overwhelmed by the pressure to constantly boost productivity using AI coding agents, a creative struggles with the unsustainable pace and the blurring line between work and personal life.
