ESMFold2: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

Published 27 May 2026

Duration: 01:10:12

ESMC uses transformer-based models trained on 6.8 billion protein sequences to predict structures, design functional proteins, and uncover evolutionary patterns through scalable, data-driven approaches. The discussion also covers how to balance evolutionary constraints with interpretability, and the limits of current data diversity and model generalizability.

Episode Description

Editor's note: In our first BioHub pod with Priscilla and Mark, they discussed their acquisition of EvoScale, led by Alex Rives, who is now Head of Scie...

Overview

The podcast discusses advances in using AI to model and predict protein biology, focusing on the Evolutionary Scale Modeling (ESM) family, including ESM Cambrian (ESMC), and on transformer-based language models trained on vast protein sequence databases. These models aim to predict evolutionary patterns, generate functional proteins, and design therapeutic molecules such as antibodies (e.g., single-chain variable fragments, scFvs) by searching a predictive model for sequences that meet specified criteria. The approach uses a "world modeling" framework that integrates structure prediction, sequence analysis, and mechanistic interpretability, and the latest "Cambrian" model is open-sourced under an MIT license. Key challenges include reconciling the complexity of protein folding with machine learning scalability, data biases that favor disease-related proteins, and the need for diverse training data, such as metagenomic sequences from environmental sources, to capture evolutionary and functional diversity across billions of protein sequences.
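
Designing a protein by searching a predictive model, as described above, can be sketched as a simple hill climb over sequences. This is a minimal sketch under loud assumptions: `toy_score` is a hypothetical stand-in for a real model's score (it only counts matches to a made-up motif), and neither it nor `hill_climb` is part of ESMC or any real library.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(seq: str, motif: str = "MKTAY") -> int:
    """Hypothetical stand-in for a predictive model.

    Counts positions that match a repeated, made-up motif. A real
    pipeline would call a trained model here instead.
    """
    return sum(1 for i, aa in enumerate(seq) if aa == motif[i % len(motif)])


def hill_climb(length: int, steps: int, seed: int = 0, score=toy_score):
    """Mutate one random position per step; keep the change if the score does not drop."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    best = score("".join(seq))
    for _ in range(steps):
        pos = rng.randrange(length)
        previous = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        candidate = score("".join(seq))
        if candidate >= best:
            best = candidate      # accept improving or neutral mutations
        else:
            seq[pos] = previous   # revert a harmful mutation
    return "".join(seq), best


designed, final_score = hill_climb(length=10, steps=2000)
print(designed, final_score)
```

Swapping `toy_score` for a learned predictor is the only change needed to turn the toy into a real search loop, although practical systems typically use more sample-efficient search than single-site hill climbing.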

The models leverage scaling laws to improve performance as data and parameter size grow, revealing hierarchical feature structures in protein sequences that mirror biological principles. Applications include protein design via inverse modeling, identification of novel functional relationships among distantly related proteins, and improved structure prediction without reliance on traditional methods like multiple sequence alignments. However, limitations persist in generalizing to unobserved biological contexts and modeling dynamic cellular interactions. The discussion also highlights the integration of computational models with experimental techniques like cryo-electron tomography to refine digital representations of biology, aiming to create a "virtual cell" map of protein interactions. Future goals emphasize advancing scalable, data-driven biological research to accelerate discoveries in therapeutics, climate solutions, and personalized medicine.
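
The scaling-law claim can be made concrete with a log-log fit. The sketch below assumes loss follows a pure power law, loss = a * N^(-b), in parameter count N. The data points are synthetic, generated from that formula purely for illustration; they are not measurements from any ESM model, and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size).

    Returns (a, b) for the assumed form loss = a * size ** (-b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from loss = 5.0 * N ** -0.1 (illustrative only).
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [5.0 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)
predicted = a * 1e11 ** -b  # extrapolate one order of magnitude further
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e11 params={predicted:.3f}")
```

Published scaling-law fits usually add an irreducible-loss term and fit data and compute jointly; this sketch keeps only the single-variable power law.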

What If

  • What if you leveraged metagenomic data to design novel therapeutic proteins for unexplored ecological niches?

    • Move: Invert the ESMC model for protein design, focusing on underrepresented metagenomic sequences (e.g., soil, hydrothermal vents) to generate binders for environmental toxins or pathogen targets.
    • Why Now: Metagenomic data is now accessible through open databases such as UniRef, and ESMC's open-sourcing lets solo developers experiment with protein design without deep prior domain knowledge.
    • Expected Upside: Discover proteins with unique functional properties (e.g., thermostable enzymes, novel antibodies) that could address unmet therapeutic or industrial needs, accelerating drug discovery in niche markets.
  • What if you built a lightweight tool to generate scFvs for hard-to-reach disease markers using ESMC's sequence-to-structure capabilities?

    • Move: Repurpose the ESMC model to design single-chain variable fragments (scFvs) for non-traditional targets (e.g., intracellular proteins or post-translational modifications) by fine-tuning a subset of its openly released parameters.
    • Why Now: ESMC's release under an MIT license provides a foundation for rapid prototyping, and the model's reported success in scFv design suggests that solo developers can work without traditional multiple sequence alignments (MSAs).
    • Expected Upside: Create a scalable pipeline for therapeutic antibody design, reducing reliance on costly lab experiments and enabling rapid validation of novel targets in synthetic biology or personalized medicine.
  • What if you curated a dataset to address data bias in ESMC by prioritizing underrepresented protein families and applied it to improve model generalization?

    • Move: Collect and annotate sequences from non-medical domains (e.g., extremophiles, plant-derived proteins) to augment ESMC's training data, then evaluate the impact on predictive accuracy for these groups.
    • Why Now: ESMC's performance is limited by its current dataset's skew toward disease-related proteins, and solo developers can use open resources such as UniRef or MetagenomicSeq to build specialized datasets.
    • Expected Upside: Enhance the model's versatility for applications like sustainable chemistry or agri-tech, while contributing to the broader scientific community by sharing curated datasets openly.
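
Curating a dataset like the one in the last item usually starts with removing near-duplicates so that heavily sampled families do not dominate. The sketch below assumes k-mer Jaccard similarity as a cheap stand-in for alignment-based sequence identity. The threshold, the function names, and the toy sequences are all illustrative assumptions; real pipelines typically rely on dedicated clustering tools.

```python
def kmers(seq: str, k: int = 3) -> set:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared k-mers divided by total distinct k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_dedup(seqs, threshold: float = 0.5, k: int = 3):
    """Keep a sequence only if it is below `threshold` similarity to every kept one.

    Longer sequences are visited first so they become the representatives.
    """
    kept, kept_kmers = [], []
    for seq in sorted(seqs, key=len, reverse=True):
        current = kmers(seq, k)
        if all(jaccard(current, other) < threshold for other in kept_kmers):
            kept.append(seq)
            kept_kmers.append(current)
    return kept


# Toy sequences, made up for illustration.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRA",  # differs from the first by one residue
    "GSHMLEDPVDAFKWWNKTGRLL",  # unrelated
]
representatives = greedy_dedup(sequences)
print(representatives)
```

Lowering `threshold` produces a smaller, more diverse set; raising it keeps more close variants of each family.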

Takeaway

  • Integrate open-source AI models like ESMC into your workflow to leverage their protein design, structure prediction, and functional analysis capabilities, accelerating development without needing to build models from scratch.

  • Curate diverse, non-redundant protein sequence datasets (e.g., using metagenomic sources or UniRef) to train or fine-tune models, ensuring coverage of functional and evolutionary diversity to improve generalization and reduce data bias.

  • Apply sparse coding techniques (e.g., sparse autoencoders, SAEs) to analyze and interpret your model's representation space, uncovering biological patterns or features that can guide iterative improvements in design or prediction accuracy.

  • Prioritize computational efficiency and data scalability by adopting model architectures and training strategies aligned with scaling laws (e.g., growing data volume alongside model size), so that your models remain performant as datasets grow.

  • Explore therapeutic applications like antibody design (e.g., scFvs) using AI-driven protein design tools, targeting high-demand areas (e.g., drug development) where predictive modeling can directly address functional design challenges.
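
The sparse-coding takeaway can be illustrated with the forward pass and loss of a sparse autoencoder. This is a minimal sketch, assuming the common ReLU encoder, linear decoder, and L1 sparsity penalty. The weights are hand-set toy values rather than trained ones, and the 2-dimensional "embedding" is a stand-in for a real model activation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode with ReLU into an overcomplete feature vector, then decode linearly."""
    pre_activation = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, z) for z in pre_activation]
    reconstruction = [z + b for z, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff=0.01):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Toy setup: 2-d input, 4 dictionary features (one per signed axis direction).
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
active = [i for i, f in enumerate(features) if f > 0.0]
print(features, reconstruction, loss, active)
```

In practice the weights are learned by minimizing this loss over many activations, and interpretation comes from inspecting which inputs most strongly activate each feature.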

