ESMFold2: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

Published 27 May 2026

Duration: 01:10:12

ESMC uses transformer-based models trained on 6.8 billion protein sequences to predict structures, design functional proteins, and uncover evolutionary patterns through scalable, data-driven approaches. The episode also covers the balance between evolutionary constraints and interpretability, and the limits imposed by data diversity and model generalizability.

Episode Description

Editor's note: In our first BioHub pod with Priscilla and Mark they discussed their acquisition of EvoScale, led by Alex Rives, who is now Head of Scie...

Overview

The podcast discusses advances in using AI to model and predict protein biology, focusing on frameworks like Evolutionary Scale Modeling (ESMC) and transformer-based language models trained on vast protein sequence databases. These models aim to predict evolutionary patterns, generate functional proteins, and design therapeutic molecules such as antibodies (e.g., scFvs) by searching a predictive model for sequences that meet specified criteria. The ESMC approach employs a "world modeling" framework to integrate structure prediction, sequence analysis, and mechanistic interpretability, with the latest "Cambrian" model open-sourced under an MIT license. Key challenges include balancing protein folding complexity with machine learning scalability, data biases favoring disease-related proteins, and the need for diverse training data, such as metagenomic sequences from environmental sources, to capture evolutionary and functional diversity across billions of protein sequences.
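The idea of designing a protein by searching a predictive model can be illustrated with a toy hill-climbing loop. This is only a sketch under stated assumptions: `toy_score` is a hypothetical stand-in for a model's prediction (it is not ESMC and not any real API), and single-residue mutation with greedy acceptance is just one simple search strategy, not necessarily the one described in the episode.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWY")


def toy_score(seq: str) -> float:
    """Hypothetical scorer standing in for a predictive model.

    Rewards hydrophobic residues at even indices and non-hydrophobic
    residues at odd indices. Returns a value in [0, 1].
    """
    hits = sum((aa in HYDROPHOBIC) == (i % 2 == 0) for i, aa in enumerate(seq))
    return hits / len(seq)


def design_by_search(start: str, score_fn, steps: int = 500, seed: int = 0):
    """Greedy search: mutate one position, keep the change if the score does not drop."""
    rng = random.Random(seed)
    best, best_score = start, score_fn(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        cand_score = score_fn(candidate)
        if cand_score >= best_score:
            best, best_score = candidate, cand_score
    return best, best_score


if __name__ == "__main__":
    designed, score = design_by_search("G" * 12, toy_score)
    print(designed, round(score, 3))
```

In a real setting the scorer would be replaced by a trained model's prediction for the property of interest, and the search would typically be more sophisticated than single-site hill climbing.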

The models leverage scaling laws to improve performance as data and parameter size grow, revealing hierarchical feature structures in protein sequences that mirror biological principles. Applications include protein design via inverse modeling, identification of novel functional relationships among distantly related proteins, and improved structure prediction without reliance on traditional methods like multiple sequence alignments. However, limitations persist in generalizing to unobserved biological contexts and modeling dynamic cellular interactions. The discussion also highlights the integration of computational models with experimental techniques like cryo-electron tomography to refine digital representations of biology, aiming to create a "virtual cell" map of protein interactions. Future goals emphasize advancing scalable, data-driven biological research to accelerate discoveries in therapeutics, climate solutions, and personalized medicine.
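As a rough illustration of what a scaling law looks like in practice, the sketch below fits a power law of the form loss = a * N^(-b) to a handful of points by least squares in log-log space. The power-law form is a common assumption in the scaling-law literature rather than something this summary specifies, and the data points are synthetic, generated from known constants purely to show that the fit recovers them.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size ** (-b) by ordinary least squares on log-log values.

    Returns (a, b). Assumes all sizes and losses are positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data (not measurements from any model): a = 5.0, b = 0.1.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)

if __name__ == "__main__":
    print(f"a = {a:.3f}, b = {b:.3f}")
```

With real training runs the points would not lie exactly on a line in log-log space, and the fitted exponent would only be trustworthy within the range of sizes actually measured.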

What If

What if you leveraged metagenomic data to design novel therapeutic proteins for unexplored ecological niches?
- Move: Use inverse modeling with ESMC for protein design, focusing on underrepresented metagenomic sequences (e.g., soil, hydrothermal vents) to generate binders for environmental toxins or pathogen targets.
- Why Now: Metagenomic data is now accessible via open databases like UniRef, and ESMC's open-sourcing allows solo developers to experiment with protein design without prior domain knowledge.
- Expected Upside: Discover proteins with unique functional properties (e.g., thermostable enzymes, novel antibodies) that could address unmet therapeutic or industrial needs, accelerating drug discovery in niche markets.
What if you built a lightweight tool to generate scFvs for targeting hard-to-reach disease markers using ESMC's sequence-to-structure capabilities?
- Move: Repurpose the ESMC model to design single-chain variable fragments (scFvs) for non-traditional targets (e.g., intracellular proteins or post-translational modifications) by fine-tuning a subset of its open-source parameters.
- Why Now: ESMC's open-sourcing under an MIT license provides a foundation for rapid prototyping, and the model's success in scFv design supports its potential for solo developers to bypass traditional multiple sequence alignments (MSAs).
- Expected Upside: Create a scalable pipeline for therapeutic antibody design, reducing reliance on costly lab experiments and enabling rapid validation of novel targets in synthetic biology or personalized medicine.
What if you curated a dataset to address data bias in ESMC by prioritizing underrepresented protein families and applied it to improve model generalization?
- Move: Collect and annotate sequences from non-medical domains (e.g., extremophiles, plant-derived proteins) to augment ESMC's training data, then evaluate its impact on predictive accuracy for these groups.
- Why Now: ESMC's performance is limited by its current dataset's skew toward disease-related proteins, and solo developers can leverage open-source resources like UniRef or MetagenomicSeq to build specialized datasets.
- Expected Upside: Enhance the model's versatility for applications like sustainable chemistry or agri-tech, while contributing to the broader scientific community via open-source sharing of curated datasets.
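The dataset-curation idea above depends on removing near-duplicate sequences so that overrepresented families do not dominate. The sketch below shows one simple way to do that, assuming k-mer Jaccard similarity as a cheap proxy for sequence identity and a greedy keep-or-drop pass. The function names, the threshold, and the example sequences are all illustrative; production pipelines generally rely on dedicated clustering tools such as MMseqs2 or CD-HIT rather than code like this.

```python
def kmers(seq: str, k: int = 3) -> set:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_nonredundant(seqs, threshold: float = 0.5, k: int = 3):
    """Keep a sequence only if it is below `threshold` similarity to every kept one.

    Longer sequences are considered first, so each group of similar
    sequences is represented by its longest member.
    """
    kept = []  # list of (sequence, k-mer set) pairs
    for seq in sorted(seqs, key=len, reverse=True):
        seq_kmers = kmers(seq, k)
        if all(jaccard(seq_kmers, other) < threshold for _, other in kept):
            kept.append((seq, seq_kmers))
    return [seq for seq, _ in kept]


if __name__ == "__main__":
    # Made-up example sequences, not real proteins.
    example = ["MKTAYIAKQR", "MKTAYIAKQK", "GSSGSSGSSG"]
    print(greedy_nonredundant(example))
```

The pairwise comparison is quadratic in the number of kept sequences, which is fine for a small curated set but would not scale to billions of sequences.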

Takeaway

Integrate open-source AI models like ESMC into your workflow to leverage their protein design, structure prediction, and functional analysis capabilities, accelerating development without needing to build models from scratch.
Curate diverse, non-redundant protein sequence datasets (e.g., using metagenomic sources or UniRef) to train or fine-tune models, ensuring coverage of functional and evolutionary diversity to improve generalization and reduce data bias.
Apply sparse coding techniques (e.g., sparse autoencoders, SAEs) to analyze and interpret your model's representation space, uncovering biological patterns or features that can guide iterative improvements in design or prediction accuracy.
Prioritize computational efficiency and data scalability by adopting model architectures and training strategies aligned with scaling laws (e.g., expanding data volumes over compute), ensuring your models remain performant as datasets grow.
Explore therapeutic applications like antibody design (e.g., scFvs) using AI-driven protein design tools, targeting high-demand areas (e.g., drug development) where predictive modeling can directly address functional design challenges.
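For the sparse-autoencoder takeaway, the sketch below shows the basic computation in a generic SAE: encode an embedding into a wider, non-negative feature vector, decode it back, and score the result with reconstruction error plus an L1 sparsity penalty. It is a forward pass only, with hand-picked weights chosen so the arithmetic is easy to follow. The weights, dimensions, and penalty coefficient are all assumptions for illustration and have nothing to do with the actual models discussed in the episode.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vector]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1: float = 0.01):
    """Return (features, reconstruction, loss) for one input vector.

    loss = mean squared reconstruction error + l1 * sum of |features|
    """
    features = relu([z + b for z, b in zip(matvec(w_enc, x), b_enc)])
    recon = [z + b for z, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return features, recon, mse + l1 * sparsity


# Toy setup: a 2-dimensional input expanded into 4 features,
# one for each signed direction of each input coordinate.
W_ENC = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
B_DEC = [0.0, 0.0]

if __name__ == "__main__":
    features, recon, loss = sae_forward([1.0, -2.0], W_ENC, B_ENC, W_DEC, B_DEC)
    print(features, recon, loss)
```

In this toy case only two of the four features are active for the input, which is the kind of sparse, per-feature activation pattern that makes individual features easier to inspect.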
