Helixgate

The structural data bottleneck in AI antibody discovery

Dan Benjamin, Co-Founder & Chief Technology Officer of Immuto Scientific, examines why limited antibody–antigen co-structure datasets may be constraining AI model performance—and how high-throughput structural generation could reshape the field.

Artificial intelligence now plays a central role in antibody discovery. Machine learning models assist in sequence design, paratope prediction, affinity maturation, and developability screening. Yet despite increasingly sophisticated architectures, model performance often degrades when applied to novel antibody sequences or previously uncharacterised targets.

The limitation may not be algorithmic. It may be structural.

The scale problem in antibody training data

Most AI systems for antibody discovery rely heavily on publicly available structural data, primarily antibody–antigen co-crystal and cryo-EM structures deposited in the Protein Data Bank (PDB). While these structures are invaluable, the number of unique antibody–antigen complexes remains small: on the order of a few thousand nonredundant structures.

For a field attempting to generalise across immense antibody sequence diversity, this represents a narrow sampling of interaction space.

Antibody repertoires span millions to billions of possible sequences. Yet structural training data captures only a tiny fraction of that diversity. When models trained on this limited structural universe are asked to predict binding for novel antibodies—particularly those far from known scaffolds—they often struggle to accurately model interaction geometry.

The issue is not that existing structures are flawed. It is that there are not enough of them.

Generalisation requires interaction diversity

AI models in antibody discovery are often evaluated on their ability to predict properties within known sequence families. However, real-world discovery requires generalisation beyond previously observed binders.

Failure modes frequently emerge when models encounter:

  • Novel CDR loop geometries
  • Unseen epitope topologies
  • Unusual binding interactions
  • Rare scaffold architectures

These are precisely the cases where additional co-structural examples would improve model learning. Expanding the diversity of antibody–antigen interaction geometries—rather than only refining model architecture—may be essential for building truly generalisable systems.

In this sense, structural data functions as infrastructure. Without sufficient interaction diversity, model sophistication alone cannot compensate.
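The gap between within-family evaluation and generalisation to unseen binders can be made concrete with a held-out split that keeps whole sequence families out of training. The following is a minimal sketch, not a method described in the article: the `family` labels, the record layout, and the function name `split_by_family` are all assumptions made for illustration, and in practice the families would come from some sequence or structure clustering step.

```python
import random
from collections import defaultdict


def split_by_family(records, test_fraction=0.2, seed=0):
    """Split (complex_id, family) records so no family appears on both sides.

    The record layout and family labels are hypothetical; they stand in for
    whatever clustering a real pipeline would use.
    """
    by_family = defaultdict(list)
    for complex_id, family in records:
        by_family[family].append(complex_id)

    # Shuffle the families (not the individual complexes) reproducibly.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    # Hold out at least one whole family for testing.
    n_test = max(1, round(len(families) * test_fraction))
    held_out = families[:n_test]
    kept = families[n_test:]

    train = [c for f in kept for c in by_family[f]]
    test = [c for f in held_out for c in by_family[f]]
    return train, test, held_out


# Invented example data, for illustration only.
records = [
    ("c1", "famA"), ("c2", "famA"), ("c3", "famB"),
    ("c4", "famC"), ("c5", "famC"), ("c6", "famD"), ("c7", "famE"),
]
train_ids, test_ids, held_out_families = split_by_family(records)
print("held-out families:", held_out_families)
```

A model scored only on a random split of these records could see close relatives of every test complex during training; a family-level split removes that shortcut, which is closer to the novel-binder setting the section describes.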

The limitations of traditional structure generation

X-ray crystallography and cryo-electron microscopy have defined structural biology for decades. They provide atomic-resolution insight and remain foundational tools.

However, they are not optimised for scale.

Structure determination through crystallography or cryo-EM requires significant time, protein engineering, purification, stabilisation, and iterative optimisation. As a result, structural datasets accumulate slowly. Certain protein classes are overrepresented, while others remain sparsely characterised. Many antibody–antigen interactions are never structurally resolved due to throughput constraints rather than scientific irrelevance.

Additionally, these approaches typically capture a single stabilised conformation. Although the resulting structures are high resolution, they do not readily capture interaction dynamics in solution.

For AI systems attempting to learn the rules of molecular recognition, both diversity and dynamic representation matter.

Toward high-throughput, in-solution structural generation

Recent advances in structural mapping technologies now enable antibody–antigen complexes to be characterised in solution at substantially higher throughput. By integrating empirical interaction data with computational modelling, it is possible to generate full PDB-format co-structures constrained by experimental measurements.
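One generic way to picture "constrained by experimental measurements" is to rank candidate computational poses by how well they agree with experimentally flagged residues. The sketch below is an illustrative assumption, not the workflow the author describes: the residue numbers, pose names, and the functions `restraint_score` and `best_pose` are hypothetical, and real pipelines would use far richer scoring.

```python
def restraint_score(interface_residues, protected_residues):
    """Fraction of experimentally flagged residues that a pose places at the interface."""
    if not protected_residues:
        return 0.0
    return len(interface_residues & protected_residues) / len(protected_residues)


def best_pose(poses, protected_residues):
    """Return the name of the pose that satisfies the most experimental restraints.

    poses maps a pose name to the set of antigen residue numbers it buries.
    """
    return max(poses, key=lambda name: restraint_score(poses[name], protected_residues))


# Invented example data, for illustration only.
protected = {45, 46, 47, 90}  # residues flagged by a hypothetical in-solution experiment
poses = {
    "pose_a": {10, 11, 12, 45},
    "pose_b": {44, 45, 46, 47, 89},
    "pose_c": {90, 120, 121},
}

for name in poses:
    print(name, restraint_score(poses[name], protected))
print("selected:", best_pose(poses, protected))
```

The point of the sketch is only that experimental data narrows the set of plausible models: poses that ignore the measured footprint score poorly and are discarded before any structure is written out.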

Importantly, these workflows are compatible with scale.

Rather than solving tens of structures per year, structural generation can expand into the thousands—and potentially far beyond. Increasing the size of structural training datasets from approximately 2,000 public co-structures to 10,000, 50,000, or even 100,000 diverse examples would meaningfully alter the landscape for AI antibody design.
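The scale argument above reduces to simple arithmetic. The sketch below uses the article's figures for the public baseline (about 2,000 co-structures) and the target sizes; the two throughput rates, 50 and 5,000 structures per year, are assumptions chosen only to stand in for "tens" and "thousands".

```python
import math

PUBLIC_BASELINE = 2_000  # approximate public co-structure count cited in the article


def fold_increase(target, baseline=PUBLIC_BASELINE):
    """How many times larger the target dataset is than the baseline."""
    return target / baseline


def years_to_reach(target, per_year, baseline=PUBLIC_BASELINE):
    """Whole years needed to grow from baseline to target at a fixed yearly rate."""
    return math.ceil(max(target - baseline, 0) / per_year)


# Assumed throughput rates, for illustration only.
TRADITIONAL_RATE = 50       # "tens of structures per year"
HIGH_THROUGHPUT_RATE = 5_000  # "thousands" per year

for target in (10_000, 50_000, 100_000):
    print(
        target,
        fold_increase(target),
        years_to_reach(target, TRADITIONAL_RATE),
        years_to_reach(target, HIGH_THROUGHPUT_RATE),
    )
```

Under these assumed rates, growing from 2,000 to 10,000 structures (a fivefold increase) would take 160 years at 50 structures per year and 2 years at 5,000 per year.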

Such expansion does not replace computational modelling; it strengthens it. Empirical constraints reduce structural ambiguity and anchor models in experimentally derived interaction data.

Structural data as competitive infrastructure

In AI-driven antibody discovery, competitive advantage is often framed in terms of proprietary models. Yet models trained on similar public datasets inevitably converge toward similar performance ceilings.

Industrialised structural generation enables the creation of proprietary, high-diversity antibody–antigen interaction datasets. These datasets can serve as the foundation for internal model training, fine-tuning, and validation. Over time, organisations capable of systematically expanding their structural libraries may achieve more robust generalisation across novel targets and binders.

This reframes structural biology from a validation tool to a data engine.

Rather than asking which architecture performs best on a fixed dataset, the field may increasingly ask which organisations can generate the richest structural training datasets.

Redefining what limits AI in antibody discovery

The past decade has focused heavily on algorithmic innovation. The next phase may focus on dataset expansion.

Antibodies recognise three-dimensional structure through highly specific geometric complementarity. Training AI systems to predict and design these interactions requires exposure to broad and diverse examples of that geometry.

Expanding antibody–antigen structural datasets, particularly with scalable, in-solution methods, offers a pathway to improve generalisation, robustness, and predictive accuracy across novel sequence space.

In antibody AI, the bottleneck may no longer be compute—it may be structural data volume.

About the author

A structural biologist and technology developer, Daniel Benjamin leads Immuto Scientific’s high-throughput structural proteomics and AI integration efforts for antibody discovery. His doctoral research focused on advancing mass spectrometry–based approaches for studying protein structure in biopharmaceutical research.

The post The structural data bottleneck in AI antibody discovery appeared first on Drug Discovery World (DDW).
