The structural data bottleneck in AI antibody discovery

Dan Benjamin, Co-Founder & Chief Technology Officer of Immuto Scientific, examines why limited antibody–antigen co-structure datasets may be constraining AI model performance—and how high-throughput structural generation could reshape the field.
Artificial intelligence now plays a central role in antibody discovery. Machine learning models assist in sequence design, paratope prediction, affinity maturation, and developability screening. Yet despite increasingly sophisticated architectures, model performance often degrades when applied to novel antibody sequences or previously uncharacterised targets.
The limitation may not be algorithmic. It may be structural.
The scale problem in antibody training data
Most AI systems for antibody discovery rely heavily on publicly available structural data, primarily antibody–antigen co-crystal and cryo-EM structures deposited in the Protein Data Bank (PDB). These structures are invaluable, but the number of unique antibody–antigen complexes remains relatively small: on the order of a few thousand nonredundant structures.
For a field attempting to generalise across immense antibody sequence diversity, this represents a narrow sampling of interaction space.
Antibody repertoires span millions to billions of possible sequences. Yet structural training data captures only a tiny fraction of that diversity. When models trained on this limited structural universe are asked to predict binding for novel antibodies—particularly those far from known scaffolds—they often struggle to accurately model interaction geometry.
The issue is not that existing structures are flawed. It is that there are not enough of them.
Generalisation requires interaction diversity
AI models in antibody discovery are often evaluated on their ability to predict properties within known sequence families. However, real-world discovery requires generalisation beyond previously observed binders.
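One way to make the gap between within-family and out-of-family evaluation concrete is to hold out whole sequence families when splitting data. The sketch below is illustrative only: the function name `family_holdout_split` and the family labels are hypothetical, and it assumes each antibody has already been assigned to a family by some upstream clustering step that the article does not describe.

```python
import random
from collections import defaultdict


def family_holdout_split(records, test_fraction=0.2, seed=0):
    """Split (family_id, item) pairs so that no family appears in both sets.

    `family_id` is an assumed label, e.g. from clustering antibody sequences.
    """
    # Group items by their family label.
    by_family = defaultdict(list)
    for family_id, item in records:
        by_family[family_id].append(item)

    # Shuffle the families (not the individual items) reproducibly.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    # Hold out at least one whole family for testing.
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train = [(f, x) for f in families if f not in test_families for x in by_family[f]]
    test = [(f, x) for f in families if f in test_families for x in by_family[f]]
    return train, test
```

A model scored on the `test` set from this split never sees a member of the held-out families during training, which is closer to the discovery setting than a random split over individual sequences.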
Failure modes frequently emerge when models encounter:
- Novel CDR loop geometries
- Unseen epitope topologies
- Unusual binding interactions
- Rare scaffold architectures
These are precisely the cases where additional co-structural examples would improve model learning. Expanding the diversity of antibody–antigen interaction geometries—rather than only refining model architecture—may be essential for building truly generalisable systems.
In this sense, structural data functions as infrastructure. Without sufficient interaction diversity, model sophistication alone cannot compensate.
The limitations of traditional structure generation
X-ray crystallography and cryo-electron microscopy have defined structural biology for decades. They provide atomic-resolution insight and remain foundational tools.
However, they are not optimised for scale.
Structure determination through crystallography or cryo-EM requires significant time, protein engineering, purification, stabilisation, and iterative optimisation. As a result, structural datasets accumulate slowly. Certain protein classes are overrepresented, while others remain sparsely characterised. Many antibody–antigen interactions are never structurally resolved due to throughput constraints rather than scientific irrelevance.
Additionally, these approaches typically capture a single stabilised conformation. Although the resulting structures are high-resolution, they do not readily capture interaction dynamics in solution.
For AI systems attempting to learn the rules of molecular recognition, both diversity and dynamic representation matter.
Toward high-throughput, in-solution structural generation
Recent advances in structural mapping technologies now enable antibody–antigen complexes to be characterised in solution at substantially higher throughput. By integrating empirical interaction data with computational modelling, it is possible to generate full PDB-format co-structures constrained by experimental measurements.
Importantly, these workflows are compatible with scale.
Rather than solving tens of structures per year, structural generation can expand into the thousands—and potentially far beyond. Increasing the size of structural training datasets from approximately 2,000 public co-structures to 10,000, 50,000, or even 100,000 diverse examples would meaningfully alter the landscape for AI antibody design.
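The scale of that expansion is easy to state as simple arithmetic. The snippet below uses only the figures quoted above; the variable and function names are illustrative.

```python
# Approximate public co-structure count quoted in the text.
baseline = 2_000
# Dataset sizes the text proposes as targets.
targets = [10_000, 50_000, 100_000]


def fold_increase(target, base=baseline):
    """Return how many times larger `target` is than `base`."""
    return target / base


for t in targets:
    print(f"{t:,} structures is {fold_increase(t):.0f}x the public baseline")
```

The three targets correspond to 5-fold, 25-fold, and 50-fold increases over the roughly 2,000 public co-structures.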
Such expansion does not replace computational modelling; it strengthens it. Empirical constraints reduce structural ambiguity and anchor models in experimentally derived interaction data.
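As a rough illustration of how empirical constraints can reduce ambiguity, candidate models can be ranked by how many experimentally derived distance restraints they satisfy. This is a generic sketch, not the workflow described in the article, which does not specify how measurements are applied. The residue identifiers, the restraint format `(residue_a, residue_b, max_distance)`, and the function names are all assumptions.

```python
import math


def satisfied_fraction(coords, restraints, tolerance=0.0):
    """Fraction of distance restraints met by one candidate model.

    coords: {residue_id: (x, y, z)} for a candidate model (hypothetical format).
    restraints: list of (residue_a, residue_b, max_distance) tuples.
    """
    if not restraints:
        return 0.0
    met = sum(
        1
        for a, b, max_d in restraints
        if math.dist(coords[a], coords[b]) <= max_d + tolerance
    )
    return met / len(restraints)


def rank_models(models, restraints):
    """Return model names ordered from most to fewest restraints satisfied."""
    return sorted(
        models,
        key=lambda name: satisfied_fraction(models[name], restraints),
        reverse=True,
    )
```

Candidates that violate many restraints drop to the bottom of the ranking, so the experimental data narrows the set of plausible interaction geometries rather than replacing the computational model that proposed them.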
Structural data as competitive infrastructure
In AI-driven antibody discovery, competitive advantage is often framed in terms of proprietary models. Yet models trained on similar public datasets inevitably converge toward similar performance ceilings.
Industrialised structural generation enables the creation of proprietary, high-diversity antibody–antigen interaction datasets. These datasets can serve as the foundation for internal model training, fine-tuning, and validation. Over time, organisations capable of systematically expanding their structural libraries may achieve more robust generalisation across novel targets and binders.
This reframes structural biology from a validation tool to a data engine.
Rather than asking which architecture performs best on a fixed dataset, the field may increasingly ask which organisations can generate the richest structural training datasets.
Redefining what limits AI in antibody discovery
The past decade has focused heavily on algorithmic innovation. The next phase may focus on dataset expansion.
Antibodies recognise three-dimensional structure through highly specific geometric complementarity. Training AI systems to predict and design these interactions requires exposure to broad and diverse examples of that geometry.
Expanding antibody–antigen structural datasets, particularly with scalable, in-solution methods, offers a pathway to improve generalisation, robustness, and predictive accuracy across novel sequence space.
In antibody AI, the bottleneck may no longer be compute—it may be structural data volume.
About the author
A structural biologist and technology developer, Daniel Benjamin leads Immuto Scientific’s high-throughput structural proteomics and AI integration efforts for antibody discovery. His doctoral research focused on advancing mass spectrometry–based approaches for studying protein structure in biopharmaceutical research.
The post The structural data bottleneck in AI antibody discovery appeared first on Drug Discovery World (DDW).
Angelini fortifies neurology portfolio with $4.1B buyout of Catalyst Pharma
Angelini Pharma is spending $4.1 billion to buy Catalyst Pharmaceuticals and its trio of FDA-approved treatments for rare neurological diseases.
The buyout will add three medicines to Angelini’s portfolio: Firdapse for Lambert-Eaton myasthenic syndrome, Agamree …
STAT+: Color me skeptical: Drinking gold is not an ALS cure
This is the online version of Adam’s Biotech Scorecard, a subscriber-only newsletter. STAT+ subscribers can sign up here to get it delivered to their inbox.
It’s been a while since I wrote a “Mean Adam” newsletter.
The biotech company Clene is developing a treatment for ALS called CNM-Au8 that it describes as a “highly concentrated aqueous suspension of catalytically-active, clean-surfaced, faceted gold nanocrystals.”
Allow me to translate: The Clene “drug” is gold microdust suspended in water.
STAT+: Angelini Pharma buys Catalyst Pharmaceuticals and its rare disease drugs for $4.1B
The Italian company Angelini Pharma said Thursday it would buy the rare-disease focused Catalyst Pharmaceuticals for roughly $4.1 billion in cash.
The deal values Florida-based Catalyst at $31.50 a share, a 28% premium to the 30-day period before April 22, when it became publicly known that a deal was in the works.
Buying Catalyst, which sells three approved medicines, will give Angelini a foothold in the U.S. market. It also builds on its work in neurology.