The structural data bottleneck in AI antibody discovery

Dan Benjamin, Co-Founder & Chief Technology Officer of Immuto Scientific, examines why limited antibody–antigen co-structure datasets may be constraining AI model performance—and how high-throughput structural generation could reshape the field.
Artificial intelligence now plays a central role in antibody discovery. Machine learning models assist in sequence design, paratope prediction, affinity maturation, and developability screening. Yet despite increasingly sophisticated architectures, model performance often degrades when applied to novel antibody sequences or previously uncharacterised targets.
The limitation may not be algorithmic. It may be structural.
The scale problem in antibody training data
Most AI systems for antibody discovery rely heavily on publicly available structural data, primarily antibody–antigen co-crystal and cryo-EM structures deposited in the Protein Data Bank (PDB). While these structures are invaluable, the number of unique antibody–antigen complexes remains relatively small: on the order of only a few thousand nonredundant structures.
For a field attempting to generalise across immense antibody sequence diversity, this represents a narrow sampling of interaction space.
Antibody repertoires span millions to billions of possible sequences. Yet structural training data captures only a tiny fraction of that diversity. When models trained on this limited structural universe are asked to predict binding for novel antibodies—particularly those far from known scaffolds—they often struggle to accurately model interaction geometry.
The issue is not that existing structures are flawed. It is that there are not enough of them.
Generalisation requires interaction diversity
AI models in antibody discovery are often evaluated on their ability to predict properties within known sequence families. However, real-world discovery requires generalisation beyond previously observed binders.
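One way to make this distinction concrete is to hold out entire sequence families during evaluation instead of sampling test examples at random. The sketch below is a minimal illustration, assuming each record already carries a family label (for example, a cluster ID from sequence clustering). The `family_holdout_split` function, the record fields, and the toy data are hypothetical and not taken from any specific pipeline.

```python
import random
from collections import defaultdict


def family_holdout_split(records, test_fraction=0.25, seed=0):
    """Split records so that no family appears in both train and test.

    `records` is a list of dicts with a hypothetical "family" key,
    e.g. a cluster ID assigned by sequence clustering.
    """
    by_family = defaultdict(list)
    for record in records:
        by_family[record["family"]].append(record)

    families = sorted(by_family)           # sorted first for reproducibility
    random.Random(seed).shuffle(families)  # shuffle whole families, not records

    # Hold out at least one complete family.
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train = [r for f in families if f not in test_families for r in by_family[f]]
    test = [r for f in families if f in test_families for r in by_family[f]]
    return train, test


# Toy, made-up records: IDs and family labels are placeholders.
records = [
    {"id": "ab1", "family": "A"},
    {"id": "ab2", "family": "A"},
    {"id": "ab3", "family": "B"},
    {"id": "ab4", "family": "B"},
    {"id": "ab5", "family": "C"},
    {"id": "ab6", "family": "D"},
]
train, test = family_holdout_split(records)
print("held-out families:", sorted({r["family"] for r in test}))
```

A random per-record split would place close relatives of every test antibody in the training set. Splitting by family keeps the test set closer to the novel-binder setting described here.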
Failure modes frequently emerge when models encounter:
- Novel CDR loop geometries
- Unseen epitope topologies
- Unusual binding interactions
- Rare scaffold architectures
These are precisely the cases where additional co-structural examples would improve model learning. Expanding the diversity of antibody–antigen interaction geometries—rather than only refining model architecture—may be essential for building truly generalisable systems.
In this sense, structural data functions as infrastructure. Without sufficient interaction diversity, model sophistication alone cannot compensate.
The limitations of traditional structure generation
X-ray crystallography and cryo-electron microscopy have defined structural biology for decades. They provide atomic-resolution insight and remain foundational tools.
However, they are not optimised for scale.
Structure determination through crystallography or cryo-EM requires significant time, protein engineering, purification, stabilisation, and iterative optimisation. As a result, structural datasets accumulate slowly. Certain protein classes are overrepresented, while others remain sparsely characterised. Many antibody–antigen interactions are never structurally resolved due to throughput constraints rather than scientific irrelevance.
Additionally, these approaches typically capture a single stabilised conformation. Although the resulting structures are high resolution, they do not readily capture interaction dynamics in solution.
For AI systems attempting to learn the rules of molecular recognition, both diversity and dynamic representation matter.
Toward high-throughput, in-solution structural generation
Recent advances in structural mapping technologies now enable antibody–antigen complexes to be characterised in solution at substantially higher throughput. By integrating empirical interaction data with computational modelling, it is possible to generate full PDB-format co-structures constrained by experimental measurements.
Importantly, these workflows are compatible with scale.
Rather than solving tens of structures per year, structural generation can expand into the thousands—and potentially far beyond. Increasing the size of structural training datasets from approximately 2,000 public co-structures to 10,000, 50,000, or even 100,000 diverse examples would meaningfully alter the landscape for AI antibody design.
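To put those figures in proportion, the short sketch below works through the arithmetic, taking the approximate 2,000-structure public baseline at face value. The function name is illustrative only.

```python
# Baseline: the article's approximate count of public co-structures.
BASELINE = 2_000
TARGETS = [10_000, 50_000, 100_000]


def fold_increase(target, baseline=BASELINE):
    """Return how many times larger `target` is than `baseline`."""
    return target / baseline


for target in TARGETS:
    print(f"{target:>7,} structures = {fold_increase(target):.0f}x the public baseline")
```

The targets correspond to 5-fold, 25-fold, and 50-fold expansions. Fold increase measures volume only, so it says nothing about whether the added examples are diverse.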
Such expansion does not replace computational modelling; it strengthens it. Empirical constraints reduce structural ambiguity and anchor models in experimentally derived interaction data.
Structural data as competitive infrastructure
In AI-driven antibody discovery, competitive advantage is often framed in terms of proprietary models. Yet models trained on similar public datasets inevitably converge toward similar performance ceilings.
Industrialised structural generation enables the creation of proprietary, high-diversity antibody–antigen interaction datasets. These datasets can serve as the foundation for internal model training, fine-tuning, and validation. Over time, organisations capable of systematically expanding their structural libraries may achieve more robust generalisation across novel targets and binders.
This reframes structural biology from a validation tool to a data engine.
Rather than asking which architecture performs best on a fixed dataset, the field may increasingly ask which organisations can generate the richest structural training datasets.
Redefining what limits AI in antibody discovery
The past decade has focused heavily on algorithmic innovation. The next phase may focus on dataset expansion.
Antibodies recognise three-dimensional structure through highly specific geometric complementarity. Training AI systems to predict and design these interactions requires exposure to broad and diverse examples of that geometry.
Expanding antibody–antigen structural datasets, particularly with scalable, in-solution methods, offers a pathway to improve generalisation, robustness, and predictive accuracy across novel sequence space.
In antibody AI, the bottleneck may no longer be compute—it may be structural data volume.
About the author
A structural biologist and technology developer, Daniel Benjamin leads Immuto Scientific’s high-throughput structural proteomics and AI integration efforts for antibody discovery. His doctoral research focused on advancing mass spectrometry–based approaches for studying protein structure in biopharmaceutical research.
The post The structural data bottleneck in AI antibody discovery appeared first on Drug Discovery World (DDW).