The structural data bottleneck in AI antibody discovery

Dan Benjamin, Co-Founder & Chief Technology Officer of Immuto Scientific, examines why limited antibody–antigen co-structure datasets may be constraining AI model performance—and how high-throughput structural generation could reshape the field.

Artificial intelligence now plays a central role in antibody discovery. Machine learning models assist in sequence design, paratope prediction, affinity maturation, and developability screening. Yet despite increasingly sophisticated architectures, model performance often degrades when applied to novel antibody sequences or previously uncharacterised targets.

The limitation may not be algorithmic. It may be structural.

The scale problem in antibody training data

Most AI systems for antibody discovery rely heavily on publicly available structural data—primarily antibody–antigen co-crystal and cryo-EM structures deposited in the Protein Data Bank (PDB). While invaluable, the number of unique antibody–antigen complexes remains relatively small—on the order of only a few thousand nonredundant structures.

For a field attempting to generalise across immense antibody sequence diversity, this represents a narrow sampling of interaction space.

Antibody repertoires span millions to billions of possible sequences, yet structural training data captures only a tiny fraction of that diversity. When models trained on this limited structural universe are asked to predict binding for novel antibodies—particularly those far from known scaffolds—they often struggle to model interaction geometry accurately.

The issue is not that existing structures are flawed. It is that there are not enough of them.
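The scale gap can be made concrete with a back-of-the-envelope calculation. The loop length below is an assumed, typical CDR-H3 length and the structure count is the order-of-magnitude figure used in this article; both are illustrative, not measured values:

```python
# Back-of-the-envelope comparison: combinatorial CDR-H3 sequence space
# versus the public antibody-antigen co-structure record. Numbers are
# illustrative assumptions, not measured data.

AMINO_ACIDS = 20
CDRH3_LENGTH = 12             # a typical human CDR-H3 loop length (assumed)
PUBLIC_CO_STRUCTURES = 2_000  # order-of-magnitude figure cited above

sequence_space = AMINO_ACIDS ** CDRH3_LENGTH   # CDR-H3 alone, ~4.1e15
coverage = PUBLIC_CO_STRUCTURES / sequence_space

print(f"CDR-H3 sequence space: {sequence_space:.2e}")  # ~4.10e+15
print(f"Structural coverage:   {coverage:.2e}")        # ~4.88e-13
```

Even restricted to a single loop of modest length, the sampled fraction of sequence space is vanishingly small—and the full paratope spans multiple loops.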

Generalisation requires interaction diversity

AI models in antibody discovery are often evaluated on their ability to predict properties within known sequence families. However, real-world discovery requires generalisation beyond previously observed binders.

Failure modes frequently emerge when models encounter:

  • Novel CDR loop geometries
  • Unseen epitope topologies
  • Unusual binding interactions
  • Rare scaffold architectures

These are precisely the cases where additional co-structural examples would improve model learning. Expanding the diversity of antibody–antigen interaction geometries—rather than only refining model architecture—may be essential for building truly generalisable systems.

In this sense, structural data functions as infrastructure. Without sufficient interaction diversity, model sophistication alone cannot compensate.

The limitations of traditional structure generation

X-ray crystallography and cryo-electron microscopy have defined structural biology for decades. They provide atomic-resolution insight and remain foundational tools.

However, they are not optimised for scale.

Structure determination through crystallography or cryo-EM requires significant time, protein engineering, purification, stabilisation, and iterative optimisation. As a result, structural datasets accumulate slowly. Certain protein classes are overrepresented, while others remain sparsely characterised. Many antibody–antigen interactions are never structurally resolved due to throughput constraints rather than scientific irrelevance.

Additionally, these approaches typically capture a single stabilised conformation. Although high in resolution, such static snapshots do not readily capture interaction dynamics in solution.

For AI systems attempting to learn the rules of molecular recognition, both diversity and dynamic representation matter.

Toward high-throughput, in-solution structural generation

Recent advances in structural mapping technologies now enable antibody–antigen complexes to be characterised in solution at substantially higher throughput. By integrating empirical interaction data with computational modelling, it is possible to generate full PDB-format co-structures constrained by experimental measurements.

Importantly, these workflows are compatible with scale.

Rather than solving tens of structures per year, structural generation can expand into the thousands—and potentially far beyond. Increasing the size of structural training datasets from approximately 2,000 public co-structures to 10,000, 50,000, or even 100,000 diverse examples would meaningfully alter the landscape for AI antibody design.

Such expansion does not replace computational modelling; it strengthens it. Empirical constraints reduce structural ambiguity and anchor models in experimentally derived interaction data.
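The idea of anchoring models in experimental constraints can be sketched generically. The fragment below is an illustrative restraint-scoring routine, not any specific vendor workflow: candidate co-structure poses are penalised for violating experimentally derived upper-bound distance restraints (as in restraint-guided integrative modelling). All residue labels and distances are hypothetical:

```python
from math import dist

# Generic sketch: score candidate antibody-antigen poses against
# experimentally derived upper-bound distance restraints (e.g. from
# footprinting or crosslinking data). Labels and bounds are hypothetical.

def restraint_penalty(coords, restraints):
    """Sum of squared violations of upper-bound distance restraints.

    coords: dict mapping residue label -> (x, y, z) coordinates
    restraints: list of (res_i, res_j, max_distance) tuples
    """
    penalty = 0.0
    for res_i, res_j, max_distance in restraints:
        d = dist(coords[res_i], coords[res_j])
        if d > max_distance:
            penalty += (d - max_distance) ** 2
    return penalty

# Toy example: two candidate poses checked against one 8 A restraint.
restraints = [("H_CDR3_100", "Ag_45", 8.0)]
pose_a = {"H_CDR3_100": (0.0, 0.0, 0.0), "Ag_45": (6.0, 0.0, 0.0)}
pose_b = {"H_CDR3_100": (0.0, 0.0, 0.0), "Ag_45": (12.0, 0.0, 0.0)}

print(restraint_penalty(pose_a, restraints))  # 0.0  -> consistent with data
print(restraint_penalty(pose_b, restraints))  # 16.0 -> violates bound by 4 A
```

In practice such penalties are folded into pose selection or refinement, so that the final deposited co-structure is the one most consistent with the solution-phase measurements.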

Structural data as competitive infrastructure

In AI-driven antibody discovery, competitive advantage is often framed in terms of proprietary models. Yet models trained on similar public datasets inevitably converge toward similar performance ceilings.

Industrialised structural generation enables the creation of proprietary, high-diversity antibody–antigen interaction datasets. These datasets can serve as the foundation for internal model training, fine-tuning, and validation. Over time, organisations capable of systematically expanding their structural libraries may achieve more robust generalisation across novel targets and binders.

This reframes structural biology from a validation tool to a data engine.

Rather than asking which architecture performs best on a fixed dataset, the field may increasingly ask which organisations can generate the richest structural training datasets.

Redefining what limits AI in antibody discovery

The past decade has focused heavily on algorithmic innovation. The next phase may focus on dataset expansion.

Antibodies recognise three-dimensional structure through highly specific geometric complementarity. Training AI systems to predict and design these interactions requires exposure to broad and diverse examples of that geometry.

Expanding antibody–antigen structural datasets—particularly with scalable, in-solution methods—offers a pathway to improve generalisation, robustness, and predictive accuracy across novel sequence space.

In antibody AI, the bottleneck may no longer be compute—it may be structural data volume.

About the author 

A structural biologist and technology developer, Daniel Benjamin leads the company’s high-throughput structural proteomics and AI integration efforts for antibody discovery. His doctoral research focused on advancing mass spectrometry–based approaches for studying protein structure in biopharmaceutical research.

The post The structural data bottleneck in AI antibody discovery appeared first on Drug Discovery World (DDW).
