Predicting protein binding and engineering antibodies with AlphaBind
A-Alpha Bio was founded in 2017 at the University of Washington’s Institute for Protein Design and Center for Synthetic Biology. From day one, we’ve been on a mission to unlock the potential of protein-protein interactions to improve human health. Today we have two exciting announcements to share that demonstrate the progress we’ve made in harnessing synthetic biology and machine learning to measure, predict, and engineer protein-protein interactions:
- We just crossed a major milestone by experimentally measuring our billionth protein-protein interaction using our synthetic biology platform, AlphaSeq.
- Today, we also released our AlphaBind antibody optimization model, which uses antibody-antigen binding data to engineer effective therapeutics. Our work building and benchmarking AlphaBind is described in a preprint on bioRxiv, and anyone interested in accessing the model (released under the MIT license) can find code, documentation, and tutorials in the associated GitHub repo.
Why focus on protein-protein interactions?
The way proteins interact with each other (how they bind) drives their function and has implications for nearly every disease, from viral infection to cancer. The ability to reliably engineer protein-protein interactions could enable treatments for a wide range of diseases across numerous modalities, from antibodies to molecular glues.
Structure prediction tools, including AlphaFold, have largely solved one major problem in protein engineering: predicting a protein's structure from its sequence. Despite the rapid advances in structure-based models, however, it is not currently possible to reliably predict the binding strength of two protein sequences, or to engineer that pair of proteins to bind better, even when the predicted structure is correct. It is therefore not possible to predict function from sequence using structure-based models alone.
To engineer protein-protein interactions, we need quantitative protein-protein binding data; however, these data are exceedingly scarce because measuring the affinity of two proteins typically requires recombinant protein expression, purification, and a low-throughput binding assay. That is the problem that we are solving at A-Alpha Bio.
The A-Alpha Bio Advantage
A-Alpha Bio has the unique ability to generate protein-protein binding data at scale. In September 2024, we measured our billionth protein-protein interaction. Over half of these measurements were made in the past year, and the rate at which we’re able to generate data continues to increase. Moreover, these measured interactions are highly diverse, encompassing intracellular proteins, extracellular and cell-surface proteins, antibodies in numerous formats, viral and bacterial proteins, synthetic proteins, and more. AlphaSeq datasets are quantitative, highly reproducible, and multidimensional—all characteristics needed to train an effective computational protein binding model. Additionally, AlphaSeq is fast—going from digital designs to measured binding data takes about one month—enabling rapid turnaround for iterative engineering, efficient model validation, and fine-tuning.
Experimental approaches for quantitatively measuring protein-protein binding strength are notoriously low throughput, requiring recombinant protein expression and specialized instrumentation to measure interactions one at a time, which has resulted in data scarcity. For example, published binding measurements are assembled into a public database called BioGRID, which currently contains about two million entries. That is roughly the amount of data we generate using AlphaSeq each day. Additionally, since all of our datasets are generated on the same platform with consistent controls, measurements and data formats are interoperable. By eliminating data scarcity for protein binding, we aim to learn the direct relationship between protein sequence and protein function.
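The daily-throughput comparison above can be sanity-checked with a little arithmetic. The sketch below uses only figures quoted in this post (one billion total measurements, over half of them in the past year, about two million database entries). Treating "over half" as exactly half is an assumption, so the result is a lower bound on the average daily rate, not a reported number.

```python
# Back-of-the-envelope throughput check using only figures quoted in the post.
TOTAL_MEASUREMENTS = 1_000_000_000  # "billionth protein-protein interaction"
FRACTION_LAST_YEAR = 0.5            # "over half" in the past year (assumed lower bound)
DATABASE_ENTRIES = 2_000_000        # "about two million entries"


def average_daily_rate(total: int, fraction: float, days: int = 365) -> float:
    """Average measurements per day over the last year (a lower bound)."""
    return total * fraction / days


daily = average_daily_rate(TOTAL_MEASUREMENTS, FRACTION_LAST_YEAR)
days_to_match = DATABASE_ENTRIES / daily

print(f"Average daily rate (lower bound): {daily:,.0f} measurements")
print(f"Days to match ~2M entries at that rate: {days_to_match:.1f}")
```

Averaged over the year, this works out to roughly 1.4 million measurements per day. Because the rate is still increasing, that is consistent with a current daily output of about two million.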
Introducing AlphaBind for antibody optimization
The power of our unique, data-fueled approach is exemplified in our recently released preprint, where we demonstrate that our AlphaBind platform leads the industry for guided engineering of antibodies. We show that AlphaBind, pre-trained on a selection of approximately 7.5 million antibody-antigen binding measurements from our database, enables rapid antibody optimization by predicting strong binders after just one round of unguided data generation in the lab. Across four unique antibody-antigen systems, AlphaBind generated thousands of successful antibody variants with up to 11 mutations from their parental sequences. In addition to testing thousands of candidates in AlphaSeq, we tested 10 candidates by biolayer interferometry (BLI), again including up to 11 mutations, and found that 100% of our candidates expressed and 100% had improved binding over their respective parental antibodies. Importantly, while AlphaSeq training data performs best, we show that other types of experimental data, such as semi-quantitative mammalian display data, can also be used with the pre-trained AlphaBind model to perform rapid and effective antibody optimization. Along with the publication, we have made the pre-trained AlphaBind model and the fine-tuning AlphaSeq datasets publicly available.
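To make the "one round of data, then guided optimization" loop concrete, here is a deliberately simplified sketch. It is not AlphaBind's architecture or API: a ridge regression on one-hot encoded sequences stands in for fine-tuning a pre-trained model, random substitution stands in for candidate generation, and the parental sequence, the synthetic scores, and all function names are hypothetical. The only detail taken from the text is the cap of 11 mutations from the parent.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Flattened one-hot encoding of a fixed-length sequence."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AA_INDEX[aa] for aa in seq]] = 1.0
    return x.ravel()


def fit_ridge(seqs, scores, alpha=1.0):
    """Closed-form ridge regression: a toy stand-in for model fine-tuning."""
    X = np.stack([one_hot(s) for s in seqs])
    y = np.asarray(scores, dtype=float)
    y_mean = y.mean()
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ (y - y_mean))
    return w, y_mean


def predict(model, seq):
    """Predicted binding score for one sequence."""
    w, b = model
    return float(one_hot(seq) @ w + b)


def mutate(parent, n_mut, rng):
    """Copy of parent with n_mut distinct positions substituted."""
    seq = list(parent)
    for pos in rng.choice(len(parent), size=n_mut, replace=False):
        choices = [aa for aa in AMINO_ACIDS if aa != parent[pos]]
        seq[pos] = choices[rng.integers(len(choices))]
    return "".join(seq)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def propose(model, parent, max_mutations=11, n_candidates=2000, top_k=10, seed=0):
    """Sample variants within max_mutations of parent; keep the top-scoring ones."""
    rng = np.random.default_rng(seed)
    max_mutations = min(max_mutations, len(parent))
    candidates = {
        mutate(parent, int(rng.integers(1, max_mutations + 1)), rng)
        for _ in range(n_candidates)
    }
    return sorted(candidates, key=lambda s: predict(model, s), reverse=True)[:top_k]


# Synthetic demo: a hidden additive score plays the role of lab measurements.
rng = np.random.default_rng(42)
parent = "QVQLVESGGGLVQPGGSLRL"  # hypothetical 20-residue parental fragment
hidden_w = rng.normal(size=len(parent) * len(AMINO_ACIDS))
train_seqs = [mutate(parent, int(rng.integers(1, 4)), rng) for _ in range(500)]
train_scores = [float(one_hot(s) @ hidden_w) for s in train_seqs]

model = fit_ridge(train_seqs, train_scores)
best = propose(model, parent)
for seq in best[:3]:
    print(seq, hamming(seq, parent), round(predict(model, seq), 2))
```

In a real campaign the scores would come from a binding assay and the top candidates would go back to the lab for validation. The sketch only shows the shape of the loop: train on one round of data, score a bounded neighborhood of the parent, and rank.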
With this work, we have shown that AlphaSeq and AlphaBind do exceptionally well on an important and practical problem that has traditionally taken multiple rounds of iterative design and model training. Most critically, we demonstrate that AlphaBind can generate thousands of sequence-diverse, high-affinity antibody variants with just one lab experiment and several hundred dollars of compute. This allows us to further engineer antibodies for the best possible properties (high solubility, ease of expression, thermostability, and so on) while ensuring that on-target affinity remains high. We demonstrate this capability by designing 7 additional mutations onto one of our AlphaBind-optimized TIGIT antibodies to fix two predicted developability liabilities and revert the sequence to human Vh germline, entirely in silico. Then, with experimental assessment of only 12 candidates, we validate a candidate with good expression titer, no major sequence liabilities, a near-germline Vh sequence, and 300 femtomolar binding affinity. This is only a single example, chosen to highlight the power of our platform approach. We believe that the ability of our ML models to simultaneously predict and optimize affinity and other properties is a critical step on the journey from discovering to designing therapeutics.
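As a rough illustration of what an in silico check for sequence liabilities can look like, the sketch below scans candidates for a few commonly cited motifs: the N-linked glycosylation sequon (N-X-S/T, where X is not proline), an NG deamidation site, and a DG isomerization site. The motif list, function names, and example sequences are assumptions chosen for illustration. They are not the liability predictors used in the work described above, and real risk also depends on structural context.

```python
import re

# Commonly cited sequence-liability heuristics, assumed here for illustration.
LIABILITY_MOTIFS = {
    "N-linked glycosylation (N-X-S/T, X != P)": r"N[^P][ST]",
    "Asn deamidation (NG)": r"NG",
    "Asp isomerization (DG)": r"DG",
}


def find_liabilities(seq):
    """Return (motif name, 1-based position, matched text) for every hit."""
    hits = []
    for name, pattern in LIABILITY_MOTIFS.items():
        # The lookahead lets overlapping occurrences be reported separately.
        for m in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda h: h[1])


def filter_candidates(candidates, max_liabilities=0):
    """Keep candidates with at most max_liabilities flagged motifs."""
    return [s for s in candidates if len(find_liabilities(s)) <= max_liabilities]


# Hypothetical peptide fragments, not real antibody sequences.
flagged = find_liabilities("AANGSDGK")
for name, pos, text in flagged:
    print(f"{pos:>3}  {text:<4} {name}")

clean = filter_candidates(["AANGSDGK", "AAQGSEGK", "AANPSK"])
print(clean)
```

A filter like this would typically run alongside an affinity model, so that only variants predicted to keep binding and to avoid flagged motifs move on to experimental testing.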
Where do we go from here?
AlphaBind is now fully integrated into our antibody discovery and optimization workflow for our pipeline and partnerships, where we are applying AlphaBind to solve major challenges in drug discovery, and to drive significant impact for human health. For example, we are engineering broadly cross-reactive binders for infectious diseases, like COVID-19 and HIV, and have demonstrated that AlphaBind allows us to predict and validate binders with cross-reactivity against many hundreds of diverse viral variants. Similarly, we are mapping interactions between diverse antibodies and variants of the most likely future biothreats to preemptively train binding models that will enable a rapid therapeutic response.
We are also using AlphaSeq data to drive breakthroughs in determining the structure of protein complexes. Not only are we able to identify residues at the binding interface, but we are also able to use two-sided mutational data as a proxy for co-evolution data to generate high-confidence structural predictions where other approaches have consistently failed, including for antibody-antigen structural modeling.
Finally, we continue to improve our platform. AlphaBind2 will be pre-trained on significantly more antibody-antigen binding data, representing a far greater diversity of parental antibodies, antigens, and epitopes. To generate these data, we are using not only AlphaSeq, but also our in-house ML-derived humanoid phage library, which we use to enrich and characterize binders by panning and subcloning into AlphaSeq to measure binding properties including affinity, specificity, cross-reactivity, and epitope.
We also continue to improve AlphaBind through head-to-head experimental validations of various model architectures, embeddings, and training data inputs. Models are trained in parallel and each model predicts thousands of optimized sequences, which are synthesized and experimentally tested in AlphaSeq to quickly identify the most effective computational approaches. We demonstrated in our AlphaBind paper that we can effectively optimize antibodies with just one round of AlphaSeq data. The next milestones for AlphaBind are zero-shot optimization (optimizing parental antibodies with no additional data generation) and then de novo design (making antibodies in silico, entirely from scratch).
We are excited to hear more from those using our model to pursue their interests and research agendas. In opening this resource to the world, we hope to inspire new ideas and generate new opportunities for collaboration that contribute to our ultimate goal of improving human health.
Thank you for being part of our journey!
– David Younger and Randolph Lopez