Protein engineering is a natural fit for artificial intelligence research. Every protein is a chain of amino acids, and to optimize a protein's function, researchers replace the amino acid at a given position with another of the 20 standard amino acids. For a protein only 50 amino acids long, that gives 20^50, or approximately 1.13×10^65, possible sequences to test. That is 113 followed by 63 zeros, a number with more than five times as many zeros as one trillion.
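The size of that search space can be checked with a few lines of Python. This is a minimal sketch that assumes every position can independently hold any of the 20 standard amino acids; the function name `sequence_space` is illustrative and not taken from the study.

```python
# Count the possible sequences for a protein of a given length,
# assuming each position can independently be any of 20 amino acids.
AMINO_ACIDS = 20


def sequence_space(length: int) -> int:
    """Return the number of distinct amino acid sequences of this length."""
    return AMINO_ACIDS ** length


total = sequence_space(50)
print(f"{total:.2e}")   # scientific notation, two decimal places
print(len(str(total)))  # number of decimal digits in the exact count
```

Python integers have arbitrary precision, so the count is exact rather than a floating-point approximation.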
With so many potential combinations, far too many to test in the lab, protein engineering is an ideal challenge for AI. Modeling which of these combinations yields the best results is well suited to the computational power of this technology. However, an AI model is only as good as the data used to train it, and in some areas of protein engineering adequate data did not exist.
“One of the biggest bottlenecks in AI-guided protein engineering is not coming up with machine learning models; it is generating adequate and sufficient experimental data to train them. When optimizing protein function and manipulating protein activity, we had a very clear problem: there was not enough data to train an accurate model.”
Han Xiao, Professor of Chemistry, Biological Sciences, and Bioengineering at Rice University and Director of the SynthX Center
To build an AI model that can accurately predict how to optimize a protein's function and activity, Xiao's team first needed to generate enough activity data about a specific protein to train it. In a recent publication in Nature Biotechnology, Xiao's team and collaborators from Johns Hopkins University and Microsoft have done just that, sharing an approach that provides the necessary data and produces accurate models in just three days.
This approach, called Sequence Display, can generate more than 10 million data points in a single experiment. These data points are fed into a protein language model, which is then used to predict which changes to a protein's amino acids will produce the desired changes in activity or function.
“We were able to develop an activity-based barcoding system that records the activity of individual protein variants and generates the type of dataset needed to train machine learning models,” said Linqi Cheng, a graduate student at Rice University and lead author of the study. “The model was then able to predict mutations that significantly improved the activity of the proteins we were studying.”
The research team chose a small CRISPR-Cas protein for their proof of concept. Although this protein was valued for its size, it had limited activity against stretches of DNA targeted for cleavage. The researchers wanted to identify a version that could cut a wider range of DNA targets.
First, they mutated the DNA encoding the Cas9 protein to create many variants. Each variant was linked to an empty DNA barcode and to a special editor that altered the barcode according to the protein's activity level: the more active the protein, the more active the editor. As a result, the most active variants carried the most heavily edited barcodes. The barcodes were then read by next-generation sequencing, which scans each barcode and sorts each sequence by activity level.
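The idea of reading activity from barcodes can be illustrated with a short Python sketch. It is not the authors' pipeline: the data layout (one `(variant_id, barcode_edited)` pair per sequencing read), the function name `activity_scores`, and the use of the edited-read fraction as the activity score are all assumptions made for illustration.

```python
from collections import defaultdict


def activity_scores(reads, min_reads=1):
    """Score each variant by the fraction of its reads with an edited barcode.

    `reads` is a hypothetical iterable of (variant_id, barcode_edited) pairs.
    Variants with fewer than `min_reads` reads are dropped as too noisy.
    """
    counts = defaultdict(lambda: [0, 0])  # variant -> [edited reads, total reads]
    for variant, edited in reads:
        counts[variant][1] += 1
        if edited:
            counts[variant][0] += 1
    return {v: e / t for v, (e, t) in counts.items() if t >= min_reads}


# Toy reads, invented for illustration only.
reads = [
    ("variant_A", True), ("variant_A", True), ("variant_A", False),
    ("variant_B", False), ("variant_B", True),
    ("variant_B", False), ("variant_B", False),
]
scores = activity_scores(reads)
ranked = sorted(scores, key=scores.get, reverse=True)  # most active first
print(ranked)
```

In this toy data, a higher edited fraction stands in for higher protein activity, so the ranking puts the most active variant first. Scores of this kind, paired with each variant's sequence, are the sort of labeled data a model could be trained on.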
“AI does not replace experimentation here; rather, it depends on experimentation,” Cheng said. “Sequence Display provides us with a data foundation, and the model helps us search a much larger sequence space for strong candidates.”
The researchers were able to repeat the process with other proteins, including aminoacyl-tRNA synthetase, cytosine deaminase, and uracil glycosylase inhibitors. In each case, the barcoding experiments generated enough data points to train an AI model.
“What this approach provides is a practical framework for integrating AI and protein engineering,” said Xiao, who is also a Cancer Prevention and Research Institute of Texas (CPRIT) Scholar. “Rather than relying on machine learning as a standalone solution, we combine it with experimental platforms that generate high-quality training data. This synergy enables advanced research tools and more efficient discovery of next-generation therapeutic proteins.”
This research was supported by the SynthX Seed Award (SYN-IN-2024-002), the National Institutes of Health (R35-GM133706, R01-CA277838, and R01-AI165079 to HX), the Robert A. Welch Foundation (C-1970 to HX), the U.S. Department of Defense (W81XWH-21-1-0789, HT9425-23-1-0494, and HT9425-25-1-0021 to HX), a 2024 Rice Synthetic Biology Institute Seed Grant (to HX), and a Medical Research Award from the Robert J. Kleberg Jr. and Helen C. Kleberg Foundation.
Journal reference:
Cheng, L., et al. (2026). Sequence Display enables large sequence activity datasets for rapid protein evolution. Nature Biotechnology. DOI: 10.1038/s41587-026-03087-3. https://www.nature.com/articles/s41587-026-03087-3

