The number of known proteins is vanishingly small compared to the universe of theoretically possible proteins. Yet these known proteins are the main training data for future protein design. Understanding how well they represent the full range of potential protein diversity can therefore help inform strategies for a wide range of applications, including the development of therapeutics, biocatalysts, and biomaterials.
In a study published in PNAS, an international team from the Okinawa Institute of Science and Technology Graduate University (OIST), the Institute of Science and Technology Austria (ISTA), the University of Vienna, and the Centro de Astrobiología (CAB) investigated the relationship between protein evolution and sequence space and identified the limiting factors behind protein diversification. Their findings support the theory that DNA recombination was the driving force behind the formation of ancestral proteins, and they highlight the limitations of many state-of-the-art AI protein design methods.
“Modern AI methods are thought to be revolutionizing protein design, with the 2024 Nobel Prize in Chemistry awarded to the team behind AlphaFold. However, most of these AI design methods are trained on databases of known proteins. So without understanding how representative these known proteins are of the sequence space, how confident can we be that we can generate truly diverse protein designs in this way?”
Professor Fyodor Kondrashov, Head of the OIST Evolutionary and Synthetic Biology Unit
Explore the world of proteins
Imagine you have 20 or so different types of blocks that can be connected in different orders and quantities to create chains tens, hundreds, or even thousands of blocks long. Mapping all possible resulting chains creates a sequence space.
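To give a feel for the scale of the block analogy above, here is a minimal sketch that counts how many distinct chains of a given length exist. It assumes exactly 20 block types (the article says "20 or so"), and the function name `sequence_space_size` is illustrative, not something taken from the study.

```python
# Count the distinct chains of a fixed length that can be built from a
# fixed set of block types. Assumption: exactly 20 block types, standing
# in for the 20 standard amino acids.
ALPHABET_SIZE = 20


def sequence_space_size(length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Return the number of distinct sequences of exactly `length` blocks."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # Each position can independently hold any block type.
    return alphabet_size ** length


for length in (10, 100, 300):
    size = sequence_space_size(length)
    # The digit count minus one gives the order of magnitude.
    print(f"length {length}: about 10^{len(str(size)) - 1} sequences")
```

Even a chain of 100 blocks allows on the order of 10^130 different sequences, which is why the set of known proteins covers only a tiny part of the space.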
In the case of proteins, the shape and structure of the amino acid building blocks mean that only a small fraction of possible protein sequences can fold into the correct 3D shape to perform a biological function. They need the right chemical groups in the right places to create the interactions that maintain their 3D shape or bind to other molecules. Mapping the sequences that meet this requirement creates a smaller functional space.
Relatively few of these possible functional sequences are likely to have ever existed throughout evolutionary history. The researchers therefore set out to uncover how well this subset of known proteins represents the functional space.
The researchers began by mathematically describing the sequence space occupied by known proteins. They then constructed a model of protein evolution to understand the biological factors that control the diversification of a wide range of natural protein families. Finally, they used the model to predict how many functional sequences would be expected to exist for a given biological function.
By comparing known protein diversity with theoretical predictions of protein evolution, the researchers found that the influence of the evolutionary starting point far outweighs the influence of other important evolutionary processes.
“It’s not necessarily surprising that a major evolutionary limit is the starting point, but the scale of its importance is actually quite remarkable,” said Rada Isakova, a PhD student in the unit and first author of the study. “As an evolutionary biologist, I was intrigued to learn how unimportant selection and epistasis are in our results.”
What limits protein evolution?
Mutations in the genes that code for particular proteins can change the sequence of amino acids produced, leading to protein evolution. Natural selection limits which mutations persist over time, based on whether they improve or impair protein function or stability. Epistasis, in which the effect of one mutation depends on which other mutations are present, also constrains evolution: a mutation may have a limited effect on its own but a large effect in combination with certain other mutations.
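The idea of epistasis can be illustrated with a toy two-mutation example. This is only a sketch: the fitness values are invented for illustration, and the additive baseline used here is one common way to quantify epistasis, not necessarily the one used in the study.

```python
# Toy two-site illustration of epistasis. All numbers are invented.
# Keys are (mutation_A_present, mutation_B_present); values are relative fitness.
fitness = {
    (False, False): 1.00,  # ancestral sequence
    (True, False): 0.98,   # mutation A alone: small effect
    (False, True): 0.97,   # mutation B alone: small effect
    (True, True): 0.40,    # A and B together: large effect
}


def epistasis(f: dict) -> float:
    """Deviation of the double mutant from the additive expectation."""
    wild_type = f[(False, False)]
    effect_a = f[(True, False)] - wild_type
    effect_b = f[(False, True)] - wild_type
    # If the two mutations acted independently, their effects would add up.
    expected_double = wild_type + effect_a + effect_b
    return f[(True, True)] - expected_double


print(round(epistasis(fitness), 2))  # -0.55
```

A value of zero would mean the two mutations act independently; here the double mutant is far less fit than the single-mutation effects would suggest.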
Although both selection and epistasis are known to influence protein evolution, Isakova et al. found that the main factor limiting protein diversity is where proteins originated: known proteins have diverged relatively little from the regions of sequence space occupied by their ancestral proteins.
This study provides new insights into the origin of life and strengthens existing theories about early protein formation. Isakova explains: “Our simulations suggest that, given the time constraints, the first proteins of the last universal common ancestor could not have arisen simply by diverging through mutations from a single initial sequence. Instead, small pieces of DNA must have been shuffled and recombined to create new DNA molecules that could code for very different proteins.”
The research team also hopes this work will inspire experimental scientists to expand the known sequence space. Isakova commented, “Neural network approaches for predicting functional proteins are limited by the datasets we provide. Therefore, based on existing data, most methods cannot generalize beyond the current known sequence space. We see that there is still a huge area of sequence space left to explore, but new experimental data are needed to enable expansion into these unknown regions.”
This global collaboration was supported by an Adopting Sustainable Partnerships for Innovative Research Ecosystem (ASPIRE) grant from the Japan Science and Technology Agency (JST). The grant aims to build networks between top researchers in Japan and around the world and to develop future scientific leaders.
Source:
Okinawa Institute of Science and Technology Graduate University (OIST)
Journal reference:
DOI: 10.1073/pnas.2532018123

