Scientists at the Icahn School of Medicine at Mount Sinai have created a new artificial intelligence (AI) model that helps reveal how genes work together in human cells, providing a powerful new way to understand biology and disease.
The study, published online May 21 in Patterns, a Cell Press journal (https://doi.org/10.1016/j.patter.2026.101565), introduces the Gene Set Foundation Model (GSFM), which is designed to learn patterns in how genes are grouped and how they function across thousands of biological contexts. The research is inspired by advances in large language models (LLMs) such as ChatGPT, which learn how words acquire meaning from their context. In a similar way, GSFM learns how genes behave differently depending on the “context” of the cell.
“Genes rarely act alone. Instead, they participate in multiple biological processes and form different molecular groups depending on where and when they are active within the cell. Just as words can have different meanings in different sentences, a single gene can play different roles in different environments. We asked whether AI could learn the ‘meaning’ of genes in the same way that modern language models learn the meanings of words from context. Our GSFM was designed to do just that.”
Dr. Avi Ma’ayan, senior author, Professor of Pharmacology and Director of the Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai
This model provides a new way to understand the structural and functional organization of genes and their products within human cells. This improved understanding may ultimately support the development of better diagnostics, biomarkers, and treatments. By mapping how genes are related to each other across many biological contexts, GSFM creates a reference framework that helps scientists more effectively interpret complex multi-omic data sets, researchers say.
“The organization of genes within cells is one of the major unsolved problems in biology. GSFM helps address this problem by learning from millions of gene groups derived from published research and gene expression datasets,” says Dr. Ma’ayan.
The model can:
- Help identify the functions of poorly understood genes without immediate laboratory experiments
- Highlight genes involved in disease processes
- Suggest potential new drug targets and biomarkers
- Provide a reusable knowledge system for different types of biomedical data analysis tasks, including improved gene set enrichment analysis
Essentially, the researchers say, GSFM provides a new “map” of how genes work together in different situations.
To build the model, the researchers compiled millions of gene sets from published scientific studies and gene expression datasets. In total, the system learned from hundreds of thousands of independent research efforts.
The AI model was trained in a way similar to solving puzzles. It was given a portion of a gene set and asked to predict the missing pieces. Over time, it learned the underlying patterns that explain how genes are grouped and interact.
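The puzzle-style training described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the 30 percent masking fraction, and the recall metric are assumptions chosen for illustration, and the article does not say how GSFM actually hides genes or scores its predictions.

```python
import random


def make_masked_example(gene_set, mask_fraction=0.3, seed=None):
    """Split one gene set into visible genes (model input) and hidden genes (targets).

    mask_fraction is a hypothetical setting, not a value reported for GSFM.
    """
    genes = sorted(set(gene_set))
    if len(genes) < 2:
        raise ValueError("need at least two genes to hide one and keep one")
    rng = random.Random(seed)
    # Hide at least one gene, but always leave at least one visible.
    n_hidden = min(len(genes) - 1, max(1, round(len(genes) * mask_fraction)))
    hidden = set(rng.sample(genes, n_hidden))
    visible = [g for g in genes if g not in hidden]
    return visible, sorted(hidden)


def recall_at_k(ranked_predictions, hidden, k):
    """Fraction of the hidden genes that appear in the model's top-k guesses."""
    top_k = set(ranked_predictions[:k])
    return len(top_k & set(hidden)) / len(hidden)
```

A model trained this way would take `visible` as input and be rewarded for ranking the genes in `hidden` near the top of its predictions, which is what `recall_at_k` measures.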
The AI model was then benchmarked against other approaches and demonstrated superior performance, including the ability to identify gene-to-gene and gene-function relationships before they were confirmed in experiments. To assess this, the researchers trained the model using gene sets from publications up to a defined cutoff date and tested whether it could predict findings reported in studies published after that cutoff date.
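The cutoff-date evaluation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's actual pipeline: the record layout (`genes`, `published`), the function names, and the toy gene sets and dates are all made up for the example.

```python
from datetime import date
from itertools import combinations


def temporal_split(gene_sets, cutoff):
    """Train on gene sets published up to the cutoff; test on those published after."""
    train = [gs for gs in gene_sets if gs["published"] <= cutoff]
    test = [gs for gs in gene_sets if gs["published"] > cutoff]
    return train, test


def gene_pairs(gene_sets):
    """All unordered gene pairs that co-occur in at least one gene set."""
    pairs = set()
    for gs in gene_sets:
        pairs.update(frozenset(p) for p in combinations(sorted(gs["genes"]), 2))
    return pairs


def novel_pairs(train, test):
    """Pairs first seen together after the cutoff: the relationships a model must predict."""
    return gene_pairs(test) - gene_pairs(train)


# Toy records with made-up publication dates, for illustration only.
records = [
    {"genes": {"TP53", "MDM2"}, "published": date(2019, 5, 1)},
    {"genes": {"BRCA1", "BRCA2", "TP53"}, "published": date(2021, 3, 15)},
    {"genes": {"TP53", "MDM2", "BRCA1"}, "published": date(2023, 8, 2)},
]
train, test = temporal_split(records, cutoff=date(2022, 1, 1))
```

Because the model never sees anything published after the cutoff, correctly ranking the pairs returned by `novel_pairs(train, test)` counts as predicting a relationship before it was reported.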
“Unlike previous biological AI models that rely primarily on gene expression data, our GSFM is uniquely trained on a different and underutilized type of biological information: gene sets,” says Dr. Ma’ayan. “This approach allows the model to integrate diverse data from many diseases, experimental methods, and study conditions to create a unified representation of genetic relationships across biology.”
GSFM has the potential to enhance existing bioinformatics tools and improve the interpretation of data collected with omics technologies. One immediate application is gene set enrichment analysis, a method widely used in molecular biology research. By improving the way scientists interpret gene groups, this model could help uncover new biological insights from both existing and future datasets.
The research team plans to expand the system by combining GSFM with other AI-based models. One goal is to integrate it with language-based models to generate natural language descriptions of gene function. Another future direction is to combine GSFM with drug-focused AI models, with the long-term goal of predicting how drugs interact with cells and supporting the design of new therapeutics.
Source:
Mount Sinai Health System
Journal reference:
Clark, DJB, et al. (2026). GSFM: Gene set-based model pre-trained on a large collection of diverse gene sets. Patterns. DOI: 10.1016/j.patter.2026.101565. https://www.cell.com/patterns/fulltext/S2666-3899(26)00074-7

