Imagine a world where you could predict how a living thing will behave just by analyzing a sequence of letters. This is not science fiction or magic, but a goal scientists have pursued for a long time. DNA sequences, written in four nucleotides (A, T, C, and G), contain the basic instructions for life on Earth, from the smallest microorganisms to the largest mammals. Deciphering these sequences could unlock complex biological processes and transform fields such as personalized medicine and environmental sustainability.
However, despite this immense potential, decoding even the simplest microbial genome is a highly complex task. These genomes consist of millions of DNA base pairs that govern the interactions between DNA, RNA, and proteins, the three key elements of the central dogma of molecular biology. This complexity exists at multiple levels, from individual molecules to the entire genome, creating a vast field of genetic information that has evolved over billions of years.
Traditional computational tools have struggled to handle the complexity of biological sequences. However, the rise of generative AI has made it possible to scale to trillions of tokens and to capture complex relationships across sequences of tokens. Building on this progress, researchers at Arc Institute, Stanford University, and Nvidia have been working on AI systems that understand biological sequences in much the same way that large language models understand human text. They have now achieved a breakthrough by creating a model that captures both the multimodal nature of the central dogma and the complexity of evolution. This innovation could lead to the prediction and design of new biological sequences, from individual molecules to entire genomes. This article discusses how the technology works, its potential applications, the challenges it faces, and the future of genomic modeling.
EVO 1: A pioneering model in genome modeling
This line of research drew attention in late 2024 with the introduction of EVO 1, a groundbreaking model for analyzing and generating biological sequences across DNA, RNA, and proteins. Trained on 2.7 million prokaryotic and phage genomes, totaling 300 billion nucleotide tokens, it focused on integrating the central dogma of molecular biology by modeling the flow of genetic information from DNA to RNA to proteins. Its StripedHyena architecture, a hybrid model that uses convolutional filters and gates, efficiently handled long contexts of up to 131,072 tokens. This design allowed EVO 1 to link small sequence changes to broader system-wide and organism-level effects, bridging the gap between molecular biology and evolutionary genomics.
EVO 1 was a first step in the computational modeling of biological evolution. It predicted molecular interactions and the effects of genetic variation by analyzing evolutionary patterns in genetic sequences. However, as scientists sought to apply it to more complex eukaryotic genomes, the model's limitations became apparent. EVO 1 struggled to maintain single-nucleotide resolution across long DNA sequences and was computationally expensive for larger genomes. These challenges called for a more sophisticated model that could integrate biological data across multiple scales.
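To make the phrase "convolutional filters and gates" more concrete, here is a minimal sketch of a gated causal convolution in plain Python. It is a toy illustration only, not the StripedHyena implementation: the function names, the fixed kernel, and the scalar `gate_weight` are all assumptions made for this example.

```python
import math
from typing import List


def causal_conv(x: List[float], kernel: List[float]) -> List[float]:
    """Causal 1-D convolution: output[t] uses only x[t], x[t-1], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:  # never look at future positions
                acc += weight * x[t - k]
        out.append(acc)
    return out


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def gated_conv(x: List[float], kernel: List[float], gate_weight: float) -> List[float]:
    """Scale each convolution output by a data-dependent gate in (0, 1).

    gate_weight is a hypothetical scalar parameter; real models learn
    much richer gating functions.
    """
    conv = causal_conv(x, kernel)
    return [sigmoid(gate_weight * xi) * ci for xi, ci in zip(x, conv)]


signal = [1.0, 2.0, 3.0]
print(causal_conv(signal, [1.0, 1.0]))
print(gated_conv(signal, [1.0, 1.0], gate_weight=0.0))
```

The convolution mixes information from nearby positions, while the gate lets the input itself decide how much of that mixed signal passes through. Because neither step compares every position with every other one, operations of this kind are one way to keep long contexts affordable.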
EVO 2: A foundation model for genomic modeling
Building on lessons learned from EVO 1, the researchers launched EVO 2 in February 2025, advancing the field of biological sequence modeling. Trained on 9.3 trillion DNA base pairs, the model has learned to understand and predict the functional consequences of genetic variation across all domains of life, including bacteria, archaea, plants, fungi, and animals. EVO 2 has 40 billion parameters and can handle sequences of up to 1 million base pairs, an unprecedented length that previous models, including EVO 1, could not manage.
What distinguishes EVO 2 from its predecessors is its ability to model the entire central dogma of molecular biology: not only DNA sequences, but also the interactions between DNA, RNA, and proteins. This allows EVO 2 to accurately predict the effects of genetic mutations, from single-nucleotide changes to larger structural changes, in ways that were not previously possible.
An important feature of EVO 2 is its powerful zero-shot prediction capability, which lets it predict the functional effects of mutations without task-specific fine-tuning. For example, by analyzing DNA sequences alone, it accurately classifies clinically significant variants of BRCA1, a key gene in breast cancer research.
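One common way to do zero-shot scoring with a sequence model is to compare how likely the model finds the mutated sequence versus the reference sequence. The sketch below illustrates that idea with a tiny Markov model standing in for a genomic language model. Everything here is an assumption for illustration: `ToyMarkovModel`, `apply_snv`, and `zero_shot_score` are hypothetical names, the training string is synthetic, and none of this is the EVO 2 API.

```python
import math
from collections import defaultdict
from typing import Dict

ALPHABET = "ACGT"


class ToyMarkovModel:
    """Order-k Markov model over nucleotides (a stand-in, not EVO 2)."""

    def __init__(self, k: int = 2, pseudocount: float = 1.0) -> None:
        self.k = k
        self.pseudocount = pseudocount
        self.counts: Dict[str, Dict[str, float]] = defaultdict(
            lambda: defaultdict(float)
        )

    def fit(self, sequence: str) -> None:
        """Count which nucleotide follows each k-letter context."""
        for i in range(self.k, len(sequence)):
            self.counts[sequence[i - self.k:i]][sequence[i]] += 1.0

    def log_likelihood(self, sequence: str) -> float:
        """Sum of smoothed log-probabilities of each nucleotide given its context."""
        total = 0.0
        for i in range(self.k, len(sequence)):
            ctx_counts = self.counts.get(sequence[i - self.k:i], {})
            numer = ctx_counts.get(sequence[i], 0.0) + self.pseudocount
            denom = sum(ctx_counts.values()) + self.pseudocount * len(ALPHABET)
            total += math.log(numer / denom)
        return total


def apply_snv(reference: str, position: int, alt: str) -> str:
    """Return the reference with one nucleotide substituted (0-based position)."""
    return reference[:position] + alt + reference[position + 1:]


def zero_shot_score(model: ToyMarkovModel, reference: str, position: int, alt: str) -> float:
    """Log-likelihood of the variant minus that of the reference.

    More negative means the model finds the variant less plausible.
    """
    variant = apply_snv(reference, position, alt)
    return model.log_likelihood(variant) - model.log_likelihood(reference)


model = ToyMarkovModel(k=2)
model.fit("ACGT" * 50)  # synthetic training data, for illustration only
reference = "ACGTACGTACGT"
print(zero_shot_score(model, reference, position=5, alt="T"))
```

No labeled examples of harmful or harmless variants are used anywhere: the score comes entirely from what the model learned about typical sequences, which is what makes the approach zero-shot. A separate step, such as choosing a threshold on the score, would still be needed to turn it into a benign-versus-pathogenic call.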
Potential applications in biomolecular science
The power of EVO 2 opens new frontiers in genomics, molecular biology, and biotechnology. Some of the most promising applications include:
- Healthcare and drug discovery: EVO 2 can predict which gene variants are associated with a particular disease, helping to develop targeted therapies. For example, in a test involving variants of the breast cancer-associated gene BRCA1, EVO 2 achieved more than 90% accuracy in predicting which mutations were benign and which were pathogenic. Such insights could accelerate the development of new drugs and personalized therapies.
- Synthetic biology and genetic engineering: The ability of EVO 2 to generate whole genomes opens new pathways for designing synthetic organisms with desirable properties. Researchers can use EVO 2 to engineer genes with specific functions and to develop biofuels, environmentally friendly chemicals, and new therapeutics.
- Agricultural biotechnology: EVO 2 can be used to design genetically modified crops with improved properties such as drought resistance and pest resilience, contributing to global food security and agricultural sustainability.
- Environmental science: EVO 2 can be applied to design biofuels or to engineer proteins that break down environmental pollutants such as oil and plastics, contributing to sustainability efforts.
Challenges and future directions
Despite its impressive capabilities, EVO 2 faces challenges. One important hurdle is the computational cost of training and running the model. With a context window of 1 million base pairs and 40 billion parameters, EVO 2 requires substantial computational resources to function effectively. This makes it difficult for smaller research groups to take full advantage of its potential without access to high-performance computing infrastructure.
Furthermore, while EVO 2 excels at predicting the effects of genetic mutations, there is still much to learn about how to use it to design new biological systems from scratch. Generating realistic biological sequences is only the first step. The real challenge lies in understanding how to use this capability to create functional and sustainable biological systems.
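A rough back-of-the-envelope calculation shows why the parameter count alone is demanding. The sketch below estimates the memory needed just to hold the model weights. It assumes 2 bytes per parameter (16-bit precision), which is an assumption rather than a figure from the EVO 2 release, and it ignores the additional memory needed for activations over a long context.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


EVO2_PARAMETERS = 40e9  # parameter count quoted in this article

# Assumed precisions; the actual serving precision may differ.
print(weight_memory_gb(EVO2_PARAMETERS))     # 16-bit weights: 80.0 GB
print(weight_memory_gb(EVO2_PARAMETERS, 4))  # 32-bit weights: 160.0 GB
```

Under these assumptions, the weights alone would need roughly as much memory as a single high-end data-center accelerator provides, or more, before any sequence is processed.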
Accessibility and democratization of AI in genomics
One of the most exciting aspects of EVO 2 is that it is open source. To democratize access to advanced genomic modeling tools, Nvidia and its collaborators have released the model parameters, training code, and datasets. This open-access approach allows researchers around the world to explore and extend the capabilities of EVO 2, accelerating innovation across the scientific community.
Conclusion
EVO 2 is an important advance in genomic modeling, using AI to decode the complex genetic language of life. Its ability to model DNA sequences and their interactions with RNA and proteins opens up new possibilities in healthcare, drug discovery, synthetic biology, and environmental science. EVO 2 can predict the effects of genetic variation and design new biological sequences, with the potential to transform personalized medicine and enable sustainable solutions. However, its computational demands pose challenges, especially for small research teams. By releasing EVO 2 as open source, Nvidia and its collaborators allow researchers around the world to explore and extend its capabilities and to drive innovation in genomics and biotechnology. As the technology continues to evolve, it has the potential to reshape the future of the biological sciences and environmental sustainability.