§ 01 · domain

Foundation Models for Biology

The architectural shift underlying everything else.

DNA/genome foundation models

8 tools
Caduceus
Cornell (Kuleshov group)

Mamba/SSM-based, bidirectional, and reverse-complement equivariant; reported to outperform transformers roughly 10× its size on long-range tasks.
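Reverse-complement (RC) equivariance means that feeding a model the opposite DNA strand yields the same per-position output, read in reverse. The sketch below illustrates that property with toy per-position functions; it is not Caduceus code, and `gc_track`, `purine_track`, and `is_rc_equivariant` are hypothetical names introduced only for this example.

```python
from typing import Callable, List

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def gc_track(seq: str) -> List[float]:
    """Toy per-position 'model' (assumption): 1.0 at G or C, 0.0 elsewhere."""
    return [1.0 if base in "GC" else 0.0 for base in seq.upper()]


def purine_track(seq: str) -> List[float]:
    """Toy counter-example (assumption): 1.0 at A or G, 0.0 elsewhere."""
    return [1.0 if base in "AG" else 0.0 for base in seq.upper()]


def is_rc_equivariant(model: Callable[[str], List[float]], seq: str) -> bool:
    """Check that model(RC(seq)) equals model(seq) read in reverse."""
    return model(reverse_complement(seq)) == list(reversed(model(seq)))


if __name__ == "__main__":
    print(is_rc_equivariant(gc_track, "AACGTTG"))   # GC content is strand-symmetric
    print(is_rc_equivariant(purine_track, "AAC"))   # purines become pyrimidines
```

A model with this property built in does not need to learn strand symmetry from augmented data, which is one plausible reason a smaller equivariant model can compete with larger unconstrained ones.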

Evo
Arc Institute

First-generation genomic language model from the group, built on the StripedHyena architecture (hybrid attention + signal-processing operators); superseded by Evo 2.

Evo 2
Arc Institute

40B parameters, 1-megabase context length, trained on 9 trillion nucleotides spanning eukaryotes and prokaryotes. Operates at single-nucleotide resolution via the StripedHyena architecture. Capable of zero-shot variant effect prediction across DNA, RNA, and protein in a single model *and* whole-genome-scale generation.
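Zero-shot variant effect prediction with a sequence language model is commonly framed as a likelihood comparison: score the mutated sequence and the reference sequence under the model and take the difference. The sketch below shows that recipe with a first-order Markov model standing in for the language model. This is an assumption made so the example runs without any model weights; it does not use Evo or Evo 2, and `fit_markov`, `log_likelihood`, and `variant_score` are hypothetical helpers.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_markov(corpus):
    """Count dinucleotide transitions; a toy stand-in for a genomic language model."""
    counts = Counter()
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1
    return counts


def log_likelihood(counts, seq):
    """Sum of log P(next | prev) with add-one smoothing over the alphabet."""
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        row_sum = sum(counts[(prev, base)] for base in ALPHABET)
        total += math.log((counts[(prev, nxt)] + 1) / (row_sum + len(ALPHABET)))
    return total


def variant_score(counts, ref_seq, pos, alt_base):
    """Log-likelihood of the mutated sequence minus that of the reference.

    Negative values mean the toy model finds the variant less plausible.
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return log_likelihood(counts, alt_seq) - log_likelihood(counts, ref_seq)


if __name__ == "__main__":
    model = fit_markov(["ACGACGACGACG"])
    print(variant_score(model, "ACGACG", 2, "T"))  # disrupts the learned ACG repeat
```

With a real genomic model the same difference would be computed from the model's own per-token log-probabilities; no variant labels are used at any point, which is what makes the approach zero-shot.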

GENERator

Long-context generative DNA model.

GenomeOcean

Long-context generative DNA model.

HyenaDNA

Long-context generative DNA model.

Nucleotide Transformer
InstaDeep

Transformer family trained on multispecies genomes; strong on regulatory element prediction.

megaDNA

Long-context generative DNA model.

Protein language models

7 tools
AMPLIFY

Alternative protein language model (PLM) with a different scaling/data tradeoff.

ESM series
Meta → EvolutionaryScale

ESM-1b, ESM-2 (up to 15B params), **ESM-3** (multimodal: jointly reasons over sequence, structure, and function via discrete tokenization; published in *Science*, 2025), and ESM-C (efficient successor to ESM-2). The dominant family.
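"Discrete tokenization" of a continuous modality such as structure is typically done by mapping each feature vector to the index of its nearest entry in a learned codebook (vector quantization). The sketch below shows only that lookup step. It is a generic illustration, not ESM-3's tokenizer: the codebook and the per-residue feature vectors are made up, and `nearest_code` and `tokenize` are hypothetical names.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def tokenize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous feature vector to one discrete token id."""
    return [nearest_code(vec, codebook) for vec in features]


if __name__ == "__main__":
    # Made-up 2-D codebook and per-residue features, for illustration only.
    codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    features = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]
    print(tokenize(features, codebook))
```

Once every modality is expressed as integer token ids, a single transformer can consume sequence, structure, and function tracks with the same embedding-and-masking machinery.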

OmegaPLM

Alternative protein language model (PLM) with a different scaling/data tradeoff.

ProGen2

Alternative protein language model (PLM) with a different scaling/data tradeoff.

ProtGPT2
University of Bayreuth (Höcker lab)

GPT-2-style autoregressive model for protein sequence generation.

ProtT5
RostLab

T5-style encoder-decoder protein language model from the ProtTrans family.

ProtTrans
RostLab

Family of transformer protein language models that includes ProtT5.

RNA foundation models

3 tools
DRfold2

Part of the active RNA structure-prediction toolkit. Accuracy degrades sharply on novel folds without homologs.

RNA-FM

Language model for RNA, used together with structure prediction; RhoFold+ is currently state of the art for single-RNA tertiary structure prediction.

RhoFold+

Part of the active RNA structure-prediction toolkit. Accuracy degrades sharply on novel folds without homologs.

Single-cell / transcriptome foundation models

4 tools
Geneformer

Transformer-based single-cell model; models in this group were trained on roughly 30 to 100 million cells.

UCE
Universal Cell Embedding

Transformer-based single-cell model; models in this group were trained on roughly 30 to 100 million cells.

scFoundation

Transformer-based single-cell model; models in this group were trained on roughly 30 to 100 million cells.

scGPT

Transformer-based single-cell model; models in this group were trained on roughly 30 to 100 million cells.