Model Card for Geneformer

Model Details

Model Name: Geneformer
Model Versions: 1.0 and 2.0
Model Description: Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of single-cell transcriptomes. It is designed to enable context-specific predictions in network biology settings with limited data. The model supports downstream tasks such as gene network mapping, disease modeling, and therapeutic target identification.

In version 2.0, Geneformer introduces a cancer-tuned model variant using domain-specific continual learning. This variant was developed to address the exclusion of malignant cells from the initial pretraining due to their propensity for gain-of-function mutations. The cancer-tuned model underwent additional training with ~14 million cells from cancer studies, including matched healthy controls, to provide contrasting context. This approach allows the model to better understand gene network rewiring in malignancy while maintaining its general knowledge of gene network dynamics.
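
Conceptually, the cancer tuning continues the same self-supervised masked-gene objective on the cancer corpus instead of training a new model from scratch. A minimal sketch of this continual-learning pattern with Hugging Face transformers follows; the checkpoint, tokenizer, dataset, and hyperparameters are illustrative assumptions, not the published training setup.

from transformers import (AutoModelForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

def continue_pretraining(checkpoint, tokenizer, cancer_dataset, out_dir):
    # Resume the masked-gene-prediction objective on a domain-specific corpus
    # (cancer cells plus matched healthy controls) rather than training anew
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)
    args = TrainingArguments(
        output_dir=out_dir,
        learning_rate=1e-5,               # conservative LR to limit forgetting
        num_train_epochs=1,
        per_device_train_batch_size=8,
    )
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                               mlm_probability=0.15)
    trainer = Trainer(model=model, args=args,
                      train_dataset=cancer_dataset, data_collator=collator)
    trainer.train()
    return model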

When to use each model:

  • Base pretrained model: use for general transcriptomic analysis tasks and non-cancer-specific applications.
  • Cancer-tuned model: use for cancer-specific analyses, tumor microenvironment studies, and predicting factors that could shift cells to tumor-restricting or immune-activating states.

Model Versions

Geneformer has two main versions:

Version 1.0:

  • Pretrained on approximately 30 million single-cell transcriptomes
  • Input size of 2048 genes per cell
  • Focused on single-task learning

Version 2.0:

  • Pretrained on Genecorpus-103M, comprising ~103 million human single-cell transcriptomes
  • Initial self-supervised pretraining with ~95 million cells, excluding cells with high mutational burdens
  • Expanded input/context size of 4096 genes per cell
  • Employs multi-task learning to jointly learn cell types, tissues, disease states, and developmental stages
  • Includes a cancer-tuned model variant using domain-specific continual learning
  • Supports model quantization for resource-efficient fine-tuning and inference (see the sketch below)

Key improvements in v2.0:

  • Larger and more diverse pretraining corpus
  • Increased model parameters and expanded input size
  • Multi-task learning for context-specific representations of gene network dynamics
  • Improved zero-shot predictions in diverse downstream tasks
  • Cancer-specific tuning for tumor microenvironment analysis
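
As a rough illustration of the quantization support noted above, the sketch below loads a transformer checkpoint with 8-bit weights via bitsandbytes before fine-tuning or inference. The checkpoint name is illustrative, and this is a minimal sketch under those assumptions, not the exact recipe used for v2.0.

import torch
from transformers import AutoModelForMaskedLM, BitsAndBytesConfig

# Load weights in 8-bit to reduce memory for fine-tuning and inference
# (requires the bitsandbytes package; "ctheodoris/Geneformer" is shown
# here as an illustrative checkpoint name)
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForMaskedLM.from_pretrained(
    "ctheodoris/Geneformer",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
print(model.get_memory_footprint())  # bytes used by the quantized weights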

Available Models for each Version

Version 1.0 (30M dataset)

  • gf-6L-30M-i2048
    • 6 layers
    • 2048 input size
    • Trained on ~30 million cells
  • gf-12L-30M-i2048
    • 12 layers
    • 2048 input size
    • Trained on ~30 million cells

Version 2.0 (95M dataset)

  • gf-12L-95M-i4096
    • 12 layers
    • 4096 input size
    • Trained on ~95 million cells
  • gf-20L-95M-i4096
    • 20 layers
    • 4096 input size
    • Trained on ~95 million cells
  • gf-12L-95M-i4096-CLcancer
    • 12 layers
    • 4096 input size
    • Initially trained on ~95 million cells
    • Further tuned on ~14 million cancer cells

Model Developers

Developed by: Christina V. Theodoris, who conceived of the work, developed Geneformer, assembled Genecorpus-30M, and designed and performed computational analyses.

Contact Information: christina.theodoris@gladstone.ucsf.edu

License: Apache-2.0

Model Type

Architecture: Transformer-based

Domain: Cell Biology, Bioinformatics

Input Data: Single-cell transcriptomes

Model Purpose

Technical usage:

  • Tokenizing transcriptomes
  • Pre-training
  • Hyperparameter tuning
  • Fine-tuning
  • Extracting and plotting cell embeddings
  • In silico perturbation (see the sketch below)
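
The Geneformer codebase ships dedicated tooling for perturbation analysis; the sketch below only illustrates the underlying idea, assuming a hypothetical embed_fn that maps a rank-value-encoded token list to a cell embedding vector.

import numpy as np

def in_silico_delete(tokens, target_gene, embed_fn):
    # Baseline cell embedding from the unperturbed rank encoding
    baseline = embed_fn(tokens)
    # Simulate deletion by removing the gene's token; remaining ranks shift up
    perturbed = embed_fn([t for t in tokens if t != target_gene])
    # Cosine shift in embedding space quantifies the predicted impact
    cos = np.dot(baseline, perturbed) / (
        np.linalg.norm(baseline) * np.linalg.norm(perturbed)
    )
    return 1.0 - cos  # larger shift => greater predicted impact on cell state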

Broader research applications:

  • Research in genomics and network biology
  • Disease modeling with limited patient data
  • Identification of candidate therapeutic targets
  • Prediction of gene dosage sensitivity and chromatin dynamics
  • Context-specific predictions in gene regulatory networks

Training Data

Data Sources:

  • Publicly available single-cell transcriptomic datasets
  • Genecorpus-30M is available on the Hugging Face Dataset Hub
  • Genecorpus-103M will be available on the Hugging Face Dataset Hub (coming soon)

Data Volume:

  • Version 1.0: 29.9 million single-cell transcriptomes across a wide range of tissues
  • Version 2.0: ~103 million human single-cell transcriptomes (Genecorpus-103M), including:
    • ~95 million cells for initial self-supervised pretraining
    • ~14 million cells from cancer studies for domain-specific continual learning

Preprocessing:

  • Exclusion of cells with high mutational burdens
  • Metrics established for scalable filtering to exclude possible doublets and/or damaged cells
  • Rank value encoding of transcriptomes, where genes are ranked by scaled expression within each cell (see the sketch below)
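
A minimal sketch of rank value encoding follows. The production tokenizers in the Geneformer and Helical packages handle the normalization factors and special tokens; gene_medians below stands in for the corpus-wide per-gene scaling factors.

import numpy as np

def rank_value_encode(counts, gene_ids, gene_medians, max_len=4096):
    # Scale each gene's count by its corpus-wide normalization factor
    # so ubiquitously high-expressed genes are deprioritized
    scaled = counts / gene_medians
    detected = scaled > 0
    order = np.argsort(-scaled[detected])   # descending scaled expression
    # The truncated, ordered gene IDs form the cell's token sequence
    return gene_ids[detected][order][:max_len]

# Toy example: three genes in one cell
counts = np.array([5.0, 0.0, 2.0])
gene_ids = np.array(["ENSG_A", "ENSG_B", "ENSG_C"])
gene_medians = np.array([10.0, 1.0, 1.0])
print(rank_value_encode(counts, gene_ids, gene_medians))  # ['ENSG_C' 'ENSG_A']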

Model Performance

Evaluation Metrics:

  • Predictive accuracy in distinguishing:
    • With fine-tuning:
      • Transcription factor dosage sensitivity
      • Chromatin dynamics (bivalently marked promoters)
      • Transcription factor regulatory range
      • Gene network centrality
      • Transcription factor targets
      • Cell type annotation
      • Batch integration
      • Cell state classification across differentiation
      • Disease classification
      • In silico perturbation to determine disease-driving genes
      • In silico treatment to determine candidate therapeutic targets
    • With zero-shot learning:
      • Batch integration
      • Gene context specificity
      • In silico reprogramming
      • In silico differentiation
      • In silico perturbation to determine impact on cell state
      • In silico perturbation to determine transcription factor targets
      • In silico perturbation to determine transcription factor cooperativity

Testing Data:

  • Held-out subsets of the training dataset
  • Additional validation using publicly available datasets
  • Experimental validation for:
    • Zero-shot prediction of a novel transcription factor in cardiomyocytes, experimentally shown to affect cardiomyocytes' contractile force generation
    • In silico treatment analysis predicting candidate therapeutic targets that significantly improved contractile force generation of cardiac microtissues in an iPS cell model of cardiomyopathy

Model Limitations

Known Limitations:

  • May not generalize well to newly discovered tissues or rare gene variants
  • Performance may vary across different single-cell sequencing technologies

Future Improvements:

  • Integration of new data sources to improve model robustness
  • Enhancements in model architecture to better handle diverse transcriptomic profiles

How to Use

Input Format:

  • Rank value encoded single-cell transcriptomes

Output Format:

  • Contextual gene and cell embeddings
  • Contextual attention weights (see the sketch below)
  • Contextual predictions
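
Contextual attention weights can be pulled from any Hugging Face transformer by enabling output_attentions; the checkpoint name and dummy inputs below are illustrative assumptions, and real inputs would come from the rank value tokenizer.

import torch
from transformers import AutoModel

# Illustrative checkpoint name; real inputs come from the rank value tokenizer
model = AutoModel.from_pretrained("ctheodoris/Geneformer", output_attentions=True)
input_ids = torch.randint(0, model.config.vocab_size, (1, 64))  # dummy token IDs

with torch.no_grad():
    out = model(input_ids=input_ids)

embeddings = out.last_hidden_state   # contextual gene embeddings
attentions = out.attentions          # one tensor per layer
print(attentions[0].shape)           # (batch, heads, seq_len, seq_len)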

Example Usage:

from helical.models.geneformer import Geneformer, GeneformerConfig
import anndata as ad

# Example configuration
model_config = GeneformerConfig(model_name="gf-12L-95M-i4096", batch_size=10)
geneformer_v2 = Geneformer(model_config)

# Example usage for base pretrained model
ann_data = ad.read_h5ad("anndata_file.h5ad")
dataset = geneformer_v2.process_data(ann_data)
embeddings = geneformer_v2.get_embeddings(dataset)
print("Base model embeddings shape:", embeddings.shape)

# Example usage for cancer-tuned model
model_config_cancer = GeneformerConfig(model_name="gf-12L-95M-i4096-CLcancer", batch_size=10)
geneformer_v2_cancer = Geneformer(model_config_cancer)

cancer_ann_data = ad.read_h5ad("anndata_file.h5ad")
cancer_dataset = geneformer_v2_cancer.process_data(cancer_ann_data)
cancer_embeddings = geneformer_v2_cancer.get_embeddings(cancer_dataset)
print("Cancer-tuned model embeddings shape:", cancer_embeddings.shape)

Example Fine-Tuning:

from helical.models.geneformer import GeneformerConfig, GeneformerFineTuningModel
import anndata as ad

# Load the data
ann_data = ad.read_h5ad("/home/matthew/helical-dev/helical/yolksac_human.h5ad")

# Use a small subset of cells for this example and get the labels for fine-tuning
ann_data = ann_data[:10]
cell_types = list(ann_data.obs["cell_types"])
label_set = set(cell_types)

# Create a GeneformerConfig object
geneformer_config = GeneformerConfig(model_name="gf-12L-95M-i4096", batch_size=10)

# Create a GeneformerFineTuningModel object
geneformer_fine_tune = GeneformerFineTuningModel(geneformer_config=geneformer_config, fine_tuning_head="classification", output_size=len(label_set))

# Process the data
dataset = geneformer_fine_tune.process_data(ann_data)

# Add column to the dataset
dataset = dataset.add_column('cell_types', cell_types)

# Create a dictionary to map cell types to ids
class_id_dict = {label: i for i, label in enumerate(label_set)}

def classes_to_ids(example):
    example["cell_types"] = class_id_dict[example["cell_types"]]
    return example

# Convert cell types to ids
dataset = dataset.map(classes_to_ids, num_proc=1)

# Fine-tune the model
geneformer_fine_tune.train(train_dataset=dataset, label="cell_types")

# Get logits from the fine-tuned model
outputs = geneformer_fine_tune.get_outputs(dataset)
print(outputs[:10])

# Get embeddings from the fine-tuned model
embeddings = geneformer_fine_tune.get_embeddings(dataset)
print(embeddings[:10])

Contact

christina.theodoris@gladstone.ucsf.edu

Citation

@article{theodoris2023transfer,
  title={Transfer learning enables predictions in network biology},
  author={Theodoris, Christina V and Xiao, Ling and Chopra, Anant and Chaffin, Mark D and Al Sayed, Zeina R and Hill, Matthew C and Mantineo, Helene and Brydon, Elizabeth M and Zeng, Zexian and Liu, X Shirley and Ellinor, Patrick T},
  journal={Nature},
  volume={618},
  pages={616--624},
  year={2023},
  doi={10.1038/s41586-023-06139-9}
}
@article{chen2024quantized,
  title={Quantized multi-task learning for context-specific representations of gene network dynamics},
  author={Chen, H and Venkatesh, M S and Gomez Ortega, J and Mahesh, S V and Nandi, T and Madduri, R and Pelka, K and Theodoris, C V},
  journal={bioRxiv},
  year={2024},
  month={aug},
  day={19},
  doi={10.1101/2024.08.16.608180},
  note={co-first authors: H Chen, M S Venkatesh; co-senior authors: K Pelka, C V Theodoris; corresponding author: C V Theodoris}
}

Author contributions

C.V.T. conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. L.X., A.C., Z.R.A.S., M.C.H., H.M. and E.M.B. performed experimental validation in engineered cardiac microtissues. M.D.C. performed preprocessing, cell annotation and differential expression analysis of the cardiomyopathy dataset. Z.Z. provided data from the TISCH database for inclusion in Genecorpus-30M. X.S.L. and P.T.E. designed analyses and supervised the work. C.V.T., X.S.L. and P.T.E. wrote the manuscript.

HC and MSV developed the models and designed/performed computational analyses. HC assembled the pretraining corpora and developed the continual learning method. MSV developed the multi-task learning method and quantization strategy. JGO contributed to model pretraining and corpus assembly.