Model Card for scGPT
Model Details
Model Name: scGPT
Model Version: 1.0
Model Description: scGPT is a large-scale, self-supervised, transformer-based model pre-trained on more than 33 million human cells under non-disease conditions. It is designed to perform a variety of tasks, including cell type annotation, multi-batch integration, multi-omic integration, in silico perturbation response prediction, and gene regulatory network inference. Pre-training on extensive single-cell RNA sequencing data gives the model a foundational understanding of cellular biology.
Model Developers
Developed By: Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, Bo Wang. See the Author contributions section below for specific contributions.
Contact Information: Bo Wang (bowang@vectorinstitute.ai)
License: MIT License Copyright (c) 2022 suber
Model Type
Architecture: Transformer-based
Domain: Cell Biology, Bioinformatics
Data Modality: Single-cell transcriptomics data
Model Purpose
Technical usage:
- Tokenizing transcriptomes
- Tokenizing conditions, i.e. meta-information associated with individual genes, such as perturbation experiment alterations, which are indicated by perturbation tokens (see the sketch after this list)
- Pre-training
- Fine-tuning
- Extracting and plotting cell embeddings
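To make the tokenization steps above concrete, here is a minimal sketch of how gene names and perturbation conditions might be mapped to token IDs. The vocabulary, gene names, and perturbation flags are illustrative assumptions, not the model's actual vocabulary or API:
# Minimal tokenization sketch (illustrative; scGPT builds its real
# vocabulary from the pre-training corpus).
gene_vocab = {"<pad>": 0, "<cls>": 1, "CD3D": 2, "CD19": 3, "NKG7": 4}

def tokenize(gene_names, is_perturbed):
    """Map gene names to token IDs and mark perturbed genes with a condition token."""
    gene_tokens = [gene_vocab[name] for name in gene_names]
    condition_tokens = [1 if flag else 0 for flag in is_perturbed]
    return gene_tokens, condition_tokens

genes = ["CD3D", "CD19", "NKG7"]
perturbed = [False, True, False]  # e.g. CD19 knocked out in this experiment
print(tokenize(genes, perturbed))  # ([2, 3, 4], [0, 1, 0])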
Broader research applications:
- Cell type annotation
- Perturbation response prediction
- Batch correction when integrating multiple scRNA-seq datasets
- Integrative representation learning for single-cell multi-omic data
- Gene regulatory network inference
Training Data
Data Sources:
- Publicly available datasets, described in the Data Availability section of the manuscript
Data Volume:
- Pre-trained on data from more than 33 million human cells under non-disease conditions. This comprehensive dataset spans a wide range of cell types from 51 organs or tissues and 441 studies
Preprocessing:
- Normalization and scaling to ensure consistency across datasets
- Value binning to convert all expression counts into relative values (a minimal sketch follows)
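The sketch below illustrates one way such per-cell binning could look, assigning nonzero counts to equal-frequency (quantile) bins. The bin count and the exact binning rule here are assumptions for illustration; see the manuscript's Methods for the precise procedure:
import numpy as np

def bin_expression(counts, n_bins=51):
    # Sketch of per-cell value binning: zeros stay 0, nonzero counts are
    # mapped to 1..n_bins-1 by the quantiles of this cell's own nonzero
    # values, making values comparable across sequencing depths.
    binned = np.zeros_like(counts, dtype=int)
    nonzero = counts > 0
    if nonzero.any():
        edges = np.quantile(counts[nonzero], np.linspace(0, 1, n_bins))
        binned[nonzero] = np.digitize(counts[nonzero], edges[1:-1]) + 1
    return binned

cell = np.array([0, 3, 0, 12, 1, 7])
print(bin_expression(cell, n_bins=5))  # [0 2 0 4 1 3]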
Model Performance
Evaluation Metrics:
- Classification metrics: Accuracy, Precision, Recall, Macro F1 (see the scikit-learn sketch after this list)
- Biological conservation metrics: NMI_cell, ARI_cell, ASW_cell
- Batch correction metrics: ASW_batch, GraphConn
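For the classification metrics, a minimal scikit-learn sketch is shown below. The label arrays are hypothetical, and the biological conservation and batch correction metrics come from the scIB benchmarking suite and are not reproduced here:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical predicted vs. true cell-type labels, for illustration only.
y_true = ["T cell", "B cell", "T cell", "NK cell", "B cell"]
y_pred = ["T cell", "B cell", "NK cell", "NK cell", "B cell"]

accuracy = accuracy_score(y_true, y_pred)
# Macro-averaging weights every cell type equally, so rare populations
# count as much as abundant ones.
precision, recall, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} macro_f1={macro_f1:.3f}")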
Testing Data:
- Held-out subsets of the training dataset
- Additional external validation datasets from independent studies
Model Limitations
Known Limitations:
- The current pretraining does not inherently mitigate batch effects, and thus the model’s zero-shot performance could be constrained on datasets with substantial technical variation
- Evaluating the model is also complex, given the frequent absence of definitive biological ground truths and the variation in data quality
Future Improvements:
- Pretraining on a larger-scale dataset with more diversity, including multi-omic data, spatial omics and various diseased conditions
- Incorporation of perturbation and temporal data in the pretraining stage, enabling the model to learn causal relationships and infer how genes and cells respond to changes over time
- Development of techniques that allow the pretrained model to understand and adapt to different tasks and contexts in a zero-shot setting without the need for fine-tuning
How to Use
Input Format:
- The input to scGPT consists of three main components: (1) gene (or peak) tokens, (2) expression values (a cell-by-gene matrix) and (3) condition tokens, as sketched below
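A purely illustrative sketch of the three aligned input components for a small batch is given below; real inputs are produced by the model's own tokenizer and vocabulary:
import numpy as np

# Illustrative input triplet for a batch of 2 cells x 4 genes.
gene_tokens = np.array([[5, 17, 42, 3],
                        [5, 17, 42, 3]])       # vocabulary IDs, one row per cell
expr_values = np.array([[0, 4, 1, 0],
                        [2, 0, 3, 1]])         # binned expression values
condition_tokens = np.zeros_like(gene_tokens)  # e.g. 0 = unperturbed
assert gene_tokens.shape == expr_values.shape == condition_tokens.shape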
Output Format:
- Gene and cell embeddings, plus task-specific outputs such as predicted cell types (in JSON format) and integrated multi-modal representations
Example Usage:
from helical.models.scgpt import scGPT, scGPTConfig
import anndata as ad

# Configure and instantiate the model
scgpt_config = scGPTConfig(batch_size=10)
scgpt = scGPT(configurer=scgpt_config)

# Load a dataset in AnnData format, preprocess it, and extract cell embeddings
adata = ad.read_h5ad("dataset.h5ad")
data = scgpt.process_data(adata)
embeddings = scgpt.get_embeddings(data)
print(embeddings.shape)
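To plot the extracted cell embeddings, one common follow-up is a UMAP via scanpy. This is a sketch assuming embeddings is the (n_cells, d) array returned above and that adata.obs contains a cell_type column:
import scanpy as sc

# Attach the scGPT embeddings and build a neighbour graph on them
adata.obsm["X_scgpt"] = embeddings
sc.pp.neighbors(adata, use_rep="X_scgpt")
sc.tl.umap(adata)
sc.pl.umap(adata, color="cell_type")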
Example Fine-Tuning:
from helical.models.scgpt import scGPTFineTuningModel, scGPTConfig
import anndata as ad

# Load the desired dataset
adata = ad.read_h5ad("dataset.h5ad")

# Get the desired label class
cell_types = list(adata.obs.cell_type)

# Get unique labels
label_set = set(cell_types)

# Create the fine-tuning model with the relevant configs
scgpt_config = scGPTConfig(batch_size=10)
scgpt_fine_tune = scGPTFineTuningModel(scGPT_config=scgpt_config, fine_tuning_head="classification", output_size=len(label_set))

# Process the data for training
data = scgpt_fine_tune.process_data(adata)

# Map each class label to a unique integer for training
class_id_dict = {label: i for i, label in enumerate(label_set)}
cell_types = [class_id_dict[label] for label in cell_types]

# Fine-tune
scgpt_fine_tune.train(train_input_data=data, train_labels=cell_types)
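After training, predictions can be drawn from the fine-tuned model. The sketch below assumes the model exposes a get_outputs method returning one score per class, as in Helical's fine-tuning examples, and inverts the label mapping built above:
import numpy as np

# Assumption: get_outputs returns an (n_cells, n_classes) score array.
outputs = scgpt_fine_tune.get_outputs(data)
predicted_ids = np.argmax(outputs, axis=1)

# Invert the class mapping to recover cell-type names
id_class_dict = {i: label for label, i in class_id_dict.items()}
predicted_cell_types = [id_class_dict[i] for i in predicted_ids]
print(predicted_cell_types[:5])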
Citation
To cite the scGPT model, please use the following reference:
@article{cui2023scGPT,
  title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
  author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Duan, Nan and Wang, Bo},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
For more details and updates, visit the scGPT GitHub repository.
Author contributions
H.C. developed the concept of the work and contributed to design and implementation of the algorithm. C.W. and K.P. contributed to design and implementation of the algorithm. H.C., C.W., H.M., K.P. and F.L. contributed to the analysis of computational experiments. H.C. and C.W. drafted the initial version of the manuscript. H.C., C.W., H.M., K.P., F.L. and B.W. contributed to revision of the work. N.D. contributed to design of the algorithm. B.W. contributed to the conception and design of the work.