Model Card for Universal Cell Embedding (UCE)
Model Details
Model Name: Universal Cell Embedding (UCE)
Model Version: 1.0
Model Description: A large-scale, self-supervised, transformer-based model pre-trained on more than 36 million cells to create the Integrated Mega-scale Atlas, which spans more than 1,000 uniquely named cell types from hundreds of experiments, dozens of tissues, and eight species. UCE represents a sample's genes by their protein products, using a large protein language model. This allows UCE to meaningfully represent any gene from any species, regardless of whether that species appeared in the training data. The model enables cell type annotation prediction, hypothesis generation, disease state comparison, new data mapping, and integration of diverse single-cell datasets, and it opens the door to the discovery of novel cell type functions.
Model Developers
Developed By: Yanay Rosen and Yusuf Roohani conceived the study, performed research, contributed new analytical tools, designed algorithmic frameworks, analyzed data, performed experiments, and developed the software. See the Author contributions section below for the full list.
Contact Information: jure@cs.stanford.edu, quake@stanford.edu
License: MIT License Copyright (c) 2023 Yanay Rosen, Yusuf Roohani, Jure Leskovec
Model Type
Architecture: Transformer-based
Domain: Cell Biology, Bioinformatics
Input Data: Single-cell transcriptomics data
Model Purpose
Technical usage:
- Tokenizing each gene according to its protein product
- Pre-training
- Running in a zero-shot setting
- Extracting and plotting cell embeddings
Broader research applications:
- Designed to address questions in cell and molecular biology
- Generating representations of new single-cell expression datasets without any model fine-tuning or retraining, while remaining robust to dataset- and batch-specific artifacts
- Cell type prediction in large single-cell datasets with no additional model retraining (see the sketch after this list)
- Mapping of new data into a universal embedding space that aligns cell types across tissues and species
- Hypothesis generation in biological research
- Novel cross-dataset discoveries
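As a concrete illustration of zero-shot cell type prediction and new-data mapping, the snippet below sketches label transfer by nearest neighbours in the shared embedding space. The arrays are random placeholders standing in for UCE embeddings of an annotated reference atlas and a new query dataset; the embedding dimension, neighbour count, and metric are illustrative choices, not settings prescribed by the paper.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder arrays: in practice these come from uce.get_embeddings(...)
rng = np.random.default_rng(0)
ref_embeddings = rng.normal(size=(100, 1280))     # annotated reference atlas
ref_labels = np.array(["T cell", "B cell"] * 50)  # reference cell type labels
query_embeddings = rng.normal(size=(10, 1280))    # new, unannotated dataset

# Because UCE maps all datasets into one universal space, labels can be
# transferred to new cells by nearest neighbours, with no retraining
knn = KNeighborsClassifier(n_neighbors=15, metric="cosine")
knn.fit(ref_embeddings, ref_labels)
predicted_cell_types = knn.predict(query_embeddings)
print(predicted_cell_types)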
Training Data
Data Sources:
- Public single-cell transcriptomic datasets (e.g., CellXGene, various GEO datasets)
- Data from multiple species (including humans, mice, lemurs, zebrafish, pigs, monkeys, and frogs) and tissues to ensure diversity
- Download the full list of datasets used to train UCE here
Data Volume:
- Trained across more than 300 datasets consisting of over 36 million cells and more than 1,000 different cell types
Preprocessing:
The creation of the Integrated Mega-scale Atlas involved filtering:
- Duplicate cells, by retaining primary cells only
- For datasets from the CxG Census, cells with fewer than 200 expressed genes and genes detected in fewer than 10 cells

No highly variable gene selection was applied.
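For reference, a minimal sketch of these filters using scanpy is shown below; the file name and the is_primary_data column follow CELLxGENE conventions and are assumptions here, not part of the released pipeline.
import scanpy as sc

adata = sc.read_h5ad("raw_dataset.h5ad")  # placeholder path

# Drop duplicate cells by keeping primary cells only
adata = adata[adata.obs["is_primary_data"]].copy()

# Keep cells with at least 200 expressed genes
sc.pp.filter_cells(adata, min_genes=200)

# Keep genes detected in at least 10 cells
sc.pp.filter_genes(adata, min_cells=10)

# No highly variable gene selection is applied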
Model Performance
Evaluation Metrics:
- Zero-shot embedding quality and clustering, using metrics from the single-cell integration benchmark (a minimal stand-in is sketched after this list)
- Cell type organization
- Comparison to cell ontology
- Zero-shot cell type alignment to Integrated Mega-scale Atlas
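The paper relies on the single-cell integration benchmark (scib) metric suite; as a minimal stand-in, the sketch below scores how well Leiden clusters of the UCE embedding agree with known labels via ARI and NMI. It assumes an ann_data object with a cell_type column and an embeddings array as produced under Example Usage below.
import scanpy as sc
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Assumed inputs: `ann_data` with a "cell_type" obs column, and `embeddings`
# returned by uce.get_embeddings(...) (see Example Usage below)
ann_data.obsm["X_uce"] = embeddings
sc.pp.neighbors(ann_data, use_rep="X_uce")  # kNN graph on the UCE space
sc.tl.leiden(ann_data, key_added="uce_leiden")

ari = adjusted_rand_score(ann_data.obs["cell_type"], ann_data.obs["uce_leiden"])
nmi = normalized_mutual_info_score(ann_data.obs["cell_type"], ann_data.obs["uce_leiden"])
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")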
Testing Data:
- Held-out subsets of training datasets
- External validation using diverse single-cell datasets
Ethical Considerations
Bias and Fairness:
- Inclusion of diverse species and cell types to minimize bias
- Continuous evaluation for potential biases
Model Limitations
Known Limitations:
- Analyses and corresponding benchmarks are generally limited by their emphasis on broad, coarse-grained cell type labels
- Current scRNA-seq foundation models, including UCE, do not utilize the detailed information contained in the raw RNA transcripts
Future Improvements:
- New analyses and benchmarks should focus on more detailed, fine-grained cell type classifications
- Incorporation of genomic precision at the transcript level
- Simulation of the biological processes of cells, leading to the creation of "Virtual Cells"
How to Use
Input Format:
- UCE takes as input (1) scRNA-seq count data (a cell-by-gene count matrix) and (2) the corresponding protein embeddings for the genes in the dataset, generated by the large protein language model [ESM2](https://www.science.org/doi/10.1126/science.ade2574)
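The pairing between genes and protein embeddings can be pictured as a lookup table from gene symbols to ESM2 vectors. The dictionary layout, placeholder vectors, and the 5120-dim size (the ESM2-15B hidden size) below are illustrative assumptions, not the exact on-disk format.
import numpy as np

# Placeholder vectors standing in for real ESM2 protein embeddings;
# the vector size depends on the ESM2 variant used (5120 for ESM2-15B)
rng = np.random.default_rng(0)
protein_embeddings = {
    "SOX2": rng.normal(size=5120),
    "GATA1": rng.normal(size=5120),
}

# Genes with no protein embedding cannot be tokenized and are dropped
genes_in_dataset = ["SOX2", "GATA1", "UNMAPPED_GENE"]
usable_genes = [g for g in genes_in_dataset if g in protein_embeddings]
print(usable_genes)  # ['SOX2', 'GATA1']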
Output Format:
- JSON or h5ad format with cell type annotations and embeddings
Example Usage:
from helical.models.uce import UCE, UCEConfig
import anndata as ad

# Configure and instantiate the model
configurer = UCEConfig(batch_size=10)
uce = UCE(configurer=configurer)

# Load a dataset and prepare it for the model
ann_data = ad.read_h5ad("dataset.h5ad")
data_loader = uce.process_data(ann_data)

# Compute the cell embeddings
embeddings = uce.get_embeddings(data_loader)
print(embeddings.shape)
- Download processed datasets used in the paper here
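Continuing from the example above, the embeddings can be attached to the AnnData object and visualized. This is a standard scanpy workflow, not an API of the UCE package, and it assumes a cell_type column exists in ann_data.obs.
import scanpy as sc

# Attach the UCE embeddings computed above to the AnnData object
ann_data.obsm["X_uce"] = embeddings

# Build a neighbourhood graph on the UCE space and plot a UMAP
sc.pp.neighbors(ann_data, use_rep="X_uce")
sc.tl.umap(ann_data)
sc.pl.umap(ann_data, color="cell_type")  # assumes a "cell_type" obs column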
Example Fine-Tuning:
from helical.models.uce import UCEConfig, UCEFineTuningModel
import anndata as ad

# Load the data
ann_data = ad.read_h5ad("dataset.h5ad")

# Get the desired label class
cell_types = list(ann_data.obs.cell_type)

# Get unique output labels
label_set = set(cell_types)

# Create the fine-tuning model with the desired configs
configurer = UCEConfig(batch_size=10)
uce_fine_tune = UCEFineTuningModel(uce_config=configurer, fine_tuning_head="classification", output_size=len(label_set))

# Process the data for training
dataset = uce_fine_tune.process_data(ann_data)

# Create a dictionary mapping the classes to unique integers for training
class_id_dict = dict(zip(label_set, range(len(label_set))))
cell_types = [class_id_dict[cell_type] for cell_type in cell_types]

# Fine-tune
uce_fine_tune.train(train_input_data=dataset, train_labels=cell_types)
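After training, prediction on held-out data might look like the sketch below. It assumes the fine-tuning model exposes a get_outputs method returning per-cell class scores, as in other Helical fine-tuning examples, and uses a hypothetical evaluation file; verify both against the installed helical version.
import numpy as np

# Hypothetical held-out file; process it the same way as the training data
eval_data = ad.read_h5ad("eval_dataset.h5ad")
eval_dataset = uce_fine_tune.process_data(eval_data)

# Assumed API: get_outputs returns an (n_cells, n_classes) score array
outputs = uce_fine_tune.get_outputs(eval_dataset)

# Map predicted integer classes back to cell type names
id_class_dict = {v: k for k, v in class_id_dict.items()}
predicted_cell_types = [id_class_dict[i] for i in np.argmax(outputs, axis=1)]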
Contact
jure@cs.stanford.edu, quake@stanford.edu
Citation
@article{rosen2023universal,
title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
author={Rosen, Yanay and Roohani, Yusuf and Agarwal, Akshay and Samotorčan, Luka and {Tabula Sapiens Consortium} and Quake, Stephen R and Leskovec, Jure},
journal={bioRxiv},
year={2023},
doi={10.1101/2023.11.28.568918}
}
Author contributions
Y.RS., Y.RH., S.Q. and J.L. conceived the study. Y.RS., Y.RH., S.Q. and J.L. performed research, contributed new analytical tools, designed algorithmic frameworks, analyzed data and wrote the manuscript. Y.RS. and Y.RH. performed experiments and developed the software. A.A. and L.S. contributed to code and performed analyses. T.S. provided annotated data.