Model Card for Universal Cell Embedding (UCE)
Model Details
Model Name: Universal Cell Embedding (UCE)
Model Version: 1.0
Model Description: A large-scale, self-supervised, transformer-based model pre-trained on more than 36 million cells to create the Integrated Mega-scale Atlas, which spans more than 1,000 uniquely named cell types from hundreds of experiments, dozens of tissues, and eight species. UCE represents a sample's genes by their protein products, using a large protein language model. This allows UCE to meaningfully represent any gene from any species, regardless of whether that species appeared in the training data. The model enables cell type annotation prediction, hypothesis generation, disease state comparison, new data mapping, and integration of diverse single-cell datasets, and it opens the door to the discovery of novel cell type functions.
Model Developers
Developed By: Yanay Rosen and Yusuf Roohani conceived the study, performed research, contributed new analytical tools, designed algorithmic frameworks, analyzed data, performed experiments, and developed the software. See the Author contributions section below for the full list.
Contact Information: jure@cs.stanford.edu, quake@stanford.edu
License: MIT License Copyright (c) 2023 Yanay Rosen, Yusuf Roohani, Jure Leskovec
Model Type
Architecture: Transformer-based
Domain: Cell Biology, Bioinformatics
Input Data: Single-cell transcriptomics data
Model Purpose
Technical usage:
- Tokenizing each gene according to its protein product
- Pre-training
- Running in a zero-shot setting
- Extracting and plotting cell embeddings
Broader research applications:
- Designed to address questions in cell and molecular biology
- Generating representations of new single-cell expression datasets without any model fine-tuning or retraining, while remaining robust to dataset- and batch-specific artifacts
- Cell type prediction in large single-cell datasets with no additional model retraining (see the sketch after this list)
- Mapping of new data into a universal embedding space that aligns cell types across tissues and species
- Hypothesis generation in biological research
- Novel cross-dataset discoveries
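As a concrete illustration of zero-shot cell type prediction and new-data mapping, the snippet below sketches label transfer by nearest neighbours in the shared embedding space. The arrays are random placeholders standing in for UCE embeddings of an annotated reference atlas and a new query dataset; the embedding dimension, neighbour count, and metric are illustrative choices, not settings prescribed by the paper.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder arrays: in practice these come from uce.get_embeddings(...)
rng = np.random.default_rng(0)
ref_embeddings = rng.normal(size=(100, 1280))     # annotated reference atlas
ref_labels = np.array(["T cell", "B cell"] * 50)  # reference cell type labels
query_embeddings = rng.normal(size=(10, 1280))    # new, unannotated dataset

# Because UCE maps all datasets into one universal space, labels can be
# transferred to new cells by nearest neighbours, with no retraining
knn = KNeighborsClassifier(n_neighbors=15, metric="cosine")
knn.fit(ref_embeddings, ref_labels)
predicted_cell_types = knn.predict(query_embeddings)
print(predicted_cell_types)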
Training Data
Data Sources:
- Public single-cell transcriptomic datasets (e.g., CellXGene, various GEO datasets)
- Data from multiple species (including humans, mice, lemurs, zebrafish, pigs, monkeys, and frogs) and tissues to ensure diversity
- Download the full list of datasets used to train UCE here
Data Volume:
- Trained across more than 300 datasets consisting of over 36 million cells and more than 1,000 different cell types
Preprocessing:
The creation of the Integrated Mega-scale Atlas involved filtering:
- Duplicate cells, by retaining primary cells only
- For datasets from the CxG Census, cells with fewer than 200 expressed genes and genes detected in fewer than 10 cells

No highly variable gene selection was applied.
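For reference, a minimal sketch of these filters using scanpy is shown below; the file name and the is_primary_data column follow CELLxGENE conventions and are assumptions here, not part of the released pipeline.
import scanpy as sc

adata = sc.read_h5ad("raw_dataset.h5ad")  # placeholder path

# Drop duplicate cells by keeping primary cells only
adata = adata[adata.obs["is_primary_data"]].copy()

# Keep cells with at least 200 expressed genes
sc.pp.filter_cells(adata, min_genes=200)

# Keep genes detected in at least 10 cells
sc.pp.filter_genes(adata, min_cells=10)

# No highly variable gene selection is applied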
Model Performance
Evaluation Metrics:
- Zero-shot embedding quality and clustering, using metrics from the single-cell integration benchmark (a minimal stand-in is sketched after this list)
- Cell type organization
- Comparison to cell ontology
- Zero-shot cell type alignment to Integrated Mega-scale Atlas
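The paper relies on the single-cell integration benchmark (scib) metric suite; as a minimal stand-in, the sketch below scores how well Leiden clusters of the UCE embedding agree with known labels via ARI and NMI. It assumes an ann_data object with a cell_type column and an embeddings array as produced under Example Usage below.
import scanpy as sc
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Assumed inputs: `ann_data` with a "cell_type" obs column, and `embeddings`
# returned by uce.get_embeddings(...) (see Example Usage below)
ann_data.obsm["X_uce"] = embeddings
sc.pp.neighbors(ann_data, use_rep="X_uce")  # kNN graph on the UCE space
sc.tl.leiden(ann_data, key_added="uce_leiden")

ari = adjusted_rand_score(ann_data.obs["cell_type"], ann_data.obs["uce_leiden"])
nmi = normalized_mutual_info_score(ann_data.obs["cell_type"], ann_data.obs["uce_leiden"])
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")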
Testing Data:
- Held-out subsets of training datasets
- External validation using diverse single-cell datasets
Ethical Considerations
Bias and Fairness:
- Inclusion of diverse species and cell types to minimize bias
- Continuous evaluation for potential biases
Model Limitations
Known Limitations:
- Analyses and corresponding benchmarks are generally limited by their emphasis on broad, coarse-grained cell type labels
- Current scRNA-seq foundation models, including UCE, do not utilize the detailed information contained in the raw RNA transcripts
Future Improvements:
- New analyses and benchmarks should focus on more detailed, fine-grained cell type classifications
- Incorporation of genomic precision at the transcript level
- Simulation of the biological processes of cells, leading to the creation of "Virtual Cells"
How to Use
Input Format:
- UCE takes as input (1) scRNA-seq count data (a cell-by-gene count matrix) and (2) the corresponding protein embeddings for the genes in the dataset, generated by the large protein language model [ESM2](https://www.science.org/doi/10.1126/science.ade2574)
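The pairing between genes and protein embeddings can be pictured as a lookup table from gene symbols to ESM2 vectors. The dictionary layout, placeholder vectors, and the 5120-dim size (the ESM2-15B hidden size) below are illustrative assumptions, not the exact on-disk format.
import numpy as np

# Placeholder vectors standing in for real ESM2 protein embeddings;
# the vector size depends on the ESM2 variant used (5120 for ESM2-15B)
rng = np.random.default_rng(0)
protein_embeddings = {
    "SOX2": rng.normal(size=5120),
    "GATA1": rng.normal(size=5120),
}

# Genes with no protein embedding cannot be tokenized and are dropped
genes_in_dataset = ["SOX2", "GATA1", "UNMAPPED_GENE"]
usable_genes = [g for g in genes_in_dataset if g in protein_embeddings]
print(usable_genes)  # ['SOX2', 'GATA1']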
Output Format:
- JSON or h5ad format with cell type annotations and embeddings
Example Usage:
from helical.models.uce import UCE, UCEConfig
import anndata as ad

# Configure and instantiate the model
configurer = UCEConfig(batch_size=10)
uce = UCE(configurer=configurer)

# Load a dataset and prepare it for the model
ann_data = ad.read_h5ad("dataset.h5ad")
data_loader = uce.process_data(ann_data)

# Compute the cell embeddings
embeddings = uce.get_embeddings(data_loader)
print(embeddings.shape)
- Download processed datasets used in the paper here
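Continuing from the example above, the embeddings can be attached to the AnnData object and visualized. This is a standard scanpy workflow, not an API of the UCE package, and it assumes a cell_type column exists in ann_data.obs.
import scanpy as sc

# Attach the UCE embeddings computed above to the AnnData object
ann_data.obsm["X_uce"] = embeddings

# Build a neighbourhood graph on the UCE space and plot a UMAP
sc.pp.neighbors(ann_data, use_rep="X_uce")
sc.tl.umap(ann_data)
sc.pl.umap(ann_data, color="cell_type")  # assumes a "cell_type" obs column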
Example Fine-Tuning:
from helical.models.uce import UCEConfig, UCEFineTuningModel
import anndata as ad

# Load the data
ann_data = ad.read_h5ad("dataset.h5ad")

# Get the desired label class
cell_types = list(ann_data.obs.cell_type)

# Get unique output labels
label_set = set(cell_types)

# Create the fine-tuning model with the desired configs
configurer = UCEConfig(batch_size=10)
uce_fine_tune = UCEFineTuningModel(uce_config=configurer, fine_tuning_head="classification", output_size=len(label_set))

# Process the data for training
dataset = uce_fine_tune.process_data(ann_data)

# Create a dictionary mapping the classes to unique integers for training
class_id_dict = dict(zip(label_set, range(len(label_set))))
cell_types = [class_id_dict[cell_type] for cell_type in cell_types]

# Fine-tune
uce_fine_tune.train(train_input_data=dataset, train_labels=cell_types)
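After training, prediction on held-out data might look like the sketch below. It assumes the fine-tuning model exposes a get_outputs method returning per-cell class scores, as in other Helical fine-tuning examples, and uses a hypothetical evaluation file; verify both against the installed helical version.
import numpy as np

# Hypothetical held-out file; process it the same way as the training data
eval_data = ad.read_h5ad("eval_dataset.h5ad")
eval_dataset = uce_fine_tune.process_data(eval_data)

# Assumed API: get_outputs returns an (n_cells, n_classes) score array
outputs = uce_fine_tune.get_outputs(eval_dataset)

# Map predicted integer classes back to cell type names
id_class_dict = {v: k for k, v in class_id_dict.items()}
predicted_cell_types = [id_class_dict[i] for i in np.argmax(outputs, axis=1)]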
Contact
jure@cs.stanford.edu, quake@stanford.edu
Citation
@article{rosen2023universal,
title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
author={Rosen, Yanay and Roohani, Yusuf and Agarwal, Akshay and Samotorčan, Luka and {Tabula Sapiens Consortium} and Quake, Stephen R and Leskovec, Jure},
journal={bioRxiv},
year={2023},
doi={10.1101/2023.11.28.568918}
}
Author contributions
Y.RS., Y.RH., S.Q. and J.L. conceived the study. Y.RS., Y.RH., S.Q. and J.L. performed research, contributed new analytical tools, designed algorithmic frameworks, analyzed data and wrote the manuscript. Y.RS. and Y.RH. performed experiments and developed the software. A.A. and L.S. contributed to code and performed analyses. T.S. provided annotated data.