Config

`helical.models.c2s.Cell2SenConfig`

Configuration class for the Cell2Sen Model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `batch_size` | `int` | Number of samples to process in each batch during model operations. | `16` |
| `organism` | `str` | The organism from which the cell data is derived (e.g., 'human', 'mouse'). | `None` |
| `perturbation_column` | `str` | Column name in the input data that specifies the perturbation applied to cells. | `None` |
| `max_new_tokens` | `int` | Maximum number of new tokens the model can generate for a prediction. One gene is roughly 4 tokens, so the default of 200 covers about 50 genes. | `200` |
| `return_fit` | `bool` | Whether to return the fit parameters in the outputs. When enabled, a linear model (y = mx + c) is fitted to the gene ranks and expression values in log10-transformed space; it can be used to map between expression values and gene ranks. The paper shows that this relationship is well captured by a linear model. The fit parameters are returned in the `fit_parameters` field. | `False` |
| `dtype` | `str` | Data type for the model. | `'bfloat16'` |
| `model_size` | `str` | Size of the model. Choices are "2B" or "27B"; any other value raises a `ValueError`. | `'2B'` |
| `use_quantization` | `bool` | Whether to use 4-bit quantization. | `False` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
| `use_flash_attn` | `bool` | Whether to use Flash Attention 2 as the attention implementation. Only available on CUDA devices. If True, the attention implementation is set to "flash_attention_2"; if False, it is set to "sdpa". | `False` |
| `max_genes` | `int` | Maximum number of genes to use for the model. If None, all genes with nonzero expression are used. If a number is provided, the genes are sorted by expression level and the top `max_genes` are used. | `None` |
| `aggregation_type` | `Literal['mean_pool', 'last_token']` | How to aggregate final-layer hidden states into a single embedding. "mean_pool": computes the mean of all non-padding token embeddings in the last layer. "last_token": uses only the embedding of the final non-padding token (i.e., the position where the model would predict the next token). | `'mean_pool'` |
| `embedding_prompt_template` | `str` | Optional custom embedding prompt template used to query the model. If None, a default built-in prompt template is used. Example: 'You are given a list of genes in descending order of expression levels in a {organism} cell. Genes: {cell_sentence} Using this information, describe the function of the cell in a few words. Answer:' | `None` |
| `device` | `Literal['cpu', 'cuda']` | Device to use for the model. Choices are "cpu" or "cuda". | `'cpu'` |
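
The descriptions of `max_genes`, `return_fit`, and `embedding_prompt_template` can be made concrete with a small standalone sketch. It does not call the library: the helper names `rank_genes` and `fit_log_rank_expression` are hypothetical, the expression values are made up, and the choice of x = log10(rank) and y = log10(expression) is an assumption about how the fit is set up. It also assumes the `{organism}` and `{cell_sentence}` placeholders are filled in the style of `str.format`; the library's actual implementation may differ on all of these points.

```python
import math


def rank_genes(expression, max_genes=None):
    """Keep genes with nonzero expression, sort them in descending order,
    and optionally truncate to the top `max_genes` entries."""
    ranked = sorted(
        ((gene, value) for gene, value in expression.items() if value > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if max_genes is not None:
        ranked = ranked[:max_genes]
    return ranked


def fit_log_rank_expression(ranked):
    """Ordinary least squares for log10(expression) = m * log10(rank) + c.
    Ranks start at 1 for the most highly expressed gene (assumed convention)."""
    if len(ranked) < 2:
        raise ValueError("need at least two expressed genes to fit a line")
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(value) for _, value in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy expression profile (placeholder gene names and values, not real data).
expression = {"GENE_A": 100.0, "GENE_B": 25.0, "GENE_C": 0.0, "GENE_D": 4.0}

ranked = rank_genes(expression, max_genes=None)
sentence = " ".join(gene for gene, _ in ranked)
slope, intercept = fit_log_rank_expression(ranked)

# Template text taken from the example in the parameter description.
template = (
    "You are given a list of genes in descending order of expression levels "
    "in a {organism} cell. Genes: {cell_sentence} Using this information, "
    "describe the function of the cell in a few words. Answer:"
)
prompt = template.format(organism="human", cell_sentence=sentence)

print(prompt)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With a fit of this form, an expression value can be estimated from a rank as `10 ** (slope * log10(rank) + intercept)`, which is the kind of mapping between ranks and expression values that the `return_fit` description refers to.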

Source code in `helical/models/c2s/config.py`

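from pathlib import Path
from typing import Literal

# Assumed import location for the cache directory constant used below.
from helical.constants.paths import CACHE_DIR_HELICAL


class Cell2SenConfig: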
    """
    Configuration class for the Cell2Sen Model.

    Parameters
    ----------
    batch_size: int = 16
        Number of samples to process in each batch during model operations. Default is 16.

    organism: str = None
        The organism from which the cell data is derived (e.g., 'human', 'mouse').

    perturbation_column: str = None
        Column name in the input data that specifies the perturbation applied to cells.

    max_new_tokens: int = 200
        Maximum number of new tokens that the model can generate for prediction. Default is 200.
        One gene is roughly 4 tokens.

    return_fit: bool = False
        Whether to return model fit parameters in outputs. Default is False. This fits a linear model (y=mx+c) to the gene rank and expression values in log10-transformed space
        and can be used to map between expression values and gene ranks. The paper shows this is well captured by a linear model. The fit parameters are returned in the `fit_parameters` field.

    dtype: str = "bfloat16"
        Data type for the model. Default is "bfloat16".

    model_size: str = "2B"
        Size of the model. Default is "2B".
        Choices are "2B" or "27B".

    use_quantization: bool = False
        Whether to use 4-bit quantization. Default is False.

    seed: int = 42
        Random seed for reproducibility. Default is 42.

    use_flash_attn: bool = False
        Whether to use flash attention 2 for attention implementation. Default is False.
        Only available for CUDA devices.
        If True, the attention implementation will be set to "flash_attention_2".
        If False, the attention implementation will be set to "sdpa".

    max_genes: int = None
        Maximum number of genes to use for the model. Default is None.
        If None, all nonzero expressed genes will be used.
        If a number is provided, the genes will be sorted by expression level and the top max_genes will be used.

    aggregation_type: Literal["mean_pool", "last_token"] = "mean_pool"
        How to aggregate final-layer hidden states into a single embedding. Defaults to "mean_pool".
        "mean_pool": Computes the mean of all non-padding token embeddings in the last layer.
        "last_token": Uses only the embedding of the final non-padding token (i.e., the position where the model would predict the next token).

    embedding_prompt_template: str = None
        Optional custom embedding prompt template used to query the model.
        If None, a default built-in prompt template is used.
        Example: 'You are given a list of genes in descending order of expression levels in a {organism} cell. \n
        Genes: {cell_sentence} \n
        Using this information, describe the function of the cell in a few words. Answer:'

    device: Literal["cpu", "cuda"] = "cpu"
        Device to use for the model. Default is "cpu".
        Choices are "cpu" or "cuda".

    """
    def __init__(
        self,
        batch_size: int = 16,
        organism: str = None,
        perturbation_column: str = None,
        max_new_tokens: int = 200,
        max_genes: int = None,
        aggregation_type: Literal["mean_pool", "last_token"] = "mean_pool",
        embedding_prompt_template: str = None,
        return_fit: bool = False,
        dtype: str = "bfloat16",
        model_size: str = "2B",
        device: Literal["cpu", "cuda"] = "cpu",
        use_quantization: bool = False,
        seed: int = 42,
        use_flash_attn: bool = False,
    ):

        if model_size == "2B":
            model_path = Path(CACHE_DIR_HELICAL, "c2s_model_2B")
            hf_model_path = "vandijklab/C2S-Scale-Gemma-2-2B"
            # list_of_files_to_download = [
            #     "c2s_model/config.json",
            #     "c2s_model/generation_config.json",
            #     "c2s_model/model-00001-of-00002.safetensors",
            #     "c2s_model/model-00002-of-00002.safetensors",
            #     "c2s_model/model.safetensors.index.json",
            #     "c2s_model/special_tokens_map.json",
            #     "c2s_model/tokenizer_config.json",
            #     "c2s_model/tokenizer.json",
            # ]
        elif model_size == "27B":
            model_path = Path(CACHE_DIR_HELICAL, "c2s_model_27B")
            hf_model_path = "vandijklab/C2S-Scale-Gemma-2-27B"
            # list_of_files_to_download = [
            #     "c2s_model_27B/config.json",
            #     "c2s_model_27B/generation_config.json",
            #     "c2s_model_27B/model-00001-of-00012.safetensors",
            #     "c2s_model_27B/model-00002-of-00012.safetensors",
            #     "c2s_model_27B/model-00003-of-00012.safetensors",
            #     "c2s_model_27B/model-00004-of-00012.safetensors",
            #     "c2s_model_27B/model-00005-of-00012.safetensors",
            #     "c2s_model_27B/model-00006-of-00012.safetensors",
            #     "c2s_model_27B/model-00007-of-00012.safetensors",
            #     "c2s_model_27B/model-00008-of-00012.safetensors",
            #     "c2s_model_27B/model-00009-of-00012.safetensors",
            #     "c2s_model_27B/model-00010-of-00012.safetensors",
            #     "c2s_model_27B/model-00011-of-00012.safetensors",
            #     "c2s_model_27B/model-00012-of-00012.safetensors",
            #     "c2s_model_27B/model.safetensors.index.json",
            #     "c2s_model_27B/special_tokens_map.json",
            #     "c2s_model_27B/tokenizer_config.json",
            #     "c2s_model_27B/tokenizer.json",
            # ]
        else:
            raise ValueError(f"Model size {model_size} not supported. Please choose from '2B' or '27B'.")

        self.config = {
            "hf_model_path": hf_model_path,
            # "list_of_files_to_download": list_of_files_to_download,
            "model_path": model_path,
            "batch_size": batch_size,
            "organism": organism,
            "perturbation_column": perturbation_column,
            "max_new_tokens": max_new_tokens,
            "return_fit": return_fit,
            "use_quantization": use_quantization,
            "seed": seed,
            "dtype": dtype,
            "model_size": model_size,
            "use_flash_attn": use_flash_attn,
            "max_genes": max_genes,
            "aggregation_type": aggregation_type,
            "embedding_prompt_template": embedding_prompt_template,
            "device": device,
        }
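
The two `aggregation_type` options stored in the config above can be illustrated with a minimal, library-free sketch. It assumes one sequence of final-layer hidden states given as a list of vectors plus an attention mask with 1 for real tokens and 0 for padding; the function name `aggregate` and the toy numbers are hypothetical, and the model's own implementation operates on batched tensors rather than Python lists.

```python
def aggregate(hidden_states, attention_mask, aggregation_type="mean_pool"):
    """Reduce per-token vectors to one embedding, ignoring padding positions."""
    # Keep only the vectors whose mask entry marks a real (non-padding) token.
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("sequence contains no non-padding tokens")

    if aggregation_type == "mean_pool":
        # Average each dimension over all non-padding tokens.
        dim = len(kept[0])
        return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    if aggregation_type == "last_token":
        # Use the final non-padding token only.
        return list(kept[-1])
    raise ValueError(f"unknown aggregation_type: {aggregation_type!r}")


# Three token positions with 2-dimensional states; the last position is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

print(aggregate(hidden_states, attention_mask, "mean_pool"))   # [2.0, 3.0]
print(aggregate(hidden_states, attention_mask, "last_token"))  # [3.0, 4.0]
```

Because the sketch filters by the mask instead of slicing by position, it gives the same result for left-padded and right-padded sequences.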