Model
helical.models.hyena_dna.HyenaDNA
Bases: HelicalDNAModel
HyenaDNA model. This class represents HyenaDNA, a long-range genomic foundation model pretrained on context lengths of up to 1 million tokens at single-nucleotide resolution.
Example
from helical.models.hyena_dna import HyenaDNA, HyenaDNAConfig
hyena_config = HyenaDNAConfig(model_name = "hyenadna-tiny-1k-seqlen-d256")
model = HyenaDNA(configurer = hyena_config)
sequences = ['ACTG' * (1024 // 4)]
tokenized_sequence = model.process_data(sequences)
embeddings = model.get_embeddings(tokenized_sequence)
print(embeddings.shape)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
configurer | HyenaDNAConfig | The model configuration. | default_configurer |
Notes
The paper can be found [here](https://arxiv.org/abs/2306.15794). We use the implementation from the HyenaDNA repository.
Source code in helical/models/hyena_dna/model.py
process_data(sequences, return_tensors='pt', padding='max_length', truncation=True)
Process the input DNA sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sequences | list[str] or DataFrame | The input DNA sequences to be processed. If a DataFrame is provided, it should have a column named 'Sequence'. | required |
return_tensors | str | The return type of the processed data. | "pt" |
padding | str | The padding strategy to be used. | "max_length" |
truncation | bool | Whether to truncate the sequences or not. | True |
Returns:
Type | Description |
---|---|
Dataset | Containing processed DNA sequences. |
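Because process_data truncates by default, it can be useful to check raw inputs before tokenizing them. The following is a minimal sketch, not part of helical: `prepare_sequences` and `VALID_BASES` are hypothetical names, the ACGTN alphabet is an assumption, and the 1024 limit is only inferred from the "1k" in the example model name. It does not account for any special tokens the tokenizer may add.

```python
# Hypothetical helper, not part of helical: checks raw DNA strings before
# they are handed to process_data.
VALID_BASES = frozenset("ACGTN")  # assumed alphabet


def prepare_sequences(raw_sequences, max_length):
    """Return (cleaned, too_long).

    cleaned  -- upper-cased, whitespace-stripped sequences
    too_long -- indices of sequences longer than max_length characters
    """
    cleaned = []
    too_long = []
    for index, raw in enumerate(raw_sequences):
        sequence = raw.strip().upper()
        if not sequence:
            raise ValueError(f"sequence {index} is empty")
        invalid = set(sequence) - VALID_BASES
        if invalid:
            raise ValueError(
                f"sequence {index} contains unexpected characters: {sorted(invalid)}"
            )
        if len(sequence) > max_length:
            too_long.append(index)
        cleaned.append(sequence)
    return cleaned, too_long


cleaned, too_long = prepare_sequences(["acgt", "ACGTN" * 300], max_length=1024)
print(len(cleaned), too_long)
```

The cleaned list has the list[str] form that process_data accepts, and the reported indices show which inputs would lose bases when truncation is left at its default of True.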
get_embeddings(dataset)
Get the embeddings for the tokenized sequence.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | The output dataset from process_data. | required |
Returns:
Type | Description |
---|---|
ndarray | The embeddings for the tokenized sequence in the form of a numpy array. |
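A common next step is to compare the embeddings of two sequences. The sketch below is an illustration rather than part of helical: `cosine_similarity` is a hypothetical helper, and it assumes each sequence's embedding has already been reduced to a 1-D vector, since the exact shape of the returned array is not stated here. It uses only the standard library and accepts any sequence of floats, including a 1-D row of a numpy array.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine similarity between two equal-length 1-D vectors."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for two per-sequence embeddings.
first = [0.5, 1.0, -0.5]
second = [1.0, 2.0, -1.0]
print(cosine_similarity(first, second))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate nearly orthogonal vectors.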