Model Card for HyenaDNA
Model Details
Model Name: HyenaDNA
Model Version: 1.0
Model Description: HyenaDNA, based on the Hyena architecture, is designed for long-range genomic sequence analysis with single nucleotide resolution.
Model Developers
Developed By: Eric Nguyen, Michael Poli, Marjan Faizi, Armin W. Thomas, Callum Birch Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Christopher Ré
Institutions: Stanford University, Harvard University, SynTensor, Mila, Université de Montréal
Contact Information: GitHub Repository
License: Apache 2.0
Model Type
Architecture: Decoder-only sequence-to-sequence
Domain: Genomics
Input Data: DNA sequences at single nucleotide resolution
Model Purpose
Intended Use:
- Research in genomics
- Computational biology
Out-of-Scope Use Cases:
- Direct clinical applications without further validation
Training Data
Data Sources:
- Human reference genome
- Data Source Link
Model Performance
Evaluation Metrics:
- Accuracy, Precision, Recall, F1-Score
Performance Benchmarks:
- We create the probing results with the pre-trained HyenaDNA model and compare it to the results from the paper. We provide the notebook to re-produce our results.
- The tutorial Hyena-DNA-Inference.ipynb was used as a basis to create this comparison, as well as the values from the Hyena and the Nucleotide transformer (NT) papers.
- Probing was used for the 18 downstream tasks, where the HyenaDNA embeddings of nucleotide sequences were used as features to a simpler neural network.
- The same neural network with the same hyperparameters across all the tasks was used to generate these results.
- Our results underperform in comparison to the fine-tuned models. This is due to the much larger models being used for the NT, while the Hyena model was pre-trained from scratch for the better performances.
Dataset | Metric | HyenaDNA pre-trained (probing) - Helical | NT (fine-tuned) - Original | GPT - Original | HyenaDNA pretrained (fine-tuned) - Original | HyenaDNA not pretrained - Original |
---|---|---|---|---|---|---|
H4ac | MCC | 33.27% | 50.10% | 36.40% | 63.70% | 43.50% |
H3K36me3 | MCC | 46.65% | 63.20% | 47.80% | 65.30% | 53.40% |
splice_sites_donors | F1 | 77.08% | 98.40% | 98.10% | 97.30% | 96.50% |
splice_sites_acceptors | F1 | 77.20% | 99.00% | 97.60% | 96.60% | 96.60 |
H3 | MCC | 72.07% | 81.40% | 75.80% | 81.70% | 79.90% |
H4 | MCC | 72.35% | 82.20% | 77.70% | 79.60% | 79.10% |
H3K4me3 | MCC | 24.04% | 42.10% | 28.30% | 61.20% | 40.20% |
splice_sites_all | F1 | 57.15% | 98.30% | 98.00% | 97.90% | 97.30% |
H3K4me1 | MCC | 38.11% | 55.90% | 38.70% | 57.10% | 43.40% |
H3K14ac | MCC | 36.69% | 55.00% | 41.60% | 66.30% | 48.00% |
enhancers_types | MCC | 34.62% | 47.40% | 51.90% | 55.70% | 48.40% |
promoter_no_tata | F1 | 93.84% | 97.70% | 96.60% | 96.60% | 96.50% |
H3K79me3 | MCC | 54.54% | 64.20% | 58.90% | 71.60% | 59.70% |
H3K4me2 | MCC | 27.00% | 32.60% | 28.80% | 53.90% | 34.50% |
promoter_tata | F1 | 91.91% | 96.40% | 96.60% | 96.70% | 96.10% |
enhancers | MCC | 48.02% | 58.00% | 59.30% | 62.60% | 58.60% |
H3K9ac | MCC | 43.01% | 57.50% | 49.20% | 65.10% | 52.60% |
promoter_all | F1 | 93.99% | 97.40% | 96.30% | 96.50% | 96.10% |
Ethical Considerations
Bias and Fairness:
- Trained only on the human reference genome
Privacy:
- Uses publicly available genomic data
Mitigations:
- Continuous monitoring for biases
Model Limitations
Known Limitations:
- Limited to genomic data
Future Improvements:
- Expansion to include diverse genomic datasets
How to Use
Input Format:
- DNA sequence strings
Output Format:
- JSON objects with genomic feature predictions
Example Usage:
from helical.models.hyena_dna import HyenaDNA, HyenaDNAConfig
hyena_config = HyenaDNAConfig(model_name = "hyenadna-tiny-1k-seqlen-d256")
model = HyenaDNA(configurer = hyena_config)
sequence = 'ACTG' * 1024
tokenized_sequence = model.process_data(sequence)
embeddings = model.get_embeddings(tokenized_sequence)
print(embeddings.shape)
Example Fine-Tuning:
from datasets import load_dataset
from helical.models.hyena_dna import HyenaDNAConfig, HyenaDNAFineTuningModel
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a Hugging Face dataset and task type
ds = load_dataset("dataset", "task")
# Define the desired configs
config = HyenaDNAConfig(device=device, batch_size=10)
# Define the fine-tuning model with the configs we instantiated above
hyena_fine_tune = HyenaDNAFineTuningModel(config, "classification", number_unique_outputs)
# Prepare the sequences for input to the model
input_dataset = hyena_fine_tune.process_data(ds["train"]["sequence"])
# train the fine-tuning model on some downstream task
hyena_fine_tune.train(input_dataset, ds["train"]["label"])
Citation
When using HyenaDNA, please cite the original paper and use the DOI link provided in the GitHub repository: