<div style="border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;">

# Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences

by [Raphael Mourad](https://training.galaxyproject.org/hall-of-fame/raphaelmourad/), [Bérénice Batut](https://training.galaxyproject.org/hall-of-fame/bebatut/)

CC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)

**Questions**

- How to load and configure a pre-trained language model for DNA sequence analysis?
- What is the process for tokenizing DNA sequences to prepare them for model training?
- How to split and organize a DNA sequence dataset for effective model training and evaluation?
- What are the key hyperparameters to consider when pretraining a language model on DNA sequences, and how to configure them?
- How to use a trained language model to generate and interpret embeddings for DNA sequences?

**Objectives**

- Identify and load a pre-trained language model (LLM) suitable for DNA sequence analysis.
- Explain the role of a tokenizer in converting DNA sequences into numerical tokens for model processing.
- Prepare and tokenize DNA sequence datasets for model training and evaluation.
- Configure and implement data collation to organize tokenized data into batches for efficient training.
- Define and configure hyperparameters for pretraining a model, such as learning rate and batch size.
- Monitor and evaluate the model's performance during training to ensure effective learning.
- Use the trained model to generate embeddings for DNA sequences and interpret these embeddings for downstream bioinformatics applications.
- Develop a complete workflow for training a language model on DNA sequences, from data preparation to model evaluation, and apply it to real-world bioinformatics tasks.

**Time Estimation: 3H**
</div>


<p><strong>Generative Artificial Intelligence</strong> (AI) represents a cutting-edge domain within machine learning, focused on creating new, synthetic yet realistic data. This includes generating text, images, music, and even biological sequences. At the heart of many generative AI applications are <strong>Large Language Models</strong> (LLMs), which have revolutionized natural language processing and beyond.</p>
<p>LLMs are <strong>sophisticated neural networks</strong> trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on <strong>Transformers</strong>, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery.</p>
<blockquote class="details" style="border: 2px solid #ddd; margin: 1em 0.2em">
<div class="box-title details-title" id="details-transformers"><button class="gtn-boxify-button details" type="button" aria-controls="details-transformers" aria-expanded="true"><i class="fas fa-info-circle" aria-hidden="true" ></i> <span>Details:  Transformers </span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>Transformers are a type of neural network model designed to handle sequential data, such as text, by using self-attention mechanisms to weigh the importance of input elements relative to each other, enabling the model to understand and generate coherent and contextually relevant outputs.</p>
</blockquote>
<p>In this tutorial, we will explore the intersection of generative AI and genomics by <strong>pretraining an LLM from scratch on DNA sequences</strong>. This process will equip the model with a foundational understanding of the “grammar” of DNA, enabling it to generate and analyze genetic data.</p>
<p><a href="https://mistral.ai/">Mistral AI</a>, a French artificial intelligence (AI) startup, recently released large language models (LLMs) reported to outperform Llama 2. In particular, Mixtral-8x7B implements:</p>
<ul>
<li><strong>Grouped-Query Attention</strong>: Efficiently computes attention by grouping queries, reducing computational load and memory usage.</li>
<li><strong>Sliding-Window Attention</strong>: Focuses on a fixed-size window of tokens, sliding over the sequence to manage long texts efficiently.</li>
<li><strong>Byte-fallback BPE Tokenizer</strong>: Tokenizes text into subword units, falling back to byte-level tokenization for unknown words, ensuring robust handling of diverse text inputs.</li>
</ul>
<p>These techniques collectively enhance the performance and efficiency of large language models, enabling them to process and generate text more effectively.</p>
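<p>To make sliding-window attention more concrete, here is a minimal sketch of the attention mask it implies. The function name <code style="color: inherit">sliding_window_causal_mask</code> and the convention that the window includes the current token are assumptions made for illustration; the actual Mistral implementation differs in its details.</p>

```python
def sliding_window_causal_mask(seq_len, window):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    return [
        # Position j is visible from i if it is not in the future (i - j >= 0)
        # and lies within the last `window` positions (i - j < window).
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

<p>Each row holds at most <code style="color: inherit">window</code> visible positions, so the cost of attention grows linearly with the sequence length instead of quadratically.</p>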
<p>In this tutorial, we will use a simplified Mistral model architecture with fewer layers and hidden units to reduce computational requirements. The model will be trained to predict the next base in the sequence. For instance, for a sequence like <code style="color: inherit">ATTTGTTGGT</code>, the model will be trained to predict the suffix <code style="color: inherit">TTGGT</code> given the prefix <code style="color: inherit">ATTTG</code>. This process is called <strong>causal language modeling</strong>.</p>
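<p>The prefix and suffix idea can be illustrated with plain Python. This sketch works on single bases for readability, which is an assumption: the model itself operates on tokens produced by a tokenizer, not on individual characters.</p>

```python
sequence = "ATTTGTTGGT"

# Character-level illustration: each prefix is paired with the base that follows it.
pairs = [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

# Show the training pairs that start from the prefix ATTTG.
for prefix, next_base in pairs[4:7]:
    print(f"{prefix!r} -> {next_base!r}")
```

<p>Reading the targets of the pairs from <code style="color: inherit">ATTTG</code> onwards, one after the other, spells out the suffix <code style="color: inherit">TTGGT</code>.</p>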
<p>To pretrain the model, we will use a file containing 100,000 non-overlapping DNA sequences of 200 bases, corresponding to around 1% of the human genome (hg38 assembly). This involves training the model to predict the end of a DNA sequence.</p>
<p>By the end of this tutorial, we will obtain a Mistral-DNA model with an internal representation of DNA sequence grammar. This pretrained model can then be used for various applications, such as fine-tuning for classification tasks or predicting mutational effects.</p>
<blockquote class="agenda" style="border: 2px solid #86D486;display: none; margin: 1em 0.2em">
<div class="box-title agenda-title" id="agenda">Agenda</div>
<p>In this tutorial, we will cover:</p>
<ol id="markdown-toc">
<li><a href="#prepare-resources" id="markdown-toc-prepare-resources">Prepare resources</a>    <ol>
<li><a href="#install-dependencies" id="markdown-toc-install-dependencies">Install dependencies</a></li>
</ol>
</li>
</ol>
</blockquote>
<h1 id="prepare-resources">Prepare resources</h1>
<p>To pretrain the model, let’s open a Notebook or a Python script.</p>
<h2 id="install-dependencies">Install dependencies</h2>
<p>The first step is to install the required dependencies:</p>


In [None]:
!pip install accelerate
!pip install datasets==3.0.1
!pip install transformers
!pip install torch
!pip install flash-attn

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What are the required dependencies doing?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution"><button class="gtn-boxify-button solution" type="button" aria-controls="solution" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li>
<p><code style="color: inherit">accelerate</code>: A library by <a href="https://huggingface.co/">Hugging Face</a>, a platform that provides tools and resources for building, training, and deploying machine learning models. <code style="color: inherit">accelerate</code> simplifies training and deploying models across different hardware environments. It provides tools to optimize performance on GPUs, TPUs, and other accelerators, making it easier to scale models efficiently.</p>
</li>
<li>
<p><code style="color: inherit">datasets</code>: A library by Hugging Face for managing and processing datasets. It provides tools to load, manipulate, and share datasets in a standardized format, making it easier to work with machine learning data.</p>
</li>
<li>
<p><code style="color: inherit">numpy</code>: A fundamental package for scientific computing in Python.</p>
</li>
<li>
<p><code style="color: inherit">torch</code>: Also known as PyTorch, it is an open-source machine learning library developed by Facebook’s AI Research lab. It provides a flexible platform for building and training neural networks, with a focus on tensor computations and automatic differentiation.</p>
</li>
<li>
<p><code style="color: inherit">transformers</code>: A library by Hugging Face that provides implementations of state-of-the-art transformer models for natural language processing (NLP). It includes pre-trained models and tools for fine-tuning, making it easier to apply transformers to various NLP tasks.</p>
</li>
<li>
<p><code style="color: inherit">flash-attn</code>: An implementation of FlashAttention, a fast and memory-efficient exact attention algorithm with IO-awareness.</p>
<p>These libraries are widely used in the machine learning and data science communities for their efficiency, flexibility, and extensive functionality.</p>
</li>
</ul>
</details>
</blockquote>
<h2 id="import-python-libraries">Import Python libraries</h2>
<p>Let’s now import them.</p>


In [None]:
import os

import accelerate
import flash_attn
import torch
import transformers
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

<blockquote class="details" style="border: 2px solid #ddd; margin: 1em 0.2em">
<div class="box-title details-title" id="details-loaded-functions-and-classes-from-datasets-and-transformers-libraries"><button class="gtn-boxify-button details" type="button" aria-controls="details-loaded-functions-and-classes-from-datasets-and-transformers-libraries" aria-expanded="true"><i class="fas fa-info-circle" aria-hidden="true" ></i> <span>Details: Loaded functions and classes from datasets and transformers libraries</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li><code style="color: inherit">datasets</code>:
<ul>
<li><code style="color: inherit">load_dataset</code>: function to load datasets from the Hugging Face Hub or local files.</li>
</ul>
</li>
<li><code style="color: inherit">transformers</code>:
<ul>
<li><code style="color: inherit">AutoConfig</code>: Automatically loads the configuration for a pre-trained model. It defines the architecture and hyperparameters of the model.</li>
<li><code style="color: inherit">AutoModelForCausalLM</code>: Loads a pre-trained causal language model for tasks like text generation, where the model predicts the next token in a sequence.</li>
<li><code style="color: inherit">AutoTokenizer</code>: Loads the tokenizer associated with a pre-trained model. It converts text into tokens that the model can process.</li>
<li><code style="color: inherit">DataCollatorForLanguageModeling</code>: A data collator specifically designed for language modeling tasks. It prepares batches of data for training by handling padding and masking.</li>
<li><code style="color: inherit">EarlyStoppingCallback</code>: A callback used during training to stop the process early if the model’s performance on the validation set stops improving, saving time and resources.</li>
<li><code style="color: inherit">Trainer</code>: A high-level API for training and evaluating transformer models. It simplifies the training loop and handles tasks like gradient accumulation and evaluation.</li>
<li><code style="color: inherit">TrainingArguments</code>: A class to define the training configuration, including hyperparameters like learning rate, batch size, and number of epochs. It is used to configure the <code style="color: inherit">Trainer</code>.</li>
</ul>
</li>
</ul>
<p>These components work together to streamline the process of training and fine-tuning transformer models for various NLP tasks.</p>
</blockquote>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-versions"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Versions</div>
<p>This tutorial has been tested with the following versions:</p>
<ul>
<li><code style="color: inherit">accelerate</code> &gt; 0.32.1</li>
<li><code style="color: inherit">flash_attn</code> &gt; 2.6.0.post1 and 2.7.0.post2</li>
<li><code style="color: inherit">transformers</code> &gt; 4.47.1</li>
</ul>
<p>You can check the versions with:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">accelerate.__version__
flash_attn.__version__
transformers.__version__
</code></pre></div>  </div>
</blockquote>
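<p>As an alternative to importing each package, the installed versions can be read from the package metadata. This is a minimal sketch: the helper name <code style="color: inherit">installed_versions</code> is hypothetical, and the distribution names are assumed to match those used in the <code style="color: inherit">pip install</code> commands above.</p>

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


versions = installed_versions(
    ["accelerate", "datasets", "flash-attn", "torch", "transformers"]
)
for name, ver in versions.items():
    print(f"{name:<14}{ver or 'not installed'}")
```

<p>Because nothing is imported, this check also works when a package is installed but fails to load, for example when <code style="color: inherit">flash-attn</code> was built against a different CUDA version.</p>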
<h2 id="check-and-configure-available-resources">Check and configure available resources</h2>
<p>To pretrain the model, we need specific resources:</p>
<ul>
<li><strong>Graphics Processing Unit</strong> (GPU): a specialized processor designed to handle complex graphical computations, often used for rendering images, videos, and accelerating machine learning tasks</li>
<li><strong>Video Random Access Memory</strong> (VRAM): dedicated memory used by a GPU to store and process graphical data, enabling smooth rendering of images and videos</li>
</ul>
<p>Let’s check the resources:</p>


In [None]:
!nvidia-smi

<p>The command <code style="color: inherit">nvidia-smi</code> (NVIDIA System Management Interface) is used to monitor and manage NVIDIA GPU devices. It provides information about the GPU’s utilization, memory usage, temperature, and running processes. This tool is essential for developers and researchers to track the performance and health of GPUs, especially when running computationally intensive tasks like machine learning training.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-1"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>How do you interpret the following output?</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">Tue Mar 25 13:49:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
</code></pre></div>  </div>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-1"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-1" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li><code style="color: inherit">Driver Version</code>: The version of the NVIDIA driver installed on the system (<code class="language-plaintext highlighter-rouge">550.54.15</code>).</li>
<li><code style="color: inherit">CUDA Version</code>: The highest version of CUDA, the parallel computing platform and API created by NVIDIA, that the installed driver supports (<code class="language-plaintext highlighter-rouge">12.4</code>).</li>
<li><code style="color: inherit">GPU Name</code>: The model of the GPU, in this case, a <code style="color: inherit">Tesla T4</code>.</li>
<li><code style="color: inherit">Persistence-M</code>: Indicates whether Persistence Mode is enabled (<code style="color: inherit">Off</code> in this case), which can improve performance for certain applications.</li>
<li><code style="color: inherit">Bus-Id</code>: The PCI bus ID of the GPU (<code class="language-plaintext highlighter-rouge">00000000:00:04.0</code>).</li>
<li><code style="color: inherit">Fan</code>: The speed of the GPU fan (<code style="color: inherit">N/A</code> means not available or not reporting).</li>
<li><code style="color: inherit">Temp</code>: The current temperature of the GPU (<code class="language-plaintext highlighter-rouge">40°C</code>).</li>
<li><code style="color: inherit">Perf</code>: The performance state of the GPU (P8 indicates a low-power state).</li>
<li><code style="color: inherit">Pwr:Usage/Cap</code>: The current power usage (9W) and the power cap (70W).</li>
<li><code style="color: inherit">Memory-Usage</code>: The amount of GPU memory currently in use (2MiB) out of the total available (15360MiB).</li>
<li><code style="color: inherit">GPU-Util</code>: The percentage of GPU utilization (0% indicates the GPU is idle).</li>
<li><code style="color: inherit">Compute M.</code>: The compute mode of the GPU (Default).</li>
<li><code style="color: inherit">Processes</code>: Lists any processes currently using the GPU. In this case, there are no running processes.</li>
</ul>
</details>
</blockquote>
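<p>The same information can be collected from Python by asking <code style="color: inherit">nvidia-smi</code> for machine-readable CSV output. The sketch below is one possible approach: the helper names (<code style="color: inherit">parse_gpu_csv</code>, <code style="color: inherit">query_gpus</code>) and the choice of queried fields are assumptions made for illustration.</p>

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV line.
QUERY = "name,memory.used,memory.total,utilization.gpu"


def _to_int(field):
    """Convert a numeric field, returning None for values such as '[N/A]'."""
    return int(field) if field.isdigit() else None


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, used, total, util = [field.strip() for field in line.split(",")]
        gpus.append({
            "name": name,
            "memory_used_mib": _to_int(used),
            "memory_total_mib": _to_int(total),
            "utilization_pct": _to_int(util),
        })
    return gpus


def query_gpus():
    """Return one dict per GPU, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


# Sample line matching the idle Tesla T4 shown above.
sample = parse_gpu_csv("Tesla T4, 2, 15360, 0")
free_mib = sample[0]["memory_total_mib"] - sample[0]["memory_used_mib"]
print(free_mib, "MiB free")
```

<p>On a machine with a GPU, calling <code style="color: inherit">query_gpus()</code> before and during training gives a quick way to follow memory usage without reading the full table.</p>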
<p>Let’s configure PyTorch and the CUDA environment (the software and hardware ecosystem provided by NVIDIA to enable parallel computing on GPUs) to optimize GPU memory usage and performance:</p>
<ol>
<li>
<p>Enable cuDNN benchmarking in PyTorch:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"> torch.backends.cudnn.benchmark=True
</code></pre></div>    </div>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-2"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What is cuDNN?</li>
<li>Why enable benchmarking?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-2"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-2" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>cuDNN (CUDA Deep Neural Network library) is NVIDIA’s GPU-accelerated library of primitives for deep neural networks.</li>
<li>Enabling benchmarking allows cuDNN to try several algorithms and select the fastest one for the specific GPU and input size. This can improve the performance of the model, especially for fixed-size inputs.</li>
</ol>
</details>
</blockquote>
</li>
<li>
<p>Set an environment variable that configures how PyTorch manages CUDA memory allocations:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"> os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
</code></pre></div>    </div>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-3"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What is this command doing?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-3"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-3" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>It sets the maximum split size for memory allocations to 32 megabytes: PyTorch’s CUDA caching allocator will not split memory blocks larger than this size. This can help reduce memory fragmentation and improve memory utilization, which is particularly useful when working with large models or limited GPU memory.</p>
</details>
</blockquote>
</li>
</ol>
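<p>The value of <code style="color: inherit">PYTORCH_CUDA_ALLOC_CONF</code> is a comma-separated list of <code style="color: inherit">option:value</code> pairs. The sketch below sets the variable and parses it back to make that format explicit. The helper <code style="color: inherit">parse_alloc_conf</code> is hypothetical and not part of PyTorch. It is also an assumption of this sketch that the variable should be set before any CUDA work starts; setting it before importing <code style="color: inherit">torch</code> is the most conservative choice.</p>

```python
import os


def parse_alloc_conf(value):
    """Split a PYTORCH_CUDA_ALLOC_CONF string into an {option: value} dict."""
    options = {}
    for item in filter(None, (part.strip() for part in value.split(","))):
        key, _, val = item.partition(":")
        options[key] = val
    return options


# Set the allocator option early, before any tensor is placed on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

conf = parse_alloc_conf(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
print(conf)
```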
<h1 id="prepare-the-model">Prepare the model</h1>
<h2 id="load-the-model">Load the model</h2>
<p>Let’s now load the model, <code style="color: inherit">Mistral-DNA</code>. The Mixtral model (<a href="https://huggingface.co/mistralai/Mixtral-8x7B-v0.1">Mixtral-8x7B-v0.1</a>), <a href="https://mistral.ai/news/mixtral-of-experts">a pretrained generative Sparse Mixture of Experts outperforming Llama 2 70B</a>, was modified to significantly reduce the number of parameters, mostly by removing layers, so that it can be trained on a GPU such as an RTX 3090.</p>
<p>We will get the model from GitHub:</p>


In [None]:
!git clone https://github.com/raphaelmourad/Mistral-DNA.git

<p>Let’s check if we have the model now:</p>


In [None]:
!ls

<p>We should get two folders: <code style="color: inherit">Mistral-DNA</code> and <code style="color: inherit">sample_data</code>. Let’s change the current working directory to <code style="color: inherit">Mistral-DNA/</code>:</p>


In [None]:
os.chdir("Mistral-DNA/")

<h2 id="choose-the-llm-architecture">Choose the LLM architecture</h2>
<p>Let’s look at the original architecture of <code style="color: inherit">Mixtral-8x7B-v0.1</code>, which is stored in the <code style="color: inherit">data/models/Mixtral-8x7B-v0.1</code> folder (<a href="https://github.com/raphaelmourad/Mistral-DNA/tree/main/data/models/Mixtral-8x7B-v0.1">GitHub</a>).</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-4"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>Which file is essential for configuring the language model?</li>
<li>What are the key parameters of the simplified architecture used here?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-4"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-4" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The <a href="https://github.com/raphaelmourad/Mistral-DNA/blob/main/data/models/Mixtral-8x7B-v0.1/config.json"><code style="color: inherit">config.json</code> file</a> is essential for configuring the language model as a Mixtral model. It specifies the architecture for causal language modeling (<code class="language-plaintext highlighter-rouge">MixtralForCausalLM</code>) and details the size of the neural network components. The original Mixtral model has a larger hidden size, but it is reduced here to make pre-training feasible.</li>
<li>The key parameters are:
<ul>
<li><strong>Intermediate Size</strong> (<code class="language-plaintext highlighter-rouge">intermediate_size</code>): Size of the intermediate (or hidden) layers within the model. It determines the number of neurons in these layers, influencing the model’s capacity to capture complex patterns in the data. A larger intermediate size can capture more nuanced details but also requires more computational resources. Set to 256, which is relatively small compared to the original model.</li>
<li><strong>Number of Attention Heads</strong> (<code class="language-plaintext highlighter-rouge">num_attention_heads</code>): Number of attention heads in the multi-head attention mechanism. Each head allows the model to focus on different parts of the input sequence simultaneously, capturing diverse aspects of the data. More attention heads can provide a richer representation but also increase computational complexity. Reduced to 8 for efficiency.</li>
<li><strong>Number of Experts per token</strong> (<code class="language-plaintext highlighter-rouge">num_experts_per_tok</code>): Specific to models that use a Mixture of Experts (MoE) architecture. It indicates the number of expert networks that are activated for each token in the input sequence. Experts are specialized sub-networks that handle different parts of the data, improving efficiency and performance, especially for large models. Set to 1 expert per token.</li>
<li><strong>Number of Local Experts</strong> (<code class="language-plaintext highlighter-rouge">num_local_experts</code>): Number of expert networks available in each Mixture of Experts layer. For every token, the gating network selects <code class="language-plaintext highlighter-rouge">num_experts_per_tok</code> of them, so only a fraction of the experts is computed per token. This helps manage computational resources more effectively, especially when dealing with large-scale data. Set to 64.</li>
<li><strong>Vocabulary Size</strong> (<code class="language-plaintext highlighter-rouge">vocab_size</code>): Specifically designed for DNA sequences, with a size of \(4,096 = 4^6\), as DNA consists of four possible letters (A, T, C, and G) and the words are 6-mers (sequences of six nucleotides). By modeling DNA using 6-mers, we capture meaningful patterns within the genetic sequence, enabling the model to understand and generate DNA data effectively.</li>
</ul>
</li>
</ol>
</details>
</blockquote>
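<p>The vocabulary size of \(4^6\) can be checked by enumerating every possible 6-mer. This is only a sketch of where the number comes from: the variable names are illustrative, and the vocabulary actually shipped with the tokenizer may be built and ordered differently.</p>

```python
from itertools import product

BASES = "ACGT"
K = 6

# Every possible k-mer over the four bases, in lexicographic order.
kmers = ["".join(p) for p in product(BASES, repeat=K)]

# A toy token-to-id mapping, one id per k-mer.
kmer_to_id = {kmer: idx for idx, kmer in enumerate(kmers)}

print(len(kmers))            # 4**6 = 4096
print(kmers[0], kmers[-1])   # first and last 6-mer in this ordering
```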
<p>Let’s load the configuration of the pre-trained model:</p>


In [None]:
config = AutoConfig.from_pretrained("data/models/Mixtral-8x7B-v0.1")

<p>By loading the configuration, we can inspect or modify the model’s architecture without loading the actual model weights. Let’s now initialize a causal language model from the loaded configuration object, with a specific attention implementation:</p>


In [None]:
model = AutoModelForCausalLM.from_config(config, attn_implementation="eager")

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-5"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What does <code style="color: inherit">attn_implementation="eager"</code> do?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-5"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-5" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p><code style="color: inherit">attn_implementation="eager"</code> specifies the attention implementation to use. Setting it to “eager” selects the plain PyTorch implementation, in which the attention scores, the softmax, and the weighted sum of values are computed explicitly, one operation after the other, rather than through a fused kernel such as <code style="color: inherit">sdpa</code> or <code style="color: inherit">flash_attention_2</code>. It is usually slower and uses more memory, but it runs on any hardware and is useful for debugging because every intermediate result can be inspected.</p>
</details>
</blockquote>
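<p>To show what an eager attention computation involves, here is a minimal single-head sketch in pure Python. It is an illustration under simplifying assumptions (no batching, no rotary embeddings, no learned projections, plain lists instead of tensors) and not the code used by the <code style="color: inherit">transformers</code> library.</p>

```python
import math


def eager_causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of vectors (one per position), all of dimension d.
    """
    d = len(q[0])
    outputs = []
    for i, q_i in enumerate(q):
        # Scores against positions 0..i only: later positions are masked out.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = eager_causal_attention(x, x, x)
print(out[0])  # the first position can only attend to itself
```

<p>Fused implementations compute the same result without storing the full matrix of attention weights, which is where their memory savings come from.</p>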
<p>What does the model look like?</p>


In [None]:
model

<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(4096, 256)
    (layers): ModuleList(
      (0-7): 8 x MixtralDecoderLayer(
        (self_attn): MixtralAttention(
          (q_proj): Linear(in_features=256, out_features=256, bias=False)
          (k_proj): Linear(in_features=256, out_features=256, bias=False)
          (v_proj): Linear(in_features=256, out_features=256, bias=False)
          (o_proj): Linear(in_features=256, out_features=256, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear(in_features=256, out_features=64, bias=False)
          (experts): ModuleList(
            (0-63): 64 x MixtralBlockSparseTop2MLP(
              (w1): Linear(in_features=256, out_features=256, bias=False)
              (w2): Linear(in_features=256, out_features=256, bias=False)
              (w3): Linear(in_features=256, out_features=256, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): MixtralRMSNorm((256,), eps=1e-05)
        (post_attention_layernorm): MixtralRMSNorm((256,), eps=1e-05)
      )
    )
    (norm): MixtralRMSNorm((256,), eps=1e-05)
  )
  (lm_head): Linear(in_features=256, out_features=4096, bias=False)
)
</code></pre></div></div>
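<p>The layer shapes in this printout are enough to estimate the size of the model by hand. The sketch below counts weights from those shapes, assuming that the embedding and the language model head do not share weights and that one expert is active per token, as described for this configuration. The variable names are illustrative.</p>

```python
hidden = 256           # width of embeddings and attention projections
intermediate = 256     # width inside each expert MLP
vocab = 4096
layers = 8
experts = 64
experts_per_token = 1  # num_experts_per_tok, as described for this configuration

attention = 4 * hidden * hidden      # q_proj, k_proj, v_proj, o_proj
gate = hidden * experts              # router scores, one per expert
expert = 3 * hidden * intermediate   # w1, w2, w3
norms = 2 * hidden                   # input and post-attention RMSNorm

per_layer_total = attention + gate + experts * expert + norms
per_layer_active = attention + gate + experts_per_token * expert + norms

# embed_tokens, final norm, lm_head (assumed untied from the embedding)
shared = vocab * hidden + hidden + hidden * vocab

total_params = layers * per_layer_total + shared
active_params = layers * per_layer_active + shared

print(f"total:  {total_params:,}")
print(f"active: {active_params:,}")
```

<p>Under these assumptions the model stores about 105 million weights, but only about 6 million of them take part in the computation for any given token, which illustrates the benefit of the sparse Mixture of Experts design.</p>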
<p>As expected, the model is a <code style="color: inherit">MixtralForCausalLM</code> model with several key components:</p>
<ol>
<li>
<p><strong>Embedding Layer (<code class="language-plaintext highlighter-rouge">embed_tokens</code>)</strong>: Converts input DNA sequences into dense vectors of fixed size. It maps each of the 4,096 (\(4^{6}\)) possible DNA tokens (representing 6-mers) to a 256-dimensional vector space. This embedding layer is crucial for transforming discrete DNA sequences into a format suitable for neural network processing.</p>
</li>
<li><strong>Decoder Layers (<code class="language-plaintext highlighter-rouge">layers</code>)</strong>: Consists of eight <code style="color: inherit">MixtralDecoderLayer</code> modules, each containing several sub-components:
<ul>
<li>
<p><strong>Self-Attention Mechanism (<code class="language-plaintext highlighter-rouge">self_attn</code>)</strong></p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-6"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What are the components?</li>
<li>What is the purpose?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-6"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-6" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The components are linear projections (<code class="language-plaintext highlighter-rouge">q_proj</code>, <code style="color: inherit">k_proj</code>, <code class="language-plaintext highlighter-rouge">v_proj</code>, <code style="color: inherit">o_proj</code>) for queries, keys, values, and outputs, along with a rotary embedding (<code class="language-plaintext highlighter-rouge">rotary_emb</code>) to incorporate positional information.</li>
<li>This allows the model to weigh the importance of different tokens in the sequence relative to each other, capturing dependencies and context.</li>
</ol>
</details>
</blockquote>
</li>
<li>
<p><strong>Sparse Mixture of Experts (<code class="language-plaintext highlighter-rouge">block_sparse_moe</code>)</strong>:</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-7"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What are the components?</li>
<li>What is the purpose?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-7"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-7" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The components are a gating mechanism (<code class="language-plaintext highlighter-rouge">gate</code>) and a list of 64 expert networks (<code class="language-plaintext highlighter-rouge">experts</code>), each with multiple linear layers (<code class="language-plaintext highlighter-rouge">w1</code>, <code style="color: inherit">w2</code>, <code style="color: inherit">w3</code>) and an activation function (<code class="language-plaintext highlighter-rouge">act_fn</code>).</li>
<li>This efficiently processes input data by activating only a subset of expert networks, reducing computational load while maintaining model capacity.</li>
</ol>
</details>
</blockquote>
</li>
<li>
<p><strong>Layer Normalization (<code class="language-plaintext highlighter-rouge">input_layernorm</code>, <code style="color: inherit">post_attention_layernorm</code>)</strong>: Stabilizes and accelerates the training process by normalizing the inputs and outputs of the attention mechanism.</p>
</li>
</ul>
</li>
<li>
<p><strong>Final Layer Normalization (<code class="language-plaintext highlighter-rouge">norm</code>)</strong>: Applies normalization to the output of the final decoder layer, ensuring stable and consistent outputs.</p>
</li>
<li><strong>Language Model Head (<code class="language-plaintext highlighter-rouge">lm_head</code>)</strong>: Projects the 256-dimensional output of the final decoder layer back into the 4,096-dimensional vocabulary space of DNA tokens. This linear layer (<code class="language-plaintext highlighter-rouge">Linear</code>) maps the hidden states to the original token space, enabling the model to predict the next DNA token accurately.</li>
</ol>
<p>This architecture allows the model to capture complex patterns in DNA sequences while maintaining computational efficiency, making it suitable for tasks like DNA sequence generation and analysis. The model’s output has 4,096 dimensions, one score per token in the vocabulary, which matches the size of the input vocabulary. This consistency is what lets the model predict the next token in a given DNA sequence.</p>
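<p>To illustrate how a sparse mixture of experts activates only a subset of its experts, here is a minimal pure-Python sketch of top-k routing. It is not the <code style="color: inherit">MixtralDecoderLayer</code> implementation: the toy experts, the gate scores, and the choice of <code style="color: inherit">top_k=2</code> are assumptions made for illustration.</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, gate_scores, experts, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_token(gate_scores, top_k))

# Toy experts: each one just scales its input (stand-ins for the w1/w2/w3 networks).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, 0.3, 2.0]  # hypothetical gate outputs for one token

selected = route_token(gate_scores)
output = moe_forward(1.0, gate_scores, experts)
print(selected, output)
```

<p>Only the two selected experts are evaluated for this token; the others are skipped, which is where the computational saving comes from.</p>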
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-8"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>How many parameters are in this model?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-8"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-8" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">pytorch_total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {pytorch_total_params/1000**2:.1f}M parameters")
</code></pre></div>    </div>
<p>There are 105 million parameters. It is a big model.</p>
</details>
</blockquote>
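<p>As a rough sanity check on where the parameters sit, the sizes of the embedding layer and the language model head can be worked out by hand from the dimensions given above. This sketch assumes that <code style="color: inherit">lm_head</code> has no bias term and does not share weights with <code style="color: inherit">embed_tokens</code>; the 105 million total is the approximate figure reported in the solution.</p>

```python
vocab_size = 4096    # number of tokens in the vocabulary (from the tutorial)
hidden_size = 256    # embedding dimension (from the tutorial)

embed_params = vocab_size * hidden_size    # embed_tokens lookup table
lm_head_params = hidden_size * vocab_size  # lm_head, assuming no bias term
total_params = 105_000_000                 # approximate total reported above

share = (embed_params + lm_head_params) / total_params
print(f"embed_tokens: {embed_params:,}")
print(f"lm_head:      {lm_head_params:,}")
print(f"share of total: {share:.1%}")
```

<p>Under these assumptions the two layers account for only a small fraction of the total, so most parameters sit in the decoder layers.</p>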
<h1 id="prepare-the-tokenizer">Prepare the tokenizer</h1>
<p>A tokenizer is a crucial component in natural language processing (NLP) that transforms raw text into a format that can be processed by machine learning models. In this section, we will load and configure the <strong>Byte-Pair Encoding (BPE) letter tokenizer</strong>. The BPE tokenizer efficiently handles rare and unknown words by breaking them down into frequent subword units, ensuring that the model can generalize better to unseen data. This process involves initializing the tokenizer with a predefined vocabulary and settings, enabling it to convert text into a format suitable for neural network processing. By doing so, we prepare the tokenizer to effectively manage DNA sequences, facilitating accurate and reliable model predictions.</p>
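<p>To make the idea of BPE merges concrete, the sketch below runs a few merge steps on two short DNA strings: it repeatedly finds the most frequent pair of adjacent tokens and fuses it into a new token. This is a toy illustration written for this tutorial, not the training procedure of the tokenizer loaded below; the function names and the example sequences are assumptions.</p>

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs over all sequences and return the most common one."""
    pairs = Counter()
    for tokens in sequences:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` by a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single nucleotides (toy sequences).
corpus = [list("TAACCCTAACCC"), list("CCCTAACCCTAA")]
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = [merge_pair(tokens, pair) for tokens in corpus]
    print(pair, corpus[0])
```

<p>After each merge the sequences are described with fewer, longer tokens, which is why a BPE vocabulary contains subsequences of varying length.</p>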
<p>Let’s load a pre-trained tokenizer from the Hugging Face Model Hub. The tokenizer is associated with the model <code style="color: inherit">DNABERT-2-117M</code>, which is designed for processing DNA sequences.</p>


In [None]:
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-9"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What does the above command do?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-9"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-9" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li><code style="color: inherit">AutoTokenizer.from_pretrained</code> automatically identifies and loads the appropriate tokenizer for the specified model.</li>
<li><code style="color: inherit">trust_remote_code=True</code> allows loading a tokenizer whose repository ships custom Python code, which is then executed locally. It is necessary when the tokenizer requires such custom code to function correctly, and it should only be enabled for repositories you trust.</li>
</ul>
</details>
</blockquote>
<p>Let’s look at the created tokenizer now:</p>


In [None]:
print(tokenizer)

<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
</code></pre></div></div>
<p>The <code style="color: inherit">PreTrainedTokenizerFast</code> is a fast and efficient tokenizer used to process text data for the <code style="color: inherit">DNABERT-2-117M</code> model. Here’s a breakdown of its configuration:</p>
<ul>
<li>
<p><code style="color: inherit">name_or_path='zhihan1996/DNABERT-2-117M'</code>: Specifies the name or path of the pre-trained tokenizer, indicating that it is associated with the <code style="color: inherit">DNABERT-2-117M</code> model, which is designed for processing DNA sequences.</p>
</li>
<li>
<p><code style="color: inherit">vocab_size=4096</code>: Defines the size of the tokenizer’s vocabulary.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-10"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>Why is the size of the tokenizer’s vocabulary set to 4,096?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-10"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-10" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
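<p>It corresponds to the number of unique tokens that the model can recognize in DNA sequences. The value equals the number of possible 6-mers (\(4^{6}\)), but because this is a BPE tokenizer, the tokens are subsequences of varying length rather than strictly 6-mers.</p>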
</details>
</blockquote>
</li>
<li>
<p><code style="color: inherit">special_tokens</code>: Defines a set of special tokens used by the tokenizer:</p>
<ul>
<li><code style="color: inherit">unk_token: '[UNK]'</code> - Represents unknown or out-of-vocabulary tokens.</li>
<li><code style="color: inherit">sep_token: '[SEP]'</code> - Used to separate segments within a sequence.</li>
<li><code style="color: inherit">pad_token: '[PAD]'</code> - Used for padding sequences to a uniform length.</li>
<li><code style="color: inherit">cls_token: '[CLS]'</code> - Typically used as the first token in a sequence to represent the classification token.</li>
<li><code style="color: inherit">mask_token: '[MASK]'</code> - Used in masked language modeling to hide tokens that the model must predict.</li>
</ul>
</li>
</ul>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-11"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What do the other configuration parameters mean?</p>
<ol>
<li><code style="color: inherit">model_max_length=1000000000000000019884624838656</code></li>
<li><code style="color: inherit">is_fast=True</code></li>
<li><code style="color: inherit">padding_side='right'</code></li>
<li><code style="color: inherit">truncation_side='right'</code></li>
<li><code style="color: inherit">clean_up_tokenization_spaces=False</code></li>
<li><code style="color: inherit">added_tokens_decoder</code></li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-11"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-11" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>
<p><code style="color: inherit">model_max_length=1000000000000000019884624838656</code>: Represents the maximum length of sequences that the model can handle.</p>
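<p>This extremely large value is the placeholder that <code style="color: inherit">transformers</code> uses when no maximum length is defined in the tokenizer configuration, so the tokenizer itself imposes no limit. In practice, the actual limit will be constrained by the model and by available computational resources.</p>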
</li>
<li><code style="color: inherit">is_fast=True</code>: Indicates that this tokenizer is optimized for speed, leveraging Rust-based implementations to accelerate tokenization processes.</li>
<li><code style="color: inherit">padding_side='right'</code>: Configures the tokenizer to pad sequences on the right side, ensuring that all sequences in a batch have the same length by adding padding tokens to the end of shorter sequences.</li>
<li><code style="color: inherit">truncation_side='right'</code>: Specifies that sequences will be truncated from the right side if they exceed the maximum length, preserving the beginning of the sequence.</li>
<li><code style="color: inherit">clean_up_tokenization_spaces=False</code>: Indicates that, when decoding token IDs back to text, the tokenizer will not clean up tokenization spaces (for example, spaces before punctuation), leaving the decoded text as is.</li>
<li><code style="color: inherit">added_tokens_decoder</code>: Maps token IDs to their corresponding <code style="color: inherit">AddedToken</code> objects, which include metadata such as whether the token is a special token and how it should be processed (e.g., stripping whitespace).</li>
</ol>
</details>
</blockquote>
<p>This configuration ensures that the tokenizer is tailored to efficiently process DNA sequences, handling both the tokenization and padding/truncation of sequences in a manner that aligns with the model’s requirements.</p>
<p>By default, this tokenizer pads sequences on the right side (<code class="language-plaintext highlighter-rouge">padding_side='right'</code>). Let’s set the padding direction to the left instead.</p>


In [None]:
tokenizer.padding_side = "left"

<p>When tokenizing a batch of sequences, shorter sequences will be padded with special tokens on the left to match the length of the longest sequence in the batch. This can be useful for ensuring consistent input sizes, especially in models that expect fixed-size inputs.</p>
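<p>The effect of the padding side can be sketched without the tokenizer itself. The helper below pads lists of token IDs on the left or on the right and builds the matching attention mask. The function name and the token IDs are illustrative assumptions; <code style="color: inherit">3</code> is the ID of <code style="color: inherit">[PAD]</code> in the tokenizer printed earlier.</p>

```python
PAD_ID = 3  # id of [PAD] in the tokenizer printed earlier

def pad_batch(batch, side="left", pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one and build attention masks."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        pad, mask = [pad_id] * n_pad, [1] * len(ids)
        if side == "left":
            input_ids.append(pad + ids)
            attention_mask.append([0] * n_pad + mask)
        else:
            input_ids.append(ids + pad)
            attention_mask.append(mask + [0] * n_pad)
    return input_ids, attention_mask

# Illustrative token ids for one short and one longer sequence.
batch = [[1, 2061, 2], [1, 2061, 281, 485, 2]]
left_ids, left_mask = pad_batch(batch, side="left")
right_ids, right_mask = pad_batch(batch, side="right")
print(left_ids[0], left_mask[0])
print(right_ids[0], right_mask[0])
```

<p>With left padding, the last position of every row holds a real token, which is convenient for a causal model that predicts what comes next.</p>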
<p>Let’s look at how some DNA sequences are encoded by the tokenizer. We start with a simple sequence “ATT”:</p>


In [None]:
encoding = tokenizer("ATT", padding="longest", return_tensors="pt")
print(encoding)

<p>The code tokenizes the DNA sequence “ATT”, pads it to the longest sequence in the batch (<code class="language-plaintext highlighter-rouge">padding="longest"</code>), and returns the result as PyTorch tensors (<code class="language-plaintext highlighter-rouge">return_tensors="pt"</code>).</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">{'input_ids': tensor([[   1, 2061,    2]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}
</code></pre></div></div>
<p>Here’s a breakdown of each output component:</p>
<ul>
<li><code style="color: inherit">input_ids</code>: A tensor containing the token IDs for the sequence. Each number corresponds to a specific token in the tokenizer’s vocabulary. In this case, <code style="color: inherit">[1, 2061, 2]</code> represents the tokens for the sequence:
<ul>
<li><code style="color: inherit">1</code>: the beginning of the sentence (<code class="language-plaintext highlighter-rouge">[CLS]</code>)</li>
<li><code style="color: inherit">2061</code>: the sentence itself (<code class="language-plaintext highlighter-rouge">ATT</code>)</li>
<li><code style="color: inherit">2</code>: the end of the sentence, a separator between sentences (<code class="language-plaintext highlighter-rouge">[SEP]</code>).</li>
</ul>
</li>
<li>
<p><code style="color: inherit">token_type_ids</code>: A tensor indicating the type of each token, often used in models that process multiple segments (e.g., question-answering). Here, all tokens are of type <code style="color: inherit">0</code>, suggesting a single segment.</p>
</li>
<li><code style="color: inherit">attention_mask</code>: A tensor that specifies which tokens should be attended to by the model (<code style="color: inherit">1</code> for real tokens, <code style="color: inherit">0</code> for padding). In this case, all tokens are valid, so the mask is <code style="color: inherit">[1, 1, 1]</code>.</li>
</ul>
<p>This encoded format is ready for input into a transformer model, ensuring that the sequence is correctly processed and understood by the model.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-12"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What is the encoding for “ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC”? Specify that the tokenized sequence should have a maximum length of 5 tokens and ensure that the sequence is padded to the specified <code style="color: inherit">max_length</code> of 5 tokens.</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-12"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-12" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li>To specify that the tokenized sequence should have a maximum length of 5 tokens, set <code style="color: inherit">max_length=5</code> together with <code style="color: inherit">truncation=True</code>: if the sequence is longer, it will be truncated.</li>
<li>To ensure that the sequence is padded to the specified <code style="color: inherit">max_length</code> of 5 tokens, add <code style="color: inherit">padding='max_length'</code>: if the sequence is shorter, padding tokens will be added.</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">encoding = tokenizer("ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC", max_length=5, padding='max_length', truncation=True, return_tensors="pt")
print(encoding)
</code></pre></div>    </div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">{'input_ids': tensor([[   1, 2061,  281,  485,    2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
</code></pre></div>    </div>
<p>In this case, <code style="color: inherit">[1, 2061, 281, 485, 2]</code> represents the tokens for the sequence: the special tokens [CLS] (<code style="color: inherit">1</code>) and [SEP] (<code style="color: inherit">2</code>) surround the first three sequence tokens, which are all that fit within the limit of 5. As before, all tokens are of type <code style="color: inherit">0</code>, suggesting a single segment, and all are real tokens, so the mask is <code style="color: inherit">[1, 1, 1, 1, 1]</code>.</p>
</details>
</blockquote>
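<p>The interplay between <code style="color: inherit">max_length</code>, truncation, and padding can be sketched in plain Python. The helper below is a simplified stand-in for what the tokenizer does, not its actual implementation: it reserves two positions for [CLS] and [SEP], truncates the remaining tokens, and pads if the result is too short. The function name and the token IDs <code style="color: inherit">7</code> and <code style="color: inherit">8</code> are made-up placeholders.</p>

```python
CLS_ID, SEP_ID, PAD_ID = 1, 2, 3  # special token ids from the tokenizer printed earlier

def encode_to_length(content_ids, max_length, pad_side="left"):
    """Truncate content so that [CLS] + content + [SEP] fits, then pad to max_length."""
    room = max_length - 2              # two positions are reserved for [CLS] and [SEP]
    ids = [CLS_ID] + content_ids[:room] + [SEP_ID]
    n_pad = max_length - len(ids)
    mask = [1] * len(ids)
    if pad_side == "left":
        return [PAD_ID] * n_pad + ids, [0] * n_pad + mask
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

# First three ids come from the output above; 7 and 8 are made-up placeholders.
long_content = [2061, 281, 485, 7, 8]
long_ids, long_mask = encode_to_length(long_content, max_length=5)
short_ids, short_mask = encode_to_length([2061], max_length=5)
print(long_ids, long_mask)
print(short_ids, short_mask)
```

<p>The long input is cut to three content tokens and needs no padding, while the short one is padded on the left because the padding side was set to <code style="color: inherit">"left"</code> earlier.</p>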
<h1 id="prepare-data">Prepare data</h1>
<p>We will now prepare the data.</p>
<h2 id="load-data">Load data</h2>
<p>First we load the data. We will not use the whole human genome here because it comprises too many sequences. Instead, we use a small subset, which is less than 1% of the sequences from the human genome.</p>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-pre-trained-model-on-the-whole-human-genome"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Pre-trained model on the whole human genome</div>
<p>A compact DNA model with approximately 1 million parameters that has been trained on the entire human genome can be found on <a href="https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-1M-hg38">Hugging Face</a>.</p>
</blockquote>
<p>We use the <code style="color: inherit">load_dataset</code> function from the <code style="color: inherit">datasets</code> library. This function is commonly used for loading data for Hugging Face Transformers.</p>


In [None]:
dataset_text = load_dataset("csv", data_files="data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz")

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-13"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>How is <code style="color: inherit">dataset_text</code> structured?</li>
<li>What are the first 5 sequences of the train dataset?</li>
<li>How long are the sequences?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-13"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-13" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li><code style="color: inherit">dataset_text</code> is a <code style="color: inherit">DatasetDict</code> with a <code style="color: inherit">train</code> <code style="color: inherit">Dataset</code> containing 1 feature (<code class="language-plaintext highlighter-rouge">'text'</code>) of 99,999 rows (obtained with <code style="color: inherit">dataset_text</code>)</li>
<li>
<p>To get the first 5 sequences of the train dataset:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">dataset_text['train']['text'][0:5]
</code></pre></div>        </div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">['TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAA',
'CCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCC',
'TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCT',
'GAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGA',
'CACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC']
</code></pre></div>        </div>
</li>
<li>
<p>The sequences are 200 base pairs long:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">len(dataset_text['train']['text'][0])
</code></pre></div>        </div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">200
</code></pre></div>        </div>
</li>
</ol>
</details>
</blockquote>
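<p>To see what kind of file is being loaded, the sketch below writes and reads back a tiny gzip-compressed CSV with a single <code style="color: inherit">text</code> column, using only the standard library. It is a stand-in for inspection purposes, not how <code style="color: inherit">load_dataset</code> works internally; the file name and the shortened sequences are assumptions.</p>

```python
import csv
import gzip
import os
import tempfile

rows = ["TAACCCTAACCC", "CCCTAACCCTAA", "GAGGAGAACGCA"]  # shortened toy sequences

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_sequences.csv.gz")  # hypothetical file name
    # Write a one-column CSV with a "text" header, compressed with gzip.
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text"])
        writer.writerows([seq] for seq in rows)
    # Read it back as a list of dictionaries keyed by the header.
    with gzip.open(path, "rt", newline="") as handle:
        records = list(csv.DictReader(handle))

texts = [record["text"] for record in records]
lengths = {len(seq) for seq in texts}
print(len(texts), lengths)
```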
<h2 id="tokenize-data">Tokenize data</h2>
<p>Let’s tokenize the data. First, we create a function that tokenizes a text using the BPE letter tokenizer:</p>


In [None]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="longest", truncation=True, return_tensors="pt")

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-14"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What do the following parameters do?</p>
<ol>
<li><code style="color: inherit">padding="longest"</code></li>
<li><code style="color: inherit">truncation=True</code></li>
<li><code style="color: inherit">return_tensors="pt"</code></li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-14"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-14" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li><code style="color: inherit">padding="longest"</code> ensures that all sequences in the batch are padded to the length of the longest sequence, adding padding tokens as needed.</li>
<li><code style="color: inherit">truncation=True</code> specifies that sequences exceeding the model’s maximum length will be truncated to fit.</li>
<li><code style="color: inherit">return_tensors="pt"</code> indicates that the output should be in the form of PyTorch tensors, suitable for use with PyTorch-based models.</li>
</ol>
</details>
</blockquote>
<p>We can now apply this function to the loaded dataset:</p>


In [None]:
dataset = dataset_text.map(tokenize_function, batched=True)

<p>It is quite fast for the almost 100,000 sequences of length 200 bp.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-15"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>How is <code style="color: inherit">dataset</code> structured?</li>
<li>What is in the first tokenized sequence of <code style="color: inherit">train</code> <code class="language-plaintext highlighter-rouge">Dataset</code>?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-15"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-15" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li><code style="color: inherit">dataset</code> is
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 99999
    })
})
</code></pre></div>        </div>
<p><code style="color: inherit">dataset</code> is a <code style="color: inherit">DatasetDict</code> with 1 <code style="color: inherit">train</code> <code style="color: inherit">Dataset</code> made of 99,999 rows and 4 features:</p>
<ul>
<li><code style="color: inherit">text</code>: The original text data before tokenization.</li>
<li><code style="color: inherit">input_ids</code>: The tokenized input data, represented as numerical IDs.</li>
<li><code style="color: inherit">token_type_ids</code>: Indicates the type of each token, useful for models that handle multiple segments.</li>
<li><code style="color: inherit">attention_mask</code>: Specifies which tokens should be attended to by the model (<code style="color: inherit">1</code> for real tokens, <code style="color: inherit">0</code> for padding).</li>
</ul>
</li>
<li>The first tokenized sequence of <code style="color: inherit">train</code> <code style="color: inherit">Dataset</code> (<code class="language-plaintext highlighter-rouge">dataset["train"][1]</code>) is a dictionary with:
<ul>
<li><code style="color: inherit">text</code>: 200 base pair sequence</li>
<li><code style="color: inherit">input_ids</code>: list of 49 numerical values, the token IDs.</li>
<li><code style="color: inherit">token_type_ids</code>: list of 49 <code style="color: inherit">0</code></li>
<li><code style="color: inherit">attention_mask</code>: list of 7 <code style="color: inherit">0</code> (padding) and 42 <code style="color: inherit">1</code> (real tokens)</li>
</ul>
</li>
</ol>
</details>
</blockquote>
<h2 id="split-data">Split data</h2>
<p>We will now split data between training and validation sets randomly. This is a crucial step in machine learning to ensure the model can generalize to unseen data.</p>
<p>For that, 80% of the entire data will be used for the training set and the remaining 20% will go into the validation set. We first compute the size of training and validation sets:</p>


In [None]:
train_size = int(0.8 * len(dataset["train"]))
val_size = len(dataset["train"]) - train_size

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-16"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>How big are training and validation sets?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-16"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-16" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>The training set has 79,999 sequences and the validation set has 20,000.</p>
</details>
</blockquote>
<p>To perform the actual splitting of the training dataset into two subsets, we use the <code style="color: inherit">torch.utils.data.random_split</code> function from the PyTorch library that randomly splits a dataset into subsets.</p>


In [None]:
train_set, val_set = torch.utils.data.random_split(dataset["train"], [train_size, val_size])
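
<p>Conceptually, a random split shuffles the row indices and cuts the shuffled list at <code style="color: inherit">train_size</code>. The sketch below reproduces the size computation and the split with the standard library only; the seed and variable names are assumptions, and it illustrates the idea rather than the PyTorch implementation.</p>

```python
import random

n_rows = 99_999                      # number of rows in the train Dataset
train_size = int(0.8 * n_rows)       # 80% for training, rounded down
val_size = n_rows - train_size       # the rest for validation

indices = list(range(n_rows))
random.Random(42).shuffle(indices)   # fixed seed (an assumption) for reproducibility
train_idx, val_idx = indices[:train_size], indices[train_size:]

print(len(train_idx), len(val_idx))
```

<p>Every row ends up in exactly one of the two subsets, so the validation sequences are never seen during training.</p>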

<h2 id="data-collation">Data Collation</h2>
<p>The <code style="color: inherit">DataCollatorForLanguageModeling</code> is a utility class, designed to prepare and format batches of data for language modeling tasks. It handles the dynamic padding and masking of input sequences, ensuring that each batch fed into the model is correctly formatted and optimized for training.</p>


In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-17"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What are the different parameters?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-17"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-17" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ul>
<li><code style="color: inherit">tokenizer=tokenizer</code> specifies the tokenizer to be used for processing the input data. The tokenizer converts raw text into numerical tokens that the model can understand.</li>
<li><code style="color: inherit">mlm=False</code>: Indicates that the data collator is set up for causal language modeling (CLM) rather than masked language modeling (MLM).</li>
</ul>
</details>
</blockquote>
<p>This will:</p>
<ol>
<li>Automatically pad sequences within a batch to ensure they are of equal length, which is necessary for efficient batch processing in neural networks.</li>
<li>Generate attention masks that indicate which tokens should be attended to by the model, ignoring padding tokens.</li>
<li>Collate individual examples into batches, handling the necessary formatting and ensuring compatibility with the model’s input requirements.</li>
</ol>
<p>The <code style="color: inherit">DataCollatorForLanguageModeling</code> is typically used in conjunction with a <code style="color: inherit">Trainer</code> from the Hugging Face library. It simplifies the data preparation process, allowing you to focus on model training and evaluation without worrying about the intricacies of batch formatting.</p>
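<p>What collation for causal language modeling produces can be sketched in plain Python: the batch is padded, and the labels are a copy of the input IDs in which padding positions are replaced by <code style="color: inherit">-100</code> so that they are ignored by the loss. This is a simplified illustration, not the <code style="color: inherit">DataCollatorForLanguageModeling</code> code; the function name and example IDs are assumptions.</p>

```python
PAD_ID = 3           # id of [PAD] in the tokenizer
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def collate_causal_lm(examples, pad_side="left"):
    """Pad a list of token-id lists and derive attention masks and labels."""
    width = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = width - len(ids)
        if pad_side == "left":
            padded = [PAD_ID] * n_pad + ids
            mask = [0] * n_pad + [1] * len(ids)
        else:
            padded = ids + [PAD_ID] * n_pad
            mask = [1] * len(ids) + [0] * n_pad
        batch["input_ids"].append(padded)
        batch["attention_mask"].append(mask)
        # Labels mirror the inputs; padding positions do not contribute to the loss.
        batch["labels"].append([IGNORE_INDEX if t == PAD_ID else t for t in padded])
    return batch

batch = collate_causal_lm([[1, 2061, 2], [1, 2061, 281, 485, 2]])
print(batch["input_ids"][0])
print(batch["labels"][0])
```

<p>The labels are not shifted here; for causal models the shift by one position is applied inside the model when the loss is computed.</p>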
<h1 id="train-the-model">Train the model</h1>
<h2 id="define-parameters-for-pretraining">Define parameters for pretraining</h2>
<p>We are now going to define the hyperparameters and configurations for training the language model using the Hugging Face <code style="color: inherit">transformers</code> library.</p>
<p>First, we specify the batch size for training and evaluation. A batch size of 32 means that 32 samples are processed in each forward and backward pass on a device. Because <code style="color: inherit">gradient_accumulation_steps=50</code> is also set below, gradients from 50 such batches are accumulated before the model updates its weights. This size is chosen to balance computational efficiency and memory usage.</p>


In [None]:
batchsize=32
training_args = TrainingArguments(
  output_dir="./results/models",
  eval_strategy="epoch",
  save_strategy="epoch",
  num_train_epochs=50,
  per_device_train_batch_size=batchsize,
  per_device_eval_batch_size=batchsize,
  learning_rate=5e-4,
  weight_decay=0.01,
  logging_dir="./logs",
  load_best_model_at_end=True,
  bf16=True,
  gradient_accumulation_steps=50,
  report_to="none",
)

<ul>
<li><code style="color: inherit">output_dir="./results/models"</code>: directory where the training outputs, including model checkpoints and results, will be saved.</li>
<li><code style="color: inherit">eval_strategy="epoch"</code> indicates that the model’s performance will be evaluated at the end of each epoch, a complete pass through the entire training dataset. This allows for monitoring the model’s progress and adjusting the training process as needed.</li>
<li><code style="color: inherit">save_strategy="epoch"</code> specifies that the model will be saved at the end of each epoch. This ensures that checkpoints are available for each complete pass through the dataset.</li>
<li><code style="color: inherit">num_train_epochs=50</code> sets the total number of training epochs to 50. This means the model will iterate over the entire dataset 50 times, allowing it to learn and optimize over multiple passes.</li>
<li><code style="color: inherit">per_device_train_batch_size=batchsize</code> and <code style="color: inherit">per_device_eval_batch_size=batchsize</code> set the batch size for training and evaluation on each device (e.g., GPU) to 32. This ensures consistency in batch processing across different stages of training and evaluation.</li>
<li><code style="color: inherit">learning_rate=5e-4</code> defines the learning rate for the optimizer, set to \(5 \times 10^{-4}\). This rate controls the step size during gradient descent and is a common choice for pre-training models.</li>
<li><code style="color: inherit">weight_decay=0.01</code> applies weight decay, a regularization closely related to L2 regularization, with a standard decay rate of 0.01. This helps prevent overfitting by penalizing large weights.</li>
<li><code style="color: inherit">logging_dir="./logs"</code> specifies the directory where training logs will be stored, allowing for monitoring and analysis of the training process.</li>
<li><code style="color: inherit">load_best_model_at_end=True</code> ensures that the best model, i.e. the one with the lowest evaluation loss, is loaded at the end of training. During training, the evaluation loss first decreases and, at some point, may start to increase again. We want to keep the model from the epoch with the lowest loss, not the one from the last epoch, so “load best model at end” means selecting the model with the best loss across all epochs.</li>
<li><code style="color: inherit">bf16=True</code> enables mixed-precision training using the 16-bit bfloat16 floating-point format. This reduces memory usage and can speed up training on compatible hardware.</li>
<li><code style="color: inherit">gradient_accumulation_steps=50</code> accumulates gradients over 50 batches before updating the model weights. This effectively increases the batch size (here to 32 × 50 = 1,600 samples per device) without requiring additional memory, helping to stabilize training.</li>
<li>
<p><code style="color: inherit">report_to="none"</code> disables reporting to all experiment-tracking integrations, including <a href="https://wandb.ai/">Weights &amp; Biases (WandB)</a>, a popular platform used for experiment tracking, dataset versioning, and model management in machine learning.</p>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-why-disable-wandb"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Why Disable WandB?</div>
<p>Disabling WandB is often done in specific scenarios:</p>
<ul>
<li>Avoiding Unwanted Logging: If we do not intend to use WandB for tracking our experiments or if we want to avoid potential conflicts with other logging mechanisms, we would disable it.</li>
<li>Reducing Overhead: WandB logging can introduce some overhead, particularly when dealing with large datasets or complex models. Disabling it can slightly improve performance if tracking is not essential.</li>
<li>Testing/Debugging: During testing or debugging, we might prefer to have more control over logging or we might want to avoid cluttering our WandB workspace with intermediate results.</li>
</ul>
</blockquote>
</li>
</ul>
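<p>The interaction between the batch size and gradient accumulation can be checked with a few lines of arithmetic. The sketch below uses the two values from the training arguments above; the number of devices and the training-set size are hypothetical values chosen only for illustration, and the number of updates per epoch is approximate because the handling of a final incomplete accumulation group can vary.</p>

```python
import math

# Values taken from the TrainingArguments above.
per_device_batch_size = 32
gradient_accumulation_steps = 50

# Assumptions for this sketch (not from the tutorial):
num_devices = 1             # a single GPU
num_train_samples = 80_000  # hypothetical training-set size

# Number of samples that contribute to one weight update.
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)

# Number of batches (forward/backward passes) in one epoch.
batches_per_epoch = math.ceil(
    num_train_samples / (per_device_batch_size * num_devices)
)

# Approximate number of weight updates in one epoch.
updates_per_epoch = max(1, batches_per_epoch // gradient_accumulation_steps)

print(effective_batch_size)  # 1600
print(batches_per_epoch)     # 2500
print(updates_per_epoch)     # 50
```

<p>With these settings, the weights are updated only once every 50 batches, so each update is based on 1,600 samples even though only 32 samples are held in memory at a time.</p>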
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-18"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What is stored in <code style="color: inherit">training_args</code>: the parameters of the model, the parameters of the LLM, or the parameters of the trainer function?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-18"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-18" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>The parameters of the trainer function.</p>
</details>
</blockquote>
<h2 id="pretrain-the-model">Pretrain the model</h2>
<p>Here is the most important part: the pre-training process. For this, we will use the <code style="color: inherit">Trainer</code>. It takes as input the model that we built previously, which has an architecture but whose weights have not been trained yet.</p>


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_set,
    eval_dataset=val_set,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

<p>The <code style="color: inherit">Trainer</code> also takes:</p>
<ul>
<li><code style="color: inherit">args</code>: the training arguments we configured earlier</li>
<li><code style="color: inherit">data_collator</code>: the data collator function feeding the tokenized data sequences to the model.</li>
<li><code style="color: inherit">train_dataset</code>: the training set, i.e. the data used for computing the gradients</li>
<li><code style="color: inherit">eval_dataset</code>: the validation set, i.e. the data used to assess the model’s performance (here, the evaluation loss) at each epoch. It’s important to use a validation set that is independent of the training set to ensure unbiased evaluation.</li>
<li>
<p><code style="color: inherit">callbacks</code>: <code style="color: inherit">EarlyStoppingCallback</code> with a patience of three is used to monitor the training process.</p>
<p>During training, we minimize the loss at each step. However, at some point, the validation loss may start to increase again. We want to capture the model parameters when this loss reaches its minimum. By using a patience of three, we aim to mitigate the effects of noise during training. Noise can cause fluctuations in the loss, making it seem like we’ve reached a local minimum when a better one might be found with further training.</p>
<p>With a patience of three, even if we find a good minimum, we wait for three more epochs to ensure that the loss does not improve further. If the loss does not decrease for three consecutive epochs, we stop training. However, if a better model with a lower loss is found within those three epochs, training continues. This approach helps in finding a more robust local minimum by reducing the impact of noise in the training data.</p>
</li>
</ul>
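<p>The patience mechanism described above can be sketched in a few lines of plain Python. This is a simplified illustration of the logic, not the code of <code style="color: inherit">EarlyStoppingCallback</code>: the function <code style="color: inherit">early_stopping_epoch</code> and the validation losses are made up for this example.</p>

```python
def early_stopping_epoch(eval_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is the epoch after which training stops, or None if the
    patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            # New best model: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses, one per epoch.
losses = [2.10, 1.80, 1.65, 1.70, 1.60, 1.62, 1.61, 1.63, 1.50]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 8 5
```

<p>In this example, the temporary increase at epoch 4 does not stop training because the loss improves again at epoch 5. Training stops after epoch 8, once three consecutive epochs have failed to beat the best loss, and the model from epoch 5 is the one to keep.</p>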
<p>Let’s launch the training with the <code style="color: inherit">trainer.train()</code> method:</p>


In [None]:
trainer.train()

<p>Here, the trainer is set to run for 50 epochs. Once training has started, we get an estimate of the time per epoch, which gives an idea of the total training duration. Let’s run it for a bit to see how long it takes.</p>
<p>With this small model and dataset, the estimated time to run 50 epochs is 20 hours (this value changes depending on the infrastructure).</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-19"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>Will the model be trained for all 50 epochs?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-19"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-19" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>Setting the number of epochs to 50 doesn’t mean the model will train for all 50 epochs. It’s likely to stop earlier.</p>
</details>
</blockquote>
<p>The 50 epochs serve as a maximum limit. Thanks to the early stopping callback, training will stop earlier if the evaluation loss reaches a minimum and then stops improving. This means the model might only require half the epochs, perhaps 25 epochs or 10 hours, to achieve optimal performance.</p>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-don-t-train-until-the-end"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Don't train until the end</div>
<p>The idea here is not to train the model until completion, as it would take too much time.</p>
</blockquote>
<p>Let’s stop the actual training and cheat a bit by loading a previously <a href="https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-17M-hg38">trained Mistral model</a>:</p>


In [None]:
model = AutoModelForCausalLM.from_pretrained("RaphaelMourad/Mistral-DNA-v1-17M-hg38")

<p>This is a mixed model that was pre-trained on the entire human genome (assembly GRCh38). It contains approximately 17 million parameters. Unlike models pre-trained on sequences of 200 bases, this model was pre-trained on sequences of 10,000 bases (10K), which allows it to process larger DNA contexts and to capture more extensive patterns and dependencies within the genomic data.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-20"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>By looking at the output of:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">model
</code></pre></div>  </div>
<ol>
<li>How many transformer layers does this model have?</li>
<li>Is it similar to the previous model?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-20"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-20" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>8 transformer layers</li>
<li>Yes</li>
</ol>
</details>
</blockquote>
<h1 id="compute-the-embedding-of-a-dna-sequence">Compute the embedding of a DNA sequence</h1>
<p>With this kind of model, we can convert a DNA sequence into a numerical vector.</p>
<p>Let’s:</p>
<ol>
<li>Take a DNA sequence.</li>
<li>Tokenize the DNA sequence using the tokenizer created before.</li>
<li>Extract the tensor containing the token IDs from the tokenized output.</li>
<li>Pass the tokenized input through the model.</li>
<li>Extract the hidden states from the model’s output.</li>
</ol>


In [None]:
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
tokenized_dna = tokenizer(dna, return_tensors = 'pt')
inputs = tokenized_dna["input_ids"]
model_outputs = model(inputs)
hidden_states = model_outputs[0]

<p>The generated hidden states are the representations of the input sequence computed by the model. Here we take the first element of the model output, which comes from the last layer of the model. Note that for a model loaded with <code style="color: inherit">AutoModelForCausalLM</code>, this first element contains the scores produced by the output layer, with one value per token of the vocabulary for each position in the sequence. These values provide a richer representation of the sequence than the raw nucleotide string, capturing contextual information that can be used for tasks such as sequence similarity analysis, functional prediction, variant impact analysis, and more.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-21"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What is the shape of <code style="color: inherit">hidden_states</code>?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-21"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-21" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p><code style="color: inherit">[1, 17, 4096]</code>:</p>
<ul>
<li><code style="color: inherit">1</code>: number of sequences, here 1 DNA sequence</li>
<li><code style="color: inherit">17</code>: number of tokens (words), here the DNA sequence has been converted to 17 words that are longer than one nucleotide</li>
<li><code style="color: inherit">4096</code>: size of the vocabulary, the number of possible tokens</li>
</ul>
</details>
</blockquote>
<p>We would now like to summarize the hidden states into a single vector. To do so, we select the first (and only) sequence of the batch (<code class="language-plaintext highlighter-rouge">hidden_states[0]</code>) and compute the mean across its tokens:</p>


In [None]:
embedding_mean = torch.mean(hidden_states[0], dim=0)

<p><code class="language-plaintext highlighter-rouge">dim=0</code> indicates that the mean is calculated across the sequence length dimension. This averages the hidden states over all token positions in the sequence, resulting in a single vector that represents the entire sequence.</p>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-22"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What is the shape of <code style="color: inherit">embedding_mean</code>?</li>
<li>Which type of data is in <code style="color: inherit">embedding_mean</code>?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-22"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-22" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li><code style="color: inherit">4096</code>, the number of possible tokens.</li>
<li><code style="color: inherit">embedding_mean</code> is a vector of numerical values.</li>
</ol>
</details>
</blockquote>
<p><code style="color: inherit">embedding_mean</code> is a numerical vector of size 4,096 that represents the average embedding of the DNA sequence. This fixed-size representation can be used for various downstream tasks, such as classification, clustering, or similarity comparisons.</p>
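<p>As an illustration of a similarity comparison, the sketch below applies mean pooling to two toy sequences of different lengths and compares the resulting vectors with the cosine similarity. It is written in plain Python to keep the arithmetic visible: the helper functions <code style="color: inherit">mean_pool</code> and <code style="color: inherit">cosine_similarity</code> and all the numbers are made up for this example and do not come from the model.</p>

```python
import math


def mean_pool(token_vectors):
    """Average a list of per-token vectors into one sequence vector
    (the same operation as torch.mean(x, dim=0) on a 2-D tensor)."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same size."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up per-token vectors for two sequences (3 tokens and 2 tokens).
seq_a = [[1.0, 0.0, 2.0, 1.0], [3.0, 0.0, 0.0, 1.0], [2.0, 0.0, 1.0, 1.0]]
seq_b = [[0.0, 2.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0]]

emb_a = mean_pool(seq_a)
emb_b = mean_pool(seq_b)
print(emb_a)  # [2.0, 0.0, 1.0, 1.0]
print(emb_b)  # [0.0, 3.0, 0.0, 0.0]
print(cosine_similarity(emb_a, emb_b))  # 0.0
```

<p>Although the two sequences have different numbers of tokens, both are summarized by vectors of the same size, which is what makes them directly comparable.</p>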
<blockquote class="hands-on">
<div class="box-title hands-on-title" id="hands-on"><i class="fas fa-pencil-alt" aria-hidden="true" ></i> Hands On</div>
<p>Apply max pooling instead of mean pooling to summarize information along the DNA sequence.</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-23"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-23" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">embedding_max = torch.max(hidden_states[0], dim=0)[0]
</code></pre></div>    </div>
</details>
</blockquote>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-similar-process-to-chatgpt"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Similar process to ChatGPT</div>
<p>When you use a system like ChatGPT, the process involves converting your textual input, or “prompt,” into a numerical vector. This conversion is similar to the process we just did. Here’s how it works:</p>
<ul>
<li><strong>Input Prompt</strong>: You write a prompt, which is a textual query or statement.</li>
<li><strong>Tokenization</strong>: The prompt is tokenized, meaning it is broken down into smaller units, such as words or subwords, using a tokenizer.</li>
<li><strong>Vector Representation</strong>: These tokens are then converted into numerical vectors, or embeddings. These vectors capture the semantic meaning and context of the words in the prompt.</li>
<li><strong>Model Processing</strong>: The model processes these vectors to generate a response. The embeddings allow the model to understand the context and nuances of your input, enabling it to produce coherent and relevant responses.</li>
</ul>
<p>This process of converting text into numerical vectors is fundamental to how language models like ChatGPT operate, enabling them to interpret and generate human-like text based on the input they receive.</p>
</blockquote>
<h1 id="conclusion">Conclusion</h1>
<p>This tutorial provides a comprehensive guide to preparing, training, and utilizing a pre-trained language model for DNA sequence analysis. It begins by setting up the necessary resources, including installing dependencies, importing Python libraries, and configuring computational resources. The tutorial then walks through loading and choosing an appropriate model architecture for DNA sequences, followed by setting up a tokenizer to convert DNA sequences into numerical tokens. Data preparation involves loading, tokenizing, splitting, and collating DNA sequences to ensure efficient model training. The training process is detailed with parameter definitions and pretraining steps, culminating in the calculation of DNA sequence embeddings.</p>
<p>We can now leverage the pre-trained model in various bioinformatics applications, such as sequence similarity analysis and functional prediction, offering a robust foundation for integrative biological research.</p>


# Key Points

- Efficient Model Training: By leveraging parameter-efficient fine-tuning techniques and distributed training strategies, it is possible to train large language models on DNA sequences using consumer-grade hardware, making advanced bioinformatics research more accessible.
- Importance of Data Preparation: Properly tokenizing and organizing DNA sequence data is crucial for effective model training and evaluation, as it directly impacts the model's ability to learn and generalize from the data.
- Practical Applications of Embeddings: The embeddings generated by a trained language model capture rich contextual information about DNA sequences, enabling a wide range of downstream applications, from sequence classification to functional prediction in genomics research.

# Congratulations on successfully completing this tutorial!

Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-pretraining/tutorial.html#feedback) and check there for further resources!
