{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences\n\nby [Raphael Mourad](https://training.galaxyproject.org/hall-of-fame/raphaelmourad/), [Bérénice Batut](https://training.galaxyproject.org/hall-of-fame/bebatut/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Questions**\n\n- How to load and configure a pre-trained language model for DNA sequence analysis?\n- What is the process for tokenizing DNA sequences to prepare them for model training?\n- How to split and organize a DNA sequence dataset for effective model training and evaluation?\n- What are the key hyperparameters to consider when pretraining a language model on DNA sequences, and how to configure them?\n- How to use a trained language model to generate and interpret embeddings for DNA sequences?\n\n**Objectives**\n\n- Identify and load a pre-trained language model (LLM) suitable for DNA sequence analysis.\n- Explain the role of a tokenizer in converting DNA sequences into numerical tokens for model processing.\n- Prepare and tokenize DNA sequence datasets for model training and evaluation.\n- Configure and implement data collation to organize tokenized data into batches for efficient training.\n- Define and configure hyperparameters for pretraining a model, such as learning rate and batch size.\n- Monitor and evaluate the model's performance during training to ensure effective learning.\n- Use the trained model to generate embeddings for DNA sequences and interpret these embeddings for downstream bioinformatics applications.\n- Develop a complete workflow for training a language model on DNA sequences, from data preparation to model evaluation, and apply it to real-world bioinformatics tasks.\n\n**Time Estimation: 3H**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "

Generative Artificial Intelligence (AI) represents a cutting-edge domain within machine learning, focused on creating new, synthetic yet realistic data. This includes generating text, images, music, and even biological sequences. At the heart of many generative AI applications are Large Language Models (LLMs), which have revolutionized natural language processing and beyond.

\n

LLMs are sophisticated neural networks trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on Transformers, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery.

\n
\n
\n

Transformers are a type of neural network model designed to handle sequential data, such as text, by using self-attention mechanisms to weigh the importance of input elements relative to each other, enabling the model to understand and generate coherent and contextually relevant outputs.

\n
\n

In this tutorial, we will explore the intersection of generative AI and genomics by pretraining an LLM from scratch on DNA sequences. This process will equip the model with a foundational understanding of the “grammar” of DNA, enabling it to generate and analyze genetic data with remarkable accuracy.

\n

Mistral AI, a French artificial intelligence (AI) startup, recently released large language models (LLMs) whose performance surpasses Llama 2. In particular, Mixtral-8x7B implements a Sparse Mixture of Experts (SMoE) architecture, among other efficiency-oriented techniques.

\n\n

These techniques collectively enhance the performance and efficiency of large language models, enabling them to process and generate text more effectively.

\n

In this tutorial, we will use a simplified Mistral model architecture with fewer layers and hidden units to reduce computational requirements. The model will be trained to predict the next base in the sequence. For instance, for a sequence like ATTTGTTGGT, the model will be trained to predict the suffix TTGGT given the prefix ATTTG. This process is called causal language modeling.

\n
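As a minimal illustration of this objective (plain Python, independent of any library), causal language modeling pairs every prefix of a sequence with the next symbol to predict:

# Sketch: each prefix of the sequence is paired with the next base as the target.
sequence = 'ATTTGTTGGT'
for i in range(1, len(sequence)):
    prefix, next_base = sequence[:i], sequence[i]
    print('given', prefix, '-> predict', next_base)

In practice the model works on tokens (6-mers) rather than single bases, but the training objective is the same.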

To pretrain the model, we will use a file containing 100,000 non-overlapping DNA sequences of 200 bases, corresponding to around 1% of the human genome (hg38 assembly). This involves training the model to predict the end of a DNA sequence.

\n
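A quick back-of-the-envelope check of what this dataset represents (the genome size used here is an approximation):

# 100,000 sequences x 200 bases each
total_bases = 100_000 * 200
print(total_bases)                 # 20,000,000 bases (20 Mb)
print(100 * total_bases / 3.1e9)   # ~0.6%, i.e. on the order of 1% of the ~3.1 Gb human genome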

By the end of this tutorial, we will obtain a Mistral-DNA model with an internal representation of DNA sequence grammar. This pretrained model can then be used for various applications, such as fine-tuning for classification tasks or predicting mutational effects.

\n
\n
Agenda
\n

In this tutorial, we will cover:

\n
  1. Prepare resources
     1. Install dependencies
     2. Import Python libraries
     3. Check and configure available resources
  2. Prepare the model
     1. Load the model
     2. Choose the LLM architecture
  3. Prepare the tokenizer
  4. Prepare data
     1. Load data
     2. Tokenize data
     3. Split data
     4. Data Collation
  5. Train the model
     1. Define parameters for pretraining
     2. Pretrain the model
  6. Compute the embedding of a DNA sequence
  7. Conclusion
\n
\n

Prepare resources

\n

To pretrain the model, let’s open a Notebook or a Python script.

\n

Install dependencies

\n

The first step is to install the required dependencies:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "!pip install accelerate\n", "!pip install datasets==3.0.1\n", "!pip install transformers\n", "!pip install torch\n", "!pip install flash-attn" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-2", "source": "
\n
Question
\n

What are the required dependencies doing?

\n
👁 View solution\n
\n
• accelerate: simplifies running PyTorch training on different hardware setups (CPU, one or several GPUs) and handles mixed-precision training.
• datasets: loads and processes datasets (here the CSV file of DNA sequences) in a memory-efficient way.
• transformers: provides the Hugging Face model architectures, tokenizers, and the Trainer API.
• torch: PyTorch, the deep learning framework used to build and train the model.
• flash-attn: an optimized implementation of the attention mechanism that reduces memory usage and speeds up training on supported GPUs.
\n
\n
\n

Import Python libraries

\n

Let’s now import them.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "import os\n", "\n", "import accelerate\n", "import flash_attn\n", "import torch\n", "import transformers\n", "from datasets import load_dataset\n", "from transformers import (\n", " AutoConfig,\n", " AutoModelForCausalLM,\n", " AutoTokenizer,\n", " DataCollatorForLanguageModeling,\n", " EarlyStoppingCallback,\n", " Trainer,\n", " TrainingArguments,\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-4", "source": "
\n
\n
• os: interact with the operating system (environment variables, changing directories).
• accelerate and flash_attn: imported to make sure the optimized training and attention backends are available.
• torch: PyTorch tensors and utilities (e.g. dataset splitting).
• load_dataset (from datasets): load the DNA sequence file as a Hugging Face dataset.
• AutoConfig, AutoModelForCausalLM, AutoTokenizer: load a model configuration, instantiate a causal language model, and load a tokenizer, respectively.
• DataCollatorForLanguageModeling: build padded batches (and labels) for language modeling.
• EarlyStoppingCallback: stop training when the evaluation loss stops improving.
• Trainer and TrainingArguments: the Hugging Face training loop and its hyperparameters.
\n

These components work together to streamline the process of training and fine-tuning transformer models for various NLP tasks.

\n
\n
\n
Comment: Versions
\n

This tutorial has been tested with the following versions:

\n\n

You can check the versions with:

\n
accelerate.__version__\nflash_attn.__version__\ntransformers.__version__\n
\n
\n

Check and configure available resources

\n

To pretrain the model, we need specific computational resources, in particular a GPU with enough memory.

\n\n

Let’s check the resources:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "!nvidia-smi" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-6", "source": "

The command nvidia-smi (NVIDIA System Management Interface) is used to monitor and manage NVIDIA GPU devices. It provides information about the GPU’s utilization, memory usage, temperature, and running processes. This tool is essential for developers and researchers to track the performance and health of GPUs, especially when running computationally intensive tasks like machine learning training.

\n
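The same information can also be queried from Python with torch.cuda (a small sketch, handy inside a notebook):

import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('GPU:', torch.cuda.get_device_name(device))
    total_mib = torch.cuda.get_device_properties(device).total_memory / 1024**2
    print('Total memory (MiB):', round(total_mib))
else:
    print('No GPU detected: training would fall back to CPU and be much slower.')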
\n
Question
\n

How do you interpret the following output?

\n
Tue Mar 25 13:49:35 2025\n+-----------------------------------------------------------------------------------------+\n| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |\n|-----------------------------------------+------------------------+----------------------+\n| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |\n|                                         |                        |               MIG M. |\n|=========================================+========================+======================|\n|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |\n| N/A   40C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |\n|                                         |                        |                  N/A |\n+-----------------------------------------+------------------------+----------------------+\n\n+-----------------------------------------------------------------------------------------+\n| Processes:                                                                              |\n|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |\n|        ID   ID                                                               Usage      |\n|=========================================================================================|\n|  No running processes found                                                             |\n+-----------------------------------------------------------------------------------------+\n
\n
👁 View solution\n
\n
The output shows a single NVIDIA Tesla T4 GPU (driver version 550.54.15, CUDA 12.4) with about 15 GiB of memory (15,360 MiB), of which only 2 MiB are currently used. GPU utilization is at 0% and no processes are running, so the full GPU is available for training.
\n
\n
\n

Let’s configure PyTorch and the CUDA environment – software and hardware ecosystem provided by NVIDIA to enable parallel computing on GPU – to optimize GPU memory usage and performance:

\n
    \n
  1. \n

    Enables CuDNN benchmarking in PyTorch:

    \n
     torch.backends.cudnn.benchmark=True\n
    \n
    \n
    Question
    \n
      \n
    1. What is CuDNN?
    2. \n
    3. Why enable benchmarking?
    4. \n
    \n
    👁 View solution\n
    \n
      \n
    1. CuDNN is a GPU-accelerated library for deep neural networks.
    2. \n
    3. Enabling benchmarking allows CuDNN to select the fastest algorithms for the specific GPU and input size. This can improve the performance of the model, especially for fixed-size inputs.
    4. \n
    \n
    \n\n
  2. \n
  3. \n

    Set an environment variable that configures how PyTorch manages CUDA memory allocations (the two settings are combined in a short snippet after this list):

    \n
     os.environ[\"PYTORCH_CUDA_ALLOC_CONF\"] = \"max_split_size_mb:32\"\n
    \n
    \n
    Question
    \n

    What is this command doing?

    \n
    👁 View solution\n
    \n

    It sets the maximum split size for memory allocations to 32 megabytes. This can help reduce memory fragmentation and improve memory utilization, which is particularly useful when working with large models or limited GPU memory.

    \n
    \n\n
  4. \n
\n
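Putting the two settings above together, the configuration step of a pretraining script might look like the following sketch (the max_split_size_mb value is a tuning choice, not a requirement):

# Sketch: GPU-related configuration applied before training starts.
import os
import torch

torch.backends.cudnn.benchmark = True                              # let CuDNN pick the fastest kernels
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'     # limit the allocation split size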

Prepare the model

\n

Load the model

\n

Let’s now load the model, Mistral-DNA. The Mixtral model (Mixtral-8x7B-v0.1), a pretrained generative Sparse Mixture of Experts that outperforms Llama 2 70B, was modified to significantly reduce the number of parameters, mostly by removing layers, so that it could be trained on a single GPU such as an RTX 3090.

\n

We will get the model from GitHub:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "!git clone https://github.com/raphaelmourad/Mistral-DNA.git" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-8", "source": "

Let’s check if we have the model now:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "!ls" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-10", "source": "

We should get two folders: Mistral-DNA and sample_data. Let’s change the current working directory to Mistral-DNA/:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "os.chdir(\"Mistral-DNA/\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-12", "source": "

Choose the LLM architecture

\n

Let’s look at the original architecture of Mixtral-8x7B-v0.1, which is stored in the data/models/Mixtral-8x7B-v0.1 folder (GitHub).

\n
\n
Question
\n
    \n
  1. Which file is essential for configuring the language model?
  2. \n
  3. What are the key parameters of the simplified architecture used here?
  4. \n
\n
👁 View solution\n
\n
    \n
  1. The config.json file is essential for configuring the language model as a Mistral model. It specifies the architecture for causal language modeling (MixtralForCausalLM) and details the size of the neural network components. The original Mistral model has a larger hidden size, but it is reduced here to make pre-training feasible.
  2. \n
  3. The key parameters are:\n
      \n
    • Intermediate Size (intermediate_size): Size of the intermediate (or hidden) layers within the model. It determines the number of neurons in these layers, influencing the model’s capacity to capture complex patterns in the data. A larger intermediate size can capture more nuanced details but also requires more computational resources. Set to 256, which is relatively small compared to the original model.
    • \n
    • Number of Attention Heads (num_attention_heads): Number of attention heads in the multi-head attention mechanism. Each head allows the model to focus on different parts of the input sequence simultaneously, capturing diverse aspects of the data. More attention heads can provide a richer representation but also increase computational complexity. Reduced to 8 for efficiency.
    • \n
    • Number of Experts per token (num_experts_per_tok): Specific to models that use a Mixture of Experts (MoE) architecture. It indicates the number of expert networks that are activated for each token in the input sequence. Experts are specialized sub-networks that handle different parts of the data, improving efficiency and performance, especially for large models. Set to 1 expert per token.
    • \n
    • Number of Local Experts (num_local_experts): Number of local experts available in the model. Local experts are a subset of the total experts and are used to process specific parts of the input data. This localization can help in managing computational resources more effectively, especially when dealing with large-scale data. Set to 64.
    • \n
    • Vocabulary Size (vocab_size): Specifically designed for DNA sequences, with a size of \\(4,096 = 4^6\\), as DNA consists of four possible letters (A, T, C, and G) and the words are 6-mers (sequences of six nucleotides). By modeling DNA using 6-mers, we capture meaningful patterns within the genetic sequence, enabling the model to understand and generate DNA data effectively.
    • \n
    \n
  4. \n
\n
\n
\n

Let’s load the configuration of the pre-trained model:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "config = AutoConfig.from_pretrained(\"data/models/Mixtral-8x7B-v0.1\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-14", "source": "

By loading the configuration, we can inspect or modify the model’s architecture without loading the actual model weights. Let’s now initialize a causal language model from the loaded configuration object, with a specific attention implementation:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "model = AutoModelForCausalLM.from_config(config, attn_implementation=\"eager\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-16", "source": "
\n
Question
\n

What does attn_implementation=\"eager\" do?

\n
👁 View solution\n
\n

attn_implementation=\"eager\" specifies the attention implementation to use. Setting it to “eager” means that the attention mechanism will be executed eagerly, which can be useful for debugging or when working with dynamic computation graphs. Eager execution runs operations immediately as they are called in Python, rather than adding them to a graph for later execution.

\n
\n
\n

What does the model look like?

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "model" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-18", "source": "
MixtralForCausalLM(\n  (model): MixtralModel(\n    (embed_tokens): Embedding(4096, 256)\n    (layers): ModuleList(\n      (0-7): 8 x MixtralDecoderLayer(\n        (self_attn): MixtralAttention(\n          (q_proj): Linear(in_features=256, out_features=256, bias=False)\n          (k_proj): Linear(in_features=256, out_features=256, bias=False)\n          (v_proj): Linear(in_features=256, out_features=256, bias=False)\n          (o_proj): Linear(in_features=256, out_features=256, bias=False)\n          (rotary_emb): MixtralRotaryEmbedding()\n        )\n        (block_sparse_moe): MixtralSparseMoeBlock(\n          (gate): Linear(in_features=256, out_features=64, bias=False)\n          (experts): ModuleList(\n            (0-63): 64 x MixtralBlockSparseTop2MLP(\n              (w1): Linear(in_features=256, out_features=256, bias=False)\n              (w2): Linear(in_features=256, out_features=256, bias=False)\n              (w3): Linear(in_features=256, out_features=256, bias=False)\n              (act_fn): SiLU()\n            )\n          )\n        )\n        (input_layernorm): MixtralRMSNorm((256,), eps=1e-05)\n        (post_attention_layernorm): MixtralRMSNorm((256,), eps=1e-05)\n      )\n    )\n    (norm): MixtralRMSNorm((256,), eps=1e-05)\n  )\n  (lm_head): Linear(in_features=256, out_features=4096, bias=False)\n)\n
\n

As expected, the model is a MixtralForCausalLM model with several key components:

\n
    \n
  1. \n

    Embedding Layer (embed_tokens): Converts input DNA sequences into dense vectors of fixed size. It maps each of the 4,096 (\\(4^{6}\\)) possible DNA tokens (representing 6-mers) to a 256-dimensional vector space. This embedding layer is crucial for transforming discrete DNA sequences into a format suitable for neural network processing.

    \n
  2. \n
  3. Decoder Layers (layers): Consists of eight MixtralDecoderLayer modules, each containing several sub-components:\n
    • Self-Attention (self_attn): multi-head attention with query, key, value, and output projections (q_proj, k_proj, v_proj, o_proj) and rotary position embeddings.
    • Sparse MoE Block (block_sparse_moe): a gating layer that routes each token to the expert feed-forward networks (64 experts per layer).
    • Layer Normalizations (input_layernorm, post_attention_layernorm): RMS normalization applied before the attention block and before the MoE block, stabilizing training.\n
  4. \n
  5. \n

    Final Layer Normalization (norm): Applies normalization to the output of the final decoder layer, ensuring stable and consistent outputs.

    \n
  6. \n
  7. Language Model Head (lm_head): Projects the 256-dimensional output of the final decoder layer back into the 4,096-dimensional vocabulary space of DNA tokens. This linear layer (Linear) maps the hidden states to the original token space, enabling the model to predict the next DNA token accurately.
  8. \n
\n

This architecture ensures that the model can capture complex patterns in DNA sequences while maintaining computational efficiency, making it suitable for tasks like DNA sequence generation and analysis. The model’s design culminates in the output of 4,096 tokens, aligning with the input dimension. This consistency is crucial for accurately predicting the next token in a given DNA sequence, ensuring that the model’s predictions are coherent and reliable.

\n
\n
Question
\n

How many parameters are in this model?

\n
👁 View solution\n
\n
pytorch_total_params = sum(p.numel() for p in model.parameters())\nprint(f\"Model size: {pytorch_total_params/1000**2:.1f}M parameters\")\n
\n

There are about 105 million parameters; even this reduced architecture is a big model.

\n
\n\n

Prepare the tokenizer

\n

A tokenizer is a crucial component in natural language processing (NLP) that transforms raw text into a format that can be processed by machine learning models. In this section, we will load and configure the Byte-Pair Encoding (BPE) letter tokenizer. The BPE tokenizer efficiently handles rare and unknown words by breaking them down into frequent subword units, ensuring that the model can generalize better to unseen data. This process involves initializing the tokenizer with a predefined vocabulary and settings, enabling it to convert text into a format suitable for neural network processing. By doing so, we prepare the tokenizer to effectively manage DNA sequences, facilitating accurate and reliable model predictions.

\n

Let’s load a pre-trained tokenizer from the Hugging Face Model Hub. The tokenizer is associated with the model DNABERT-2-117M, which is designed for processing DNA sequences.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "tokenizer = AutoTokenizer.from_pretrained(\"zhihan1996/DNABERT-2-117M\", trust_remote_code=True)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-20", "source": "
\n
Question
\n

What does the above command do?

\n
👁 View solution\n
\n
It downloads (on first use) and instantiates the tokenizer associated with the DNABERT-2-117M model from the Hugging Face Model Hub. The trust_remote_code=True argument allows the custom code shipped with that repository to be executed, which is required for models that define their own tokenizer or architecture classes.
\n
\n
\n

Let’s look at the created tokenizer now:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "print(tokenizer)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-22", "source": "
PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M', vocab_size=4096, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={\n\t0: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t1: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t2: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t3: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t4: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n}\n)\n
\n

The PreTrainedTokenizerFast is a fast and efficient tokenizer used to process text data for the DNABERT-2-117M model. Here’s a breakdown of its configuration:

\n
• name_or_path='zhihan1996/DNABERT-2-117M': the Hugging Face repository the tokenizer was loaded from.
• vocab_size=4096: the number of entries in the vocabulary, matching the model’s DNA subword vocabulary.
• special_tokens: special markers such as [CLS] (start of sequence), [SEP] (separator/end), [PAD] (padding), [UNK] (unknown token), and [MASK] (used for masked language modeling).
\n
\n
Question
\n

What do the other configuration parameters mean?

\n
    \n
  1. model_max_length=1000000000000000019884624838656
  2. \n
  3. is_fast=True
  4. \n
  5. padding_side='right'
  6. \n
  7. truncation_side='right'
  8. \n
  9. clean_up_tokenization_spaces=False
  10. \n
  11. added_tokens_decoder
  12. \n
\n
👁 View solution\n
\n
    \n
  1. \n

    model_max_length=1000000000000000019884624838656: Represents the maximum length of sequences that the model can handle.

    \n

    This extremely large value suggests that the model is designed to process very long sequences, although in practice, the actual limit will be constrained by available computational resources.

    \n
  2. \n
  3. is_fast=True: Indicates that this tokenizer is optimized for speed, leveraging Rust-based implementations to accelerate tokenization processes.
  4. \n
  5. padding_side='right': Configures the tokenizer to pad sequences on the right side, ensuring that all sequences in a batch have the same length by adding padding tokens to the end of shorter sequences.
  6. \n
  7. truncation_side='right': Specifies that sequences will be truncated from the right side if they exceed the maximum length, preserving the beginning of the sequence.
  8. \n
  9. clean_up_tokenization_spaces=False: Indicates that the tokenizer will not remove spaces after tokenization, preserving the original spacing in the text.
  10. \n
  11. added_tokens_decoder: Maps token IDs to their corresponding AddedToken objects, which include metadata such as whether the token is a special token and how it should be processed (e.g., stripping whitespace).
  12. \n
\n
\n\n

This configuration ensures that the tokenizer is tailored to efficiently process DNA sequences, handling both the tokenization and padding/truncation of sequences in a manner that aligns with the model’s requirements.

\n

By default, tokenizers may pad sequences on the right side (padding_side='right'). Let’s set the padding direction for the tokenizer.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "tokenizer.padding_side = \"left\"" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-24", "source": "

When tokenizing a batch of sequences, shorter sequences will be padded with special tokens on the left to match the length of the longest sequence in the batch. This can be useful for ensuring consistent input sizes, especially in models that expect fixed-size inputs.

\n
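A small sketch to see the effect: tokenizing two sequences of different lengths in the same batch shows the left padding and the corresponding zeros in the attention mask.

batch = tokenizer(['ATTGTGGGTCCC', 'ATT'], padding='longest', return_tensors='pt')
print(batch['input_ids'])        # the shorter sequence gets pad tokens on the left
print(batch['attention_mask'])   # 0 marks padding positions, 1 marks real tokens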

Let’s look at how some DNA sequences are encoded by the tokenizer. We start with a simple sequence “ATT”:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "encoding = tokenizer(\"ATT\", padding=\"longest\", return_tensors=\"pt\")\n", "print(encoding)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-26", "source": "

The code tokenizes the DNA sequence “ATT”, pads it to the longest sequence in the batch (padding=\"longest\"), and returns the result as PyTorch tensors (return_tensors=\"pt\").

\n
{'input_ids': tensor([[   1, 2061,    2]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}\n
\n

Here’s a breakdown of each output component:

\n
• input_ids: the numerical IDs of the tokens; here [1, 2061, 2] corresponds to the [CLS] token, the token encoding the sequence, and the [SEP] token.
• token_type_ids: the segment each token belongs to; all 0 here because there is a single segment.
• attention_mask: which tokens the model should attend to (1 for real tokens, 0 for padding); all 1 here because no padding was needed.
\n

This encoded format is ready for input into a transformer model, ensuring that the sequence is correctly processed and understood by the model.

\n
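To see how the tokenizer segmented the sequence, the IDs can be mapped back to their string tokens (a sketch; the exact segmentation depends on the learned BPE vocabulary):

print(tokenizer.convert_ids_to_tokens(encoding['input_ids'][0].tolist()))
# likely something like ['[CLS]', 'ATT', '[SEP]'] for this short sequence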
\n
Question
\n

What is the encoding for “ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC”? Specify that the tokenized sequence should have a maximum length of 5 tokens and ensure that the sequence is padded to the specified max_length of 5 tokens.

\n
👁 View solution\n
\n\n
encoding = tokenizer(\"ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC\", max_length=5, padding='max_length', truncation=True, return_tensors=\"pt\")\nprint(encoding)\n
\n
{'input_ids': tensor([[   1, 2061,  281,  485,    2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}\n
\n

In this case, [1, 2061, 281, 485, 2] represents the tokens for the sequence, likely including special tokens like [CLS] and [SEP]. As before, all tokens are of type 0, suggesting a single segment, and are valid, so the mask is [1, 1, 1, 1, 1].

\n
\n
\n

Prepare data

\n

We will now prepare the data.

\n

Load data

\n

First, we load the data. We will not use the whole human genome here because it comprises too many sequences. Instead, we use a small subset, corresponding to less than 1% of the sequences from the human genome.

\n
\n
Comment: Pre-trained model on the whole human genome
\n

A compact DNA model with approximately 1 million parameters that has been trained on the entire human genome can be found on Hugging Face.

\n
\n

We use the load_dataset function from the datasets library. This function is commonly used for loading data for Hugging Face Transformers.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "dataset_text = load_dataset(\"csv\", data_files=\"data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-28", "source": "
\n
Question
\n
    \n
  1. How is dataset_text structured?
  2. \n
  3. What are the first 5 sequences in the train dataset?
  4. \n
  5. How long are the sequences?
  6. \n
\n
👁 View solution\n
\n
    \n
  1. dataset_text is a DatasetDict with a single train Dataset containing one feature ('text') and 99,999 rows (shown by evaluating dataset_text)
  2. \n
  3. \n

    To get the first 5 sequences of the train dataset:

    \n
    dataset_text['train']['text'][0:5]\n
    \n
    ['TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAA',\n'CCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCC',\n'TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCT',\n'GAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGA',\n'CACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC']\n
    \n
  4. \n
  5. \n

    The sequences are 200 base pairs long:

    \n
    len(dataset_text['train']['text'][0])\n
    \n
    200\n
    \n
  6. \n
\n
\n
\n

Tokenize data

\n

Let’s tokenize the data. First, we create a function that tokenizes a text using the BPE letter tokenizer:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "def tokenize_function(examples):\n", " return tokenizer(examples['text'], padding=\"longest\", truncation=True, return_tensors=\"pt\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-30", "source": "
\n
Question
\n

What do the following parameters do?

\n
    \n
  1. padding=\"longest\"
  2. \n
  3. truncation=True
  4. \n
  5. return_tensors=\"pt\"
  6. \n
\n
👁 View solution\n
\n
    \n
  1. padding=\"longest\" ensures that all sequences in the batch are padded to the length of the longest sequence, adding padding tokens as needed.
  2. \n
  3. truncation=True specifies that sequences exceeding the model’s maximum length will be truncated to fit.
  4. \n
  5. return_tensors=\"pt\" indicates that the output should be in the form of PyTorch tensors, suitable for use with PyTorch-based models.
  6. \n
\n
\n
\n

We can now apply this function to the loaded dataset:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "dataset = dataset_text.map(tokenize_function, batched=True)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-32", "source": "

It is quite fast for the almost 100,000 sequences of length 200 bp.

\n
\n
Question
\n
    \n
  1. How is dataset structured?
  2. \n
  3. What is in the first tokenized sequence of train Dataset?
  4. \n
\n
👁 View solution\n
\n
    \n
  1. dataset is\n
    DatasetDict({\n    train: Dataset({\n        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],\n        num_rows: 99999\n    })\n})\n
    \n

    dataset is a DatasetDict with 1 train Dataset made of 99,999 rows and 4 features:

    \n
      \n
    • text: The original text data before tokenization.
    • \n
    • input_ids: The tokenized input data, represented as numerical IDs.
    • \n
    • token_type_ids: Indicates the type of each token, useful for models that handle multiple segments.
    • \n
    • attention_mask: Specifies which tokens should be attended to by the model (1 for real tokens, 0 for padding).
    • \n
    \n
  2. \n
  3. A tokenized sequence from the train Dataset (e.g. dataset[\"train\"][1]) is a dictionary with:\n
      \n
    • text: 200 base pair sequence
    • \n
    • input_ids: list of 49 numerical values, the token IDs.
    • \n
    • token_type_ids: a list of 49 zeros
    • \n
    • attention_mask: a list with 7 zeros (padding tokens) and 42 ones (real tokens)
    • \n
    \n
  4. \n
\n
\n
\n

Split data

\n

We will now split data between training and validation sets randomly. This is a crucial step in machine learning to ensure the model can generalize to unseen data.

\n

For that, 80% of the entire data will be used for the training set and the remaining 20% will go into the validation set. We first compute the size of training and validation sets:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "train_size = int(0.8 * len(dataset[\"train\"]))\n", "val_size = len(dataset[\"train\"]) - train_size" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-34", "source": "
\n
Question
\n

How big are training and validation sets?

\n
👁 View solution\n
\n

Training set has 79,999 sequences and the validation set 20,000.

\n
\n
\n

To perform the actual splitting of the training dataset into two subsets, we use the torch.utils.data.random_split function from the PyTorch library that randomly splits a dataset into subsets.
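Note that random_split can also take a seeded generator, which makes the split reproducible across runs. This is an optional variation, not part of the tutorial’s own code (which performs the unseeded split in the next cell):

# Reproducible alternative: fix the random seed used for the split.
generator = torch.Generator().manual_seed(42)
train_set, val_set = torch.utils.data.random_split(
    dataset['train'], [train_size, val_size], generator=generator
)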

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "train_set, val_set = torch.utils.data.random_split(dataset[\"train\"], [train_size, val_size])" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-36", "source": "

Data Collation

\n

The DataCollatorForLanguageModeling is a utility class, designed to prepare and format batches of data for language modeling tasks. It handles the dynamic padding and masking of input sequences, ensuring that each batch fed into the model is correctly formatted and optimized for training.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-38", "source": "
\n
Question
\n

What are the different parameters?

\n
👁 View solution\n
\n
tokenizer=tokenizer gives the collator access to the tokenizer, so it knows the padding token and how to pad batches. mlm=False specifies causal language modeling (next-token prediction) rather than masked language modeling: the labels are a copy of the input IDs, with padding positions ignored.
\n
\n
\n

This collator:

\n
    \n
  1. Automatically pads sequences within a batch to ensure they are of equal length, which is necessary for efficient batch processing in neural networks.
  2. \n
  3. Generates attention masks that indicate which tokens should be attended to by the model, ignoring padding tokens.
  4. \n
  5. Collates individual examples into batches, handling the necessary formatting and ensuring compatibility with the model’s input requirements.
  6. \n
\n

The DataCollatorForLanguageModeling is typically used in conjunction with a Trainer from the Hugging Face library. It simplifies the data preparation process, allowing you to focus on model training and evaluation without worrying about the intricacies of batch formatting.

\n
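To make the collator’s role concrete, here is a small sketch applying it to two tokenized sequences (the returned keys may also include token_type_ids, depending on the tokenizer):

examples = [tokenizer(seq) for seq in ['ATTGTGGGTCCC', 'ATT']]
batch = data_collator(examples)
print(batch.keys())      # input_ids, attention_mask, labels, ...
print(batch['labels'])   # with mlm=False, labels copy input_ids; padding positions become -100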

Train the model

\n

Define parameters for pretraining

\n

We are now going to define the hyperparameters and configuration for training the language model using the Hugging Face transformers library.

\n

First, we specify the batch size for training and evaluation. A batch size of 32 means that 32 samples will be processed before the model updates its weights. This size is chosen to balance computational efficiency and memory usage.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "batchsize=32\n", "training_args = TrainingArguments(\n", " output_dir=\"./results/models\",\n", " eval_strategy=\"epoch\",\n", " save_strategy=\"epoch\",\n", " num_train_epochs=50,\n", " per_device_train_batch_size=batchsize,\n", " per_device_eval_batch_size=batchsize,\n", " learning_rate=5e-4,\n", " weight_decay=0.01,\n", " logging_dir=\"./logs\",\n", " load_best_model_at_end=True,\n", " bf16=True,\n", " gradient_accumulation_steps=50,\n", " report_to=\"none\",\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-40", "source": "\n
\n
Question
\n

What is stored in training_args: the parameters to the model, the parameter of the LLM or the parameters of the trainer function?

\n
👁 View solution\n
\n

The parameters of the trainer function

\n
\n
\n
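One consequence of combining the per-device batch size with gradient accumulation is worth spelling out: weights are only updated after gradients have been accumulated over several batches, so the effective batch size per weight update is their product.

effective_batch_size = (training_args.per_device_train_batch_size
                        * training_args.gradient_accumulation_steps)
print(effective_batch_size)   # 32 * 50 = 1,600 sequences per weight update (per device)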

Pretrain the model

\n

Here is the most important part: the pre-training process. For this, we will use the Trainer class. It takes as input the model that we built previously, which has the architecture defined above but only randomly initialized (untrained) weights.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " data_collator=data_collator,\n", " train_dataset=train_set,\n", " eval_dataset=val_set,\n", " callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-42", "source": "

The Trainer function also takes:

\n
• args: the training_args defined above (number of epochs, batch sizes, learning rate, etc.).
• data_collator: the collator that pads batches and builds the labels.
• train_dataset and eval_dataset: the training and validation sets created earlier.
• callbacks: an EarlyStoppingCallback that stops training if the evaluation loss does not improve for 3 consecutive evaluations.
\n

Let’s launch the training with the trainer.train() method:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-43", "source": [ "trainer.train()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-44", "source": "

Here, the trainer is set to run for 50 epochs. Shortly after training starts, we get an estimate of the time per epoch, which gives an idea of the total training duration. Let’s run it for a bit to see how long it takes.

\n

With this small model and dataset, the estimated time to run 50 epochs is around 20 hours (this value varies depending on the infrastructure).

\n
\n
Question
\n

Will the model be trained to 50 epochs?

\n
👁 View solution\n
\n

Setting the number of epochs to 50 doesn’t mean the model will train for all 50 epochs. It is likely to stop earlier.

\n
\n
\n

The 50 epochs serve as a maximum limit. Thanks to the early stopping callback, training will stop earlier if the validation loss reaches a minimum and then starts to increase again. This means the model might only need about half the epochs, perhaps 25 epochs or roughly 10 hours, to reach its best performance.

\n
\n
Comment: Don't train until the end
\n

The idea here is not to train the model until completion, as it would take too much time.

\n
\n

Let’s stop the actual training and cheat a bit by loading a previously trained Mistral model:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "model = AutoModelForCausalLM.from_pretrained(\"RaphaelMourad/Mistral-DNA-v1-17M-hg38\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-46", "source": "

This is a Mixtral-based model that was pre-trained on the entire human genome. It contains approximately 17 million parameters and was trained using the Human Genome assembly GRCh38. Unlike the model we started pretraining above on sequences of 200 bases, this model was pre-trained on sequences of 10,000 bases (10 kb). The advantage of this model is its ability to process larger DNA contexts or sequences, which allows it to capture more extensive patterns and dependencies within the genomic data.

\n
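We can re-use the parameter count from earlier to check the size of this downloaded model (its name suggests roughly 17 million parameters):

n_params = sum(p.numel() for p in model.parameters())
print(f'Model size: {n_params / 1000**2:.1f}M parameters')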
\n
Question
\n

By looking at the output of:

\n
model\n
\n
    \n
  1. How many transformer layers does this model have?
  2. \n
  3. Is it similar to previous model?
  4. \n
\n
👁 View solution\n
\n
    \n
  1. 8 transformer layers
  2. \n
  3. Yes
  4. \n
\n
\n
\n

Compute the embedding of a DNA sequence

\n

With this kind of model, we can convert a DNA sequence into a numerical vector (an embedding).

\n

Let’s:

\n
    \n
  1. Take a DNA sequence
  2. \n
  3. Tokenize the DNA sequence using the tokenizer created before
  4. \n
  5. Extract the tensor containing the token IDs from the tokenized output
  6. \n
  7. Pass the tokenized input through the model.
  8. \n
  9. Extract the hidden states from the model’s output.
  10. \n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "dna = \"ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC\"\n", "tokenized_dna = tokenizer(dna, return_tensors = 'pt')\n", "inputs = tokenized_dna[\"input_ids\"]\n", "model_outputs = model(inputs)\n", "hidden_states = model_outputs[0]" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-48", "source": "

The generated hidden states are the internal representations of the input sequence computed by the model. Here we look at the output of the last layer. It captures contextual information about the sequence and provides a richer representation than the raw nucleotide string, which can be used for tasks such as sequence similarity analysis, functional prediction, variant impact analysis, and more.

\n
\n
Question
\n

What is the shape of hidden_states?

\n
👁 View solution\n
\n

[1, 17, 4096]:

\n
• 1: the batch size (a single sequence was given to the model).
• 17: the number of tokens the tokenizer produced for this sequence (including special tokens).
• 4096: the size of the output vector for each token position, matching the vocabulary size.
\n
\n
\n

We would now like to average the hidden states along the sequence-length dimension, for the first (and only) sequence in the batch (hidden_states[0]):

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "embedding_mean = torch.mean(hidden_states[0], dim=0)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-50", "source": "

dim=0 indicates that the mean is calculated across the sequence length dimension. This effectively averages the hidden states for each token position in the sequence, resulting in a single vector that represents the entire sequence.

\n
\n
Question
\n
    \n
  1. What is the shape of embedding_mean?
  2. \n
  3. Which type of data is in embedding_mean?
  4. \n
\n
👁 View solution\n
\n
    \n
  1. 4096, the number of possible tokens.
  2. \n
  3. embedding_mean is a vector of numerical values.
  4. \n
\n
\n
\n

embedding_mean is a numerical vector of size 4,096 that represents the average embedding of the DNA sequence. This fixed-size representation can be used for various downstream tasks, such as classification, clustering, or similarity comparisons.

\n
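As an example of a downstream use, two sequences can be compared through the cosine similarity of their mean embeddings. The embed helper below simply repeats the steps above; it is an illustrative sketch, not part of the tutorial code:

def embed(seq):
    # Mean-pooled embedding of a DNA sequence, following the steps above.
    ids = tokenizer(seq, return_tensors='pt')['input_ids']
    with torch.no_grad():
        hidden = model(ids)[0]
    return torch.mean(hidden[0], dim=0)

similarity = torch.nn.functional.cosine_similarity(embed(dna), embed('ACGT' * 15), dim=0)
print(float(similarity))   # values closer to 1 indicate more similar embeddings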
\n
Hands On
\n

Apply a max pooling instead of a mean pooling to summarize information along the DNA sequence.

\n
👁 View solution\n
\n
embedding_max = torch.max(hidden_states[0], dim=0)[0]\n
\n
\n
torch.max returns a named tuple of (values, indices); taking [0] keeps only the maximum values, giving a 4,096-dimensional vector comparable to the mean-pooled one.
\n
\n
Comment: Similar process to ChatGPT
\n

When you use a system like ChatGPT, the process involves converting your textual input, or “prompt,” into a numerical vector. This conversion is similar to the process we just did. Here’s how it works:

\n
1. Tokenization: your prompt is split into tokens, much like the DNA sequence was split into subwords.
2. Numerical encoding: each token is mapped to an ID and then to an embedding vector.
3. Model processing: the model transforms these vectors through its layers into contextual representations, from which it generates its answer token by token.
\n

This process of converting text into numerical vectors is fundamental to how language models like ChatGPT operate, enabling them to interpret and generate human-like text based on the input they receive.

\n
\n

Conclusion

\n

This tutorial provides a comprehensive guide to preparing, training, and utilizing a pre-trained language model for DNA sequence analysis. It begins by setting up the necessary resources, including installing dependencies, importing Python libraries, and configuring computational resources. The tutorial then walks through loading and choosing an appropriate model architecture for DNA sequences, followed by setting up a tokenizer to convert DNA sequences into numerical tokens. Data preparation involves loading, tokenizing, splitting, and collating DNA sequences to ensure efficient model training. The training process is detailed with parameter definitions and pretraining steps, culminating in the calculation of DNA sequence embeddings.

\n

We can now leverage the pre-trained model in various bioinformatics applications, such as sequence similarity analysis and functional prediction, offering a robust foundation for integrative biological research.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Efficient Model Training: By leveraging parameter-efficient fine-tuning techniques and distributed training strategies, it is possible to train large language models on DNA sequences using consumer-grade hardware, making advanced bioinformatics research more accessible.\n", "- Importance of Data Preparation: Properly tokenizing and organizing DNA sequence data is crucial for effective model training and evaluation, as it directly impacts the model's ability to learn and generalize from the data.\n", "- Practical Applications of Embeddings: The embeddings generated by a trained language model capture rich contextual information about DNA sequences, enabling a wide range of downstream applications, from sequence classification to functional prediction in genomics research.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-pretraining/tutorial.html#feedback) and check there for further resources!\n" ] } ] }