<div style="border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;">

# Generating Artificial Yeast DNA Sequences using a DNA LLM

by [Raphael Mourad](https://training.galaxyproject.org/hall-of-fame/raphaelmourad/), [Bérénice Batut](https://training.galaxyproject.org/hall-of-fame/bebatut/)

CC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)

**Questions**

- How do you set up a computational environment for generating synthetic DNA sequences using pre-trained language models?
- What role does the temperature parameter play in controlling the variability of generated DNA sequences?
- How can you compare generated synthetic DNA sequences with real genomic sequences using k-mer counts and PCA?
- What is the significance of performing BLAST searches on generated DNA sequences, and how do you interpret the results?
- How can you detect open reading frames (ORFs) in generated DNA sequences and translate them into amino acid sequences?

**Objectives**

- Describe the process of generating synthetic DNA sequences using pre-trained language models and explain the significance of temperature settings in controlling sequence variability.
- Set up a computational environment (e.g., Google Colab) and configure a pre-trained language model to generate synthetic DNA sequences, ensuring all necessary libraries are installed and configured.
- Use k-mer counts and Principal Component Analysis (PCA) to compare generated synthetic DNA sequences with real genomic sequences, identifying similarities and differences.
- Perform BLAST searches to assess the novelty of generated DNA sequences and interpret the results to determine the biological relevance and uniqueness of the synthetic sequences.
- Develop a pipeline to detect open reading frames (ORFs) within generated DNA sequences and translate them into amino acid sequences, demonstrating the potential for creating novel synthetic genes.

**Time Estimation: 3H**
</div>


<p>Generating synthetic DNA sequences using pre-trained language models bridges the fields of synthetic biology and artificial intelligence, enabling the creation of novel DNA sequences that closely mimic natural genomes. By leveraging the power of advanced language models, we can generate biologically relevant sequences that have the potential to revolutionize genetic engineering, drug discovery, and our understanding of genomic function.</p>
<p>Throughout this tutorial, we will learn how to set up a computational environment tailored for DNA sequence generation, configure pre-trained language models to produce synthetic sequences, and analyze the results to assess their biological significance. Here the aim is to generate DNA sequences similar to yeast, more specifically to <em>Saccharomyces cerevisiae</em>.</p>
<p>For this tutorial, we'll use a <a href="https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-138M-yeast">pre-trained language model</a> that was trained on 1,011 <em>Saccharomyces cerevisiae</em> isolates from Peter et al. (2018). This model has 138 million parameters and uses a mixture-of-experts architecture, making it efficient and powerful for generating DNA sequences.</p>


In [None]:
model_name = "RaphaelMourad/Mistral-DNA-v1-138M-yeast"

<blockquote class="agenda" style="border: 2px solid #86D486;display: none; margin: 1em 0.2em">
<div class="box-title agenda-title" id="agenda">Agenda</div>
<p>In this tutorial, we will cover:</p>
<ol id="markdown-toc">
<li><a href="#prepare-resources" id="markdown-toc-prepare-resources">Prepare resources</a>    <ol>
<li><a href="#install-dependencies" id="markdown-toc-install-dependencies">Install dependencies</a></li>
</ol>
</li>
</ol>
</blockquote>
<h1 id="prepare-resources">Prepare resources</h1>
<h2 id="install-dependencies">Install dependencies</h2>
<p>The first step is to install the required dependencies:</p>


In [None]:
!pip install Bio==1.7.1
!pip install orfipy
!pip install scikit-learn
!pip install transformers -U
!pip install torch==2.5.0

<h2 id="import-python-libraries">Import Python libraries</h2>
<p>Let's now import the libraries.</p>


In [None]:
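# Standard library
import itertools
import os
import random  # used to draw random start positions in the genome
from collections import defaultdict
from typing import List, Tuple

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import orfipy_core
import pandas as pd
import torch
import transformers  # imported as a module so that transformers.__version__ is available
from Bio import Seq, SeqIO
from sklearn.decomposition import PCA
from transformers import pipeline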

<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-versions"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Versions</div>
<p>This tutorial has been tested with the following versions:</p>
<ul>
<li><code style="color: inherit">numpy</code> &gt; 1.26.4</li>
<li><code style="color: inherit">transformers</code> &gt; 4.47.1</li>
</ul>
<p>You can check the versions with:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">np.__version__
transformers.__version__
</code></pre></div>  </div>
</blockquote>
<h2 id="check-and-configure-available-resources">Check and configure available resources</h2>
<p>Let's check the GPU usage and RAM:</p>


In [None]:
!nvidia-smi

<p>Let's configure PyTorch and the CUDA environment (the software and hardware ecosystem provided by NVIDIA to enable parallel computing on GPUs) to optimize GPU memory usage and performance:</p>
<ol>
<li>
<p>Enable cuDNN benchmarking in PyTorch:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"> torch.backends.cudnn.benchmark=True
</code></pre></div>    </div>
</li>
<li>
<p>Set an environment variable that configures how PyTorch manages CUDA memory allocations:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"> os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
</code></pre></div>    </div>
</li>
</ol>
<h1 id="generate-synthetic-dna-sequences">Generate Synthetic DNA Sequences</h1>
<p>Using the pre-trained language model which was trained on 1,011 <em>Saccharomyces cerevisiae</em> isolates, we would like to generate 100 synthetic yeast DNA sequences.</p>
<h2 id="build-the-sequence-generator">Build the Sequence Generator</h2>
<p>First, we need to set up the sequence generation pipeline. This pipeline will enable us to generate synthetic DNA sequences that mimic natural genomic sequences. By leveraging the power of language models, we can create novel DNA sequences for various applications in synthetic biology.</p>
<p>We use the <code style="color: inherit">pipeline</code> function from the <code style="color: inherit">transformers</code> library which simplifies the process of setting up a sequence generation pipeline. It abstracts away the complexities of model loading and configuration, allowing us to focus on generating sequences.</p>


In [None]:
generator = pipeline("text-generation", model=model_name)

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What does the parameter <code style="color: inherit">"text-generation"</code> specify?</li>
<li>What is a pipeline?</li>
<li>What are the different types of pipelines?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution"><button class="gtn-boxify-button solution" type="button" aria-controls="solution" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>
<p>It specifies that we are creating a pipeline for text generation, which is suitable for generating DNA sequences.</p>
</li>
<li>
<p>Pipelines are high-level abstractions that simplify the process of applying models to various NLP tasks.</p>
</li>
<li>
<p>Besides text-generation, there are several other types of pipelines designed for different applications. Here are some of the most commonly used pipelines:</p>
<ul>
<li><code style="color: inherit">feature-extraction</code>: Extracts features (embeddings) from text using a model. Useful for tasks that require text representation, such as clustering or similarity measurement.</li>
<li><code style="color: inherit">sentiment-analysis</code>: Determines the sentiment of a given text, classifying it as positive, negative, or neutral.</li>
<li><code style="color: inherit">text-classification</code>: Classifies text into predefined categories or labels, useful for tasks like spam detection, topic classification, etc.</li>
<li><code style="color: inherit">token-classification</code>: Assigns a label to each token in the input text, commonly used for Named Entity Recognition (NER) and Part-Of-Speech (POS) tagging.</li>
<li><code style="color: inherit">question-answering</code>: Answers questions based on a given context or passage of text. Useful for building Q&amp;A systems.</li>
<li><code style="color: inherit">fill-mask</code>: Predicts the missing word(s) in a sentence with a masked token, often used with models like BERT.</li>
<li><code style="color: inherit">summarization</code>: Generates a concise summary of a longer text, useful for creating abstracts or condensing information.</li>
<li><code style="color: inherit">translation</code>: Translates text from one language to another, supporting various language pairs.</li>
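<li><code style="color: inherit">conversational</code>: Engages in a conversation by generating responses based on input prompts, useful for building chatbots. This task may no longer be available as a separate pipeline in recent <code style="color: inherit">transformers</code> releases, so check the documentation of your installed version.</li>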
<li><code style="color: inherit">zero-shot-classification</code>: Classifies text into categories that were not seen during training, allowing for flexible and dynamic classification tasks.</li>
<li><code style="color: inherit">table-question-answering</code>: Answers questions based on structured data, such as tables, combining NLP with data querying capabilities.</li>
</ul>
<p>Each of these pipelines is designed to handle specific NLP tasks efficiently, leveraging pre-trained models to provide accurate and fast results. You can initialize these pipelines using the pipeline function from the transformers library, specifying the task you want to perform.</p>
</li>
</ol>
</details>
</blockquote>
<h2 id="generate-synthetic-dna-sequences">Generate Synthetic DNA Sequences</h2>
<p>Once the pipeline is set up, we can generate synthetic DNA sequences by calling the generator with appropriate parameters. We first need to specify the parameters for sequence generation:</p>
<ul>
<li>
<p>Maximum length of the generated sequence in terms of tokens (i.e. k-mers of 3 to 7 bases)</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">  max_length = 30
</code></pre></div>    </div>
</li>
<li>
<p>Number of sequences to generate</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">  num_sequences = 100
</code></pre></div>    </div>
</li>
<li>
<p>The temperature:</p>
<p>It controls the randomness of the generated sequences. A higher temperature results in more varied outputs, while a lower temperature produces more deterministic sequences.</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">  temperature = 0.1
</code></pre></div>    </div>
</li>
</ul>
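<p>To make the effect of the temperature more concrete, here is a minimal sketch of temperature-scaled softmax, the usual way sampling temperature is applied to a model's raw scores. The function name <code style="color: inherit">softmax_with_temperature</code> and the toy scores are assumptions made for illustration; they are not taken from the model used in this tutorial.</p>

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate tokens (illustrative values only)
toy_logits = [2.0, 1.0, 0.5, 0.1]
for t in (0.1, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

<p>With a low temperature almost all of the probability mass sits on the highest-scoring token, so repeated draws give nearly identical sequences. With a higher temperature the distribution flattens and the draws become more varied.</p>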
<p>Let's now generate the sequences:</p>


In [None]:
synthetic_dna = generator(
  "",
  max_length=max_length,
  do_sample=True,
  top_k=50,
  temperature=temperature,
  repetition_penalty=1.2,
  num_return_sequences=num_sequences,
  eos_token_id=0,
)

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-1"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What do the following parameters do?</p>
<ol>
<li><code style="color: inherit">""</code></li>
<li><code style="color: inherit">do_sample=True</code></li>
<li><code style="color: inherit">top_k=50</code></li>
<li><code style="color: inherit">repetition_penalty=1.2</code></li>
<li><code style="color: inherit">eos_token_id=0</code></li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-1"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-1" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>
<p>Starting prompt (empty for unconditional generation)</p>
</li>
<li><code style="color: inherit">do_sample=True</code>: When set to <code style="color: inherit">True</code>, the model uses sampling to generate sequences, which introduces randomness and variability in the output.</li>
<li><code style="color: inherit">top_k=50</code>: Limits the generated tokens to the top k most probable tokens. This helps in controlling the diversity of the output while ensuring biological relevance.</li>
<li><code style="color: inherit">repetition_penalty=1.2</code>: Penalizes the repetition of the same token in the generated sequence. A value greater than 1.0 discourages the model from repeating tokens, promoting diversity.</li>
<li><code style="color: inherit">eos_token_id=0</code>: Specifies the end-of-sequence token ID, which signals the model to stop generating further tokens. This is useful for controlling the termination of sequence generation.</li>
</ol>
</details>
</blockquote>
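<p>The two score adjustments described in the solution can be sketched on a toy list of scores. This is a simplified illustration, not the <code style="color: inherit">transformers</code> implementation: the helper names <code style="color: inherit">apply_repetition_penalty</code> and <code style="color: inherit">top_k_filter</code> and the toy values are assumptions. The penalty rule shown (divide positive scores, multiply negative ones) is the commonly used formulation.</p>

```python
import math


def apply_repetition_penalty(logits, previous_ids, penalty):
    """Lower the score of every token that has already been generated."""
    out = list(logits)
    for i in set(previous_ids):
        # Positive scores are divided and negative scores multiplied,
        # so in both cases the token becomes less likely.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def top_k_filter(logits, k):
    """Keep the k highest scores and mask the others with -inf (ties may keep more)."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


# Hypothetical scores for a five-token vocabulary; tokens 0 and 2 were already generated
toy_logits = [3.0, 2.5, -1.0, 0.2, 1.7]
penalised = apply_repetition_penalty(toy_logits, previous_ids=[0, 2], penalty=1.2)
filtered = top_k_filter(penalised, k=3)
print(penalised)
print(filtered)
```

<p>Tokens masked with <code style="color: inherit">-inf</code> receive zero probability once the scores are converted to probabilities, so sampling only ever picks among the remaining candidates.</p>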
<p>Let's extract the sequences:</p>


In [None]:
artificial_sequences=[]
for seq in synthetic_dna:
    artificial_sequences.append(seq["generated_text"].replace(" ", ""))

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-2"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>What are the lengths of the first 5 sequences?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-2"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-2" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">for i in range(5):
  print(len(artificial_sequences[i]))
</code></pre></div>    </div>
</details>
</blockquote>
<blockquote class="details" style="border: 2px solid #ddd; margin: 1em 0.2em">
<div class="box-title details-title" id="details-generate-random-sequences-from-a-sequence-seed"><button class="gtn-boxify-button details" type="button" aria-controls="details-generate-random-sequences-from-a-sequence-seed" aria-expanded="true"><i class="fas fa-info-circle" aria-hidden="true" ></i> <span>Details: Generate random sequences from a sequence seed</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p>To generate artificial DNA sequences starting with a defined substring, e.g. “TATA”, we replace <code style="color: inherit">""</code> in the <code style="color: inherit">generator</code> call with that substring:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">synthetic_dna = generator(
    "TATA",
    max_length=max_length,
    do_sample=True,
    top_k=50,
    temperature=0.4,
    repetition_penalty=1.2,
    num_return_sequences=100,
    eos_token_id=0,
)
artificial_sequences = [seq["generated_text"].replace(" ", "") for seq in synthetic_dna]
artificial_sequences[0:5]
</code></pre></div>  </div>
</blockquote>
<h2 id="compare-with-real-yeast-genome">Compare with Real Yeast Genome</h2>
<p>We would like to compare the generated sequences to real sequences from <em>Saccharomyces cerevisiae</em>.</p>
<p>Let's download the yeast genome assembly:</p>


In [None]:
!wget http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz
!gunzip sacCer3.fa.gz

<p>We would now like to extract 1,000 random sequences, each 100 bases long, from the downloaded genome:</p>


In [None]:
def extract_random_sequences(
    genome_file: str,
    seq_length: int = 100,
    num_seqs: int = 100,
) -> List[str]:
    """
    Extracts random sequences from a genome FASTA file.

    Parameters:
    - genome_file (str): Path to the genome FASTA file.
    - seq_length (int): Length of each sequence to extract.
    - num_seqs (int): Number of sequences to extract.

    Returns:
    - List[str]: A list of extracted sequences.
    """
    try:
        # Load genome sequences from the FASTA file
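        # Upper-case the bases so that any soft-masked (lower-case) regions
        # still match the upper-case "ACGT" k-mers counted later
        genome = [str(record.seq).upper() for record in SeqIO.parse(genome_file, "fasta")]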

        # Join all chromosomes or scaffolds into one large sequence
        genome_seq = "".join(genome)
        genome_size = len(genome_seq)

        if genome_size < seq_length:
            raise ValueError("Sequence length is larger than the genome size.")

        # List to store extracted sequences
        extracted_seqs = []

        # Extract random sequences
        for _ in range(num_seqs):
            start_pos = random.randint(0, genome_size - seq_length)  # Random start position
            sequence = genome_seq[start_pos:start_pos + seq_length]  # Extract sequence
            extracted_seqs.append(sequence)

        return extracted_seqs

    except Exception as e:
        print(f"An error occurred: {e}")
        return []

genome_file = "sacCer3.fa"  # Path to the yeast genome FASTA file
real_sequences = extract_random_sequences(genome_file, seq_length=100, num_seqs=1000)
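
<p>The function above concatenates all chromosomes before sampling, so a small fraction of the extracted windows can straddle the junction between two chromosomes. One possible variation, sketched here on a toy genome, picks a chromosome first and then a window inside it. The name <code style="color: inherit">sample_windows</code>, the <code style="color: inherit">seed</code> argument and the toy sequences are assumptions made for illustration.</p>

```python
import random


def sample_windows(chromosomes, seq_length, num_seqs, seed=None):
    """Sample fixed-length windows that never span two chromosomes."""
    rng = random.Random(seed)
    # Only chromosomes long enough to hold a full window are eligible
    eligible = {name: s for name, s in chromosomes.items() if len(s) >= seq_length}
    if not eligible:
        raise ValueError("No chromosome is at least seq_length long.")
    names = list(eligible)
    # Weight by the number of valid start positions so every window is equally likely
    weights = [len(eligible[name]) - seq_length + 1 for name in names]
    windows = []
    for _ in range(num_seqs):
        name = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randint(0, len(eligible[name]) - seq_length)
        # Upper-case the window so lower-case input is handled consistently
        windows.append(eligible[name][start:start + seq_length].upper())
    return windows


# Toy genome (hypothetical sequences); "chrC" is too short and is skipped
toy_genome = {"chrA": "ACGT" * 10, "chrB": "ttgaca" * 5, "chrC": "AC"}
windows = sample_windows(toy_genome, seq_length=8, num_seqs=5, seed=0)
print(windows)
```

<p>With a real genome, the dictionary could be built from the records returned by <code style="color: inherit">SeqIO.parse</code>, using each record's <code style="color: inherit">id</code> as the key.</p>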

<p>To compare synthetic and real DNA sequences, we utilize k-mer counts as a method to numerically describe and analyze DNA sequences. K-mers are short, overlapping subsequences of a fixed length \(k\) within a DNA sequence. We can think of k-mers as “words” within the DNA sequence. Just as paragraphs in text share common words, similar DNA sequences will share common k-mers. By counting the occurrences of these k-mers, we can transform DNA sequences into numerical vectors, allowing us to compare them quantitatively. <strong>Sequences that are very similar are expected to have similar k-mer counts</strong>, much like how similar texts share common vocabulary. By comparing the k-mer counts of synthetic and real DNA sequences, we can assess their similarity. If two sequences share many k-mers, it indicates that they are likely to be similar in composition and structure.</p>
<p>This approach leverages the power of k-mer analysis to provide insights into the similarity between synthetic and real DNA sequences, aiding in the validation and evaluation of synthetic biology techniques.</p>
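<p>As a small worked example of the idea, the sketch below counts overlapping 3-mers in two short toy sequences and lists the k-mers they share. The helper name <code style="color: inherit">count_kmers</code> and the two sequences are assumptions chosen for illustration.</p>

```python
from collections import Counter


def count_kmers(seq, k):
    """Count the overlapping k-mers of a single sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two toy sequences that differ only in their last base
toy_a = "ATGCATGC"
toy_b = "ATGCATGG"
counts_a = count_kmers(toy_a, k=3)
counts_b = count_kmers(toy_b, k=3)
shared = set(counts_a) & set(counts_b)
print(counts_a)
print(counts_b)
print(sorted(shared))
```

<p>A sequence of length \(n\) contains \(n - k + 1\) overlapping k-mers, so each 8-base toy sequence contributes six 3-mers. The single differing base changes only the last of them.</p>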


In [None]:
def generate_all_kmers(k: int) -> List[str]:
    """
    Generate all possible k-mers of a given length.

    Parameters:
    - k (int): Length of each k-mer.

    Returns:
    - List[str]: A list of all possible k-mers.
    """
    return ["".join(p) for p in itertools.product("ACGT", repeat=k)]

def kmer_counts_matrix(sequences: List[str], k: int = 6) -> Tuple[np.ndarray, List[str]]:
    """
    Compute k-mer counts for a list of sequences and return a count matrix.

    Parameters:
    - sequences (List[str]): List of DNA sequences.
    - k (int): Length of each k-mer. Default is 6.

    Returns:
    - Tuple[np.ndarray, List[str]]: A tuple containing the k-mer count matrix and the list of all possible k-mers.
    """
    all_kmers = generate_all_kmers(k)
    kmer_index = {kmer: idx for idx, kmer in enumerate(all_kmers)}  # Map k-mers to column indices
    matrix = np.zeros((len(sequences), len(all_kmers)), dtype=int)  # Initialize count matrix

    for seq_idx, seq in enumerate(sequences):
        if len(seq) < k:
            raise ValueError(f"Sequence length is less than {k} for sequence {seq_idx}: {seq}")

        kmer_dict = defaultdict(int)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i+k]
            kmer_dict[kmer] += 1

        # Fill in the matrix for this sequence
        for kmer, count in kmer_dict.items():
            if kmer in kmer_index:
                matrix[seq_idx, kmer_index[kmer]] = count

    return matrix, all_kmers

<p>Let's now count 4-mers for artificial and real sequences.</p>


In [None]:
k = 4
artificial_seq_kmer_counts, kmers = kmer_counts_matrix(artificial_sequences, k=k)
real_seq_kmer_counts, kmers = kmer_counts_matrix(real_sequences, k=k)

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-3"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>What is the format of <code style="color: inherit">artificial_seq_kmer_counts</code> and <code style="color: inherit">real_seq_kmer_counts</code>?</li>
<li>What are the dimensions?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-3"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-3" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
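<li><code style="color: inherit">artificial_seq_kmer_counts</code> and <code style="color: inherit">real_seq_kmer_counts</code> are 2-dimensional NumPy arrays (matrices) with one row per sequence and one column per possible k-mer.</li>
<li>100 x 256 for the artificial sequences and 1,000 x 256 for the real sequences (\(256 = 4^{4}\))</li>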
</ol>
</details>
</blockquote>
<p>To visualize and compare real and generated DNA sequences in a continuous space, we can utilize Principal Component Analysis (PCA). PCA is a powerful dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving as much variability as possible. This allows us to visualize the similarities and differences between the sequences effectively.</p>
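<p>The core idea of fitting a projection on one data set and then projecting a second data set into the same space can be sketched without any dependency. The code below estimates only the first principal component by power iteration, which is a simplification for illustration and not how scikit-learn computes PCA. The function names and toy data are assumptions.</p>

```python
import math


def first_principal_component(rows, n_iter=200):
    """Estimate the first principal component of rows by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Deterministic start vector; assumes it is not orthogonal to the component
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Project rows onto the component, centering with the means of the fitting set."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]


# Toy data: the reference points lie on the line y = 2x
reference = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
other = [[2.5, 5.0], [5.0, 10.0]]
means, component = first_principal_component(reference)
print(component)
print(project(reference, means, component))
print(project(other, means, component))
```

<p>Note that the second data set is centered with the means of the first one. This is what makes the two sets comparable in the same coordinate system.</p>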


In [None]:
def plot_pca_projection(set1: np.ndarray, set2: np.ndarray, title: str = "PCA Projection") -> None:
    """
    Fit PCA on set1 and project both set1 and set2 into the PCA space, then plot the results.

    Parameters:
    - set1 (np.ndarray): K-mer count matrix for the first set of sequences (used to fit PCA).
    - set2 (np.ndarray): K-mer count matrix for the second set of sequences (projected into PCA space).
    - title (str): Title of the plot.

    Returns:
    - None
    """
    # Fit PCA on set1 only
    pca = PCA(n_components=2)
    pca.fit(set1)

    # Project both set1 and set2 into the PCA space
    pca_set1 = pca.transform(set1)
    pca_set2 = pca.transform(set2)

    # Plot the first two principal components
    plt.figure(figsize=(8, 6))
    plt.scatter(
        pca_set1[:, 0],
        pca_set1[:, 1],
        color="blue",
        label="Real sequences",
        alpha=0.7,
    )
    plt.scatter(
        pca_set2[:, 0],
        pca_set2[:, 1],
        color="red",
        label="Artificial sequences",
        alpha=0.7,
    )

    # Add labels and legend
    plt.xlabel("Principal Component 1")
    plt.ylabel("Principal Component 2")
    plt.title(title)
    plt.legend()

    # Show the plot
    plt.show()


plot_pca_projection(real_seq_kmer_counts, artificial_seq_kmer_counts)

<p><a href="./images/pca_k_4_t_0_1.png" rel="noopener noreferrer"><img src="./images/pca_k_4_t_0_1.png" alt="Scatter plot titled 'PCA Projection of Set1 and Set2 (PCA built from Set1)' displaying the first two principal components. Blue dots represent 'Real sequences' and red dots represent 'Artificial sequences.' The x-axis is labeled 'Principal Component 1' and the y-axis is labeled 'Principal Component 2.' The plot shows a dense cluster of points near the origin with some outliers, indicating the distribution and overlap between real and artificial sequences in the PCA space." width="689" height="547" loading="lazy" /></a></p>
<blockquote class="hands-on">
<div class="box-title hands-on-title" id="hands-on"><i class="fas fa-pencil-alt" aria-hidden="true" ></i> Hands On</div>
<p>For different temperature values (\(0.001\), \(0.01\), \(0.1\), \(0.5\), \(1.0\), \(1.5\)):</p>
<ol>
<li>Generate synthetic DNA sequences</li>
<li>Observe the first 5 generated sequences</li>
<li>Generate the PCA plots</li>
<li>Compute the variance of k-mer counts between artificial sequences</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-4"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-4" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">
def generate_synthetic_dna(
    model_name: str,
    max_length: int,
    num_sequences: int,
    temp: float,
    top_k: int = 50,
    repetition_penalty: float = 1.2,
) -&gt; List[str]:
    """
    Generate synthetic DNA sequences using a pre-trained language model.

    Parameters:
    - model_name (str): Name or path of the pre-trained model.
    - max_length (int): Maximum length of each generated sequence.
    - num_sequences (int): Number of sequences to generate.
    - temp (float): Temperature value to control sequence variability.
    - top_k (int): Number of highest probability vocabulary tokens to consider. Default is 50.
    - repetition_penalty (float): Penalty for repeating the same token. Default is 1.2.

    Returns:
    - List[str]: List of generated DNA sequences.
    """
    # Initialize the sequence generation pipeline
    generator = pipeline("text-generation", model=model_name)
    # Generate synthetic DNA sequences
    synthetic_dna = generator(
       "",
       max_length=max_length,
       do_sample=True,
       top_k=top_k,
       temperature=temp,
       repetition_penalty=repetition_penalty,
       num_return_sequences=num_sequences,
       eos_token_id=0,
    )
    # Extract and clean the generated sequences
    artificial_sequences = [seq["generated_text"].replace(" ", "") for seq in synthetic_dna]
    return artificial_sequences


temperatures = [0.001, 0.01, 0.1, 0.5, 1.0, 1.5]
for temp in temperatures:
    art_sequences = generate_synthetic_dna(model_name, max_length,  num_sequences, temp)
    art_seq_kmer_counts, kmers = kmer_counts_matrix(art_sequences, k=k)
    plot_pca_projection(real_seq_kmer_counts, art_seq_kmer_counts)
    var = np.mean(np.var(art_seq_kmer_counts, axis=0))
    print(f"Variance for temperature of { temp }: { var }")
</code></pre></div>    </div>
</details>
</blockquote>
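<p>The variability measure used in the solution, <code style="color: inherit">np.mean(np.var(counts, axis=0))</code>, is the population variance of each k-mer column across sequences, averaged over all columns. The dependency-free sketch below reproduces that calculation on two tiny toy matrices. The function name <code style="color: inherit">mean_kmer_variance</code> and the toy counts are assumptions made for illustration.</p>

```python
from statistics import mean, pvariance


def mean_kmer_variance(count_rows):
    """Average, over k-mer columns, of the population variance across sequences."""
    columns = zip(*count_rows)  # transpose: one tuple per k-mer column
    return mean(pvariance(col) for col in columns)


# Toy count matrices: 3 sequences (rows) x 3 k-mers (columns)
identical = [[2, 0, 1], [2, 0, 1], [2, 0, 1]]
varied = [[2, 0, 1], [0, 3, 1], [1, 1, 1]]
print(mean_kmer_variance(identical))
print(mean_kmer_variance(varied))
```

<p>Identical sequences give a value of zero, and the value grows as the k-mer counts diverge between sequences, which is why it can serve as a simple summary of sequence variability.</p>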
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-4"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>How similar do the first 5 generated DNA sequences look at each temperature?</li>
<li>What are the different values for the variance?</li>
<li>What can we conclude?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-5"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-5" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The lower the temperature, the more similar the sequences.</li>
<li>The variances for the different temperatures are:
<ul>
<li>Variance \(=0.039\) for temperature \(= 0.001\)</li>
<li>Variance \(=0.219\) for temperature \(= 0.01\)</li>
<li>Variance \(=0.367\) for temperature \(= 0.1\)</li>
<li>Variance \(=0.404\) for temperature \(= 0.5\)</li>
<li>Variance \(=0.376\) for temperature \(= 1\)</li>
<li>Variance \(=0.377\) for temperature \(= 1.5\)</li>
</ul>
</li>
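<li>The variance rises sharply as the temperature increases from \(0.001\) to \(0.1\), then levels off at around \(0.37\) to \(0.40\) for temperatures of \(0.1\) and above. Higher temperatures therefore give more variable sequences, but the effect saturates.</li>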
</ol>
</details>
</blockquote>
<h2 id="checking-for-novelty-in-generated-dna-sequences-using-blast">Checking for Novelty in Generated DNA Sequences Using BLAST</h2>
<p>After generating synthetic DNA sequences, it's crucial to verify whether these sequences are truly novel or if they closely resemble existing sequences in biological databases. One effective method to accomplish this is by performing a BLAST (Basic Local Alignment Search Tool) search (Altschul et al., 1990). BLAST allows us to compare our generated sequences against a comprehensive database of known DNA sequences, helping us determine their uniqueness and potential biological relevance.</p>
<p>Let's start with a synthetic DNA sequence generated with a low temperature setting. Low temperature settings reduce variability, making the generated sequences more similar to real sequences. This step helps ensure that the sequence is biologically plausible and likely to have counterparts in existing databases.</p>
<blockquote class="hands-on">
<div class="box-title hands-on-title" id="hands-on-1"><i class="fas fa-pencil-alt" aria-hidden="true" ></i> Hands On</div>
<ol>
<li>Generate the synthetic DNA sequences with a temperature value of \(0.01\)</li>
<li>Get the first sequence</li>
<li>Make a <a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&amp;PAGE_TYPE=BlastSearch&amp;LINK_LOC=blasthome">Standard Nucleotide BLAST search</a> of the first sequence</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-6"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-6" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">
art_sequences = generate_synthetic_dna(model_name, max_length,  num_sequences, 0.01)
art_sequences[0]
</code></pre></div>    </div>
</details>
</blockquote>
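<p>The BLAST web form accepts either a bare sequence or a FASTA record. If you prefer to paste a labelled record, a small helper such as the one below can format it. The function name <code style="color: inherit">to_fasta</code>, the record name and the placeholder sequence are assumptions made for illustration. The sequence shown is not model output.</p>

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical placeholder sequence (80 bases), used only to show the layout
record = to_fasta("synthetic_seq_0", "ACGT" * 20)
print(record)
```

<p>In the notebook, the same helper could be called with <code style="color: inherit">art_sequences[0]</code> as the sequence.</p>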
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-5"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>How many significant similarities have been found?</li>
<li>Have we generated a new sequence?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-7"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-7" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The BLAST search does not return any significant match.</li>
<li>When the BLAST search returns no significant matches, it suggests that the synthetic sequence is novel and does not closely resemble any known sequences. This outcome is particularly interesting because it indicates that the sequence generation process has produced something entirely new.</li>
</ol>
</details>
</blockquote>
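<p>To make "significant" more concrete: BLAST reports an E-value for each hit, the number of alignments with at least that score expected by chance. In the Karlin-Altschul statistics used by BLAST, it relates to the bit score \(S'\) as \(E = m \cdot n \cdot 2^{-S'}\), where \(m\) and \(n\) are the (effective) query and database lengths. The sketch below is illustrative only: the function name and the numbers are hypothetical, and BLAST applies length corrections, so the values it reports will differ somewhat.</p>

```python
def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate BLAST E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: a 100-nt query searched against 1 Gb of sequence
query_length = 100
database_length = 10**9

for bit_score in (20, 40, 60):
    e_value = expected_chance_hits(bit_score, query_length, database_length)
    print(f"bit score {bit_score}: E-value ~ {e_value:.2e}")
```

<p>Low-scoring hits are expected many times by chance in a large database, which is why only hits with small E-values count as significant.</p>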
<p>When generating sequences with low temperature, we observed limited variability among the sequences. This lack of variability can be useful for ensuring consistency but may limit the exploration of novel sequence spaces. To introduce more variability and potentially discover even more novel sequences, we can increase the temperature setting during sequence generation. Higher temperatures introduce greater randomness, leading to more diverse and potentially unique sequences.</p>
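<p>As a rough illustration of why temperature controls variability: language models typically sample the next token from a softmax over scores (logits) divided by the temperature. The sketch below uses made-up logits for four hypothetical tokens and is not taken from the model used in this tutorial, whose vocabulary and sampling code may differ.</p>

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into sampling probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - largest) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical logits for four candidate tokens
logits = [2.0, 1.0, 0.5, 0.1]

for temperature in (0.01, 1.0, 2.0):
    probabilities = softmax_with_temperature(logits, temperature)
    formatted = ", ".join(f"{p:.3f}" for p in probabilities)
    print(f"temperature {temperature}: [{formatted}]")
```

<p>At a temperature of 0.01 almost all probability mass sits on the top-scoring token, so repeated runs give nearly identical sequences; at higher temperatures the distribution flattens and lower-scoring tokens are sampled more often.</p>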
<blockquote class="hands-on">
<div class="box-title hands-on-title" id="hands-on-2"><i class="fas fa-pencil-alt" aria-hidden="true" ></i> Hands On</div>
<ol>
<li>Generate the synthetic DNA sequences with a temperature value of \(1\)</li>
<li>Get the first sequence</li>
<li>Make a <a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&amp;PAGE_TYPE=BlastSearch&amp;LINK_LOC=blasthome">Standard Nucleotide BLAST search</a> of the first sequence</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-8"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-8" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">
art_sequences = generate_synthetic_dna(model_name, max_length, num_sequences, 1)
art_sequences[0]
</code></pre></div>    </div>
</details>
</blockquote>
<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-6"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<ol>
<li>How many significant similarities have been found?</li>
<li>Have we generated a new sequence?</li>
</ol>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-9"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-9" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<ol>
<li>The BLAST search does not return any significant match.</li>
<li>When the BLAST search returns no significant matches, it suggests that the synthetic sequence is novel and does not closely resemble any known sequences. This outcome is particularly interesting because it indicates that the sequence generation process has produced something entirely new.</li>
</ol>
</details>
</blockquote>
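<p>The difference in variability between the two temperature settings can also be quantified. A minimal sketch, assuming each run gives a list of plain DNA strings: the helper names and the toy sequences are hypothetical, and the similarity ratio from <code style="color: inherit">difflib</code> is a quick stand-in for a proper alignment-based identity.</p>

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def mean_pairwise_similarity(sequences: List[str]) -> float:
    """Average difflib similarity ratio (0 to 1) over all pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 1.0
    ratios = [SequenceMatcher(None, a, b, autojunk=False).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)


def describe_variability(sequences: List[str]) -> dict:
    """Count distinct sequences and summarise how similar they are to each other."""
    return {
        "n_sequences": len(sequences),
        "n_unique": len(set(sequences)),
        "mean_similarity": round(mean_pairwise_similarity(sequences), 3),
    }


# Toy stand-ins for a low-temperature and a high-temperature run
low_temperature_run = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA"]
high_temperature_run = ["ACGTACGTAC", "TTGACCGATA", "GGCATTCAGT"]

print(describe_variability(low_temperature_run))
print(describe_variability(high_temperature_run))
```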
<p>Even if the generated sequences are novel, novelty alone does not guarantee functionality, so further analysis may be necessary to assess the potential applications of these sequences.</p>
<h1 id="exploring-synthetic-gene-generation">Exploring Synthetic Gene Generation</h1>
<p>Having successfully generated and analyzed synthetic DNA sequences, the next intriguing question is: Can we extend this approach to generate entire synthetic genes? This endeavor requires producing longer sequences that mimic the complexity and functionality of natural genes. The mean length of yeast genes is several hundred bases. To effectively mimic natural genes, our synthetic sequences must match this length while maintaining biological plausibility. Generating longer sequences is not merely about increasing length; it's about ensuring that these sequences contain the necessary genetic elements, such as promoters, coding regions, and regulatory sequences, that are essential for gene function.</p>
<p>To start, we generate 10 sequences with <code style="color: inherit">max_length = 400</code>:</p>


In [None]:
synthetic_dna = generate_synthetic_dna(model_name, 400, 10, 1)

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-7"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>How long are the first 5 sequences?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-10"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-10" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">[len(s) for s in synthetic_dna[:5]]
</code></pre></div>    </div>
</details>
</blockquote>
<h2 id="detecting-open-reading-frames-orfs">Detecting Open Reading Frames (ORFs)</h2>
<p>To determine if the generated sequences could contain genes, we examine them for the presence of key genetic elements, such as Open Reading Frames (ORFs).</p>
<p>An ORF is a portion of a DNA sequence that begins with a start codon (e.g., ATG) and ends with a stop codon (e.g., TAA, TAG, TGA). Identifying ORFs is essential for determining the potential functionality of synthetic DNA sequences, as ORFs can indicate whether a sequence could encode a protein.</p>
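<p>To make this definition concrete, here is a deliberately simplified, dependency-free sketch that scans the three forward reading frames for stretches running from ATG to the next in-frame stop codon. It is an illustration only, not how <code style="color: inherit">orfipy_core</code> works internally: the function name is hypothetical, the reverse strand is ignored, and conventions such as whether the stop codon is included in the reported coordinates may differ between tools.</p>

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_forward_orfs(sequence: str, min_len: int = 30) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence: ATG AAA TTT GGG TAA embedded in reading frame 2
print(find_forward_orfs("CCATGAAATTTGGGTAACC", min_len=9))  # [(2, 17, 2)]
```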
<p>To detect ORFs in sequences, we use <code style="color: inherit">orfipy_core</code>:</p>


In [None]:
def detect_orfs(
    sequence: str,
    minlen: int = 100,
    maxlen: int = 1000,
) -> List[Tuple[int, int, str, str]]:
    """
    Detect ORFs in a given DNA sequence.

    Parameters:
    - sequence (str): The DNA sequence to analyze.
    - minlen (int): Minimum length of ORFs to detect. Default is 100.
    - maxlen (int): Maximum length of ORFs to detect. Default is 1000.

    Returns:
    - List[Tuple[int, int, str, str]]: List of ORFs with start, stop positions, strand, and description.
    """
    orfs = []
    for start, stop, strand, description in orfipy_core.orfs(sequence, minlen=minlen, maxlen=maxlen):
        orfs.append((start, stop, strand, description))
        print(f"Start: {start}, Stop: {stop}, Strand: {strand}, Description: {description}")
    return orfs

<p>Let's extract the ORFs for the first DNA sequence:</p>


In [None]:
seq = synthetic_dna[0]  # first of the longer sequences generated above
orfs = detect_orfs(seq)

<blockquote class="question" style="border: 2px solid #8A9AD0; margin: 1em 0.2em">
<div class="box-title question-title" id="question-8"><i class="far fa-question-circle" aria-hidden="true" ></i> Question</div>
<p>How many ORFs have been detected for the first DNA sequence?</p>
<br/><details style="border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;"><summary>👁 View solution</summary>
<div class="box-title solution-title" id="solution-11"><button class="gtn-boxify-button solution" type="button" aria-controls="solution-11" aria-expanded="true"><i class="far fa-eye" aria-hidden="true" ></i> <span>Solution</span><span class="fold-unfold fa fa-minus-square"></span></button></div>
<p><code style="color: inherit">len(orfs)</code></p>
</details>
</blockquote>
<h2 id="extracting-and-translating-detected-open-reading-frames-orfs">Extracting and Translating Detected Open Reading Frames (ORFs)</h2>
<p>After detecting Open Reading Frames (ORFs) in our synthetic DNA sequences, the next step is to extract these ORFs and translate them into protein sequences. This process allows us to explore the potential protein products encoded by our synthetic genes, providing insights into their possible functions and biological roles.</p>
<p>To extract and translate the ORFs, we need to:</p>
<ol>
<li>Extract the ORF Sequence: Using the start and stop positions identified by the <code style="color: inherit">detect_orfs</code> function, we can extract the corresponding DNA sequence for each ORF. This sequence represents the potential protein-coding region within the synthetic DNA.</li>
<li>Translate the ORF: We can convert the DNA sequence of the ORF into its corresponding amino acid sequence. This step reveals the potential protein product encoded by the ORF.</li>
</ol>


In [None]:
def extract_and_translate_orf(
    sequence: str,
    orfs: List[Tuple[int, int, str, str]],
    orf_index: int = 0,
) -> str:
    """
    Extract an ORF from a DNA sequence and translate it into a protein sequence.

    Parameters:
    - sequence (str): The DNA sequence containing the ORF.
    - orfs (List[Tuple[int, int, str, str]]): List of detected ORFs with start, stop positions, strand, and description.
    - orf_index (int): Index of the ORF to extract and translate. Default is 0.

    Returns:
    - str: The translated protein sequence.
    """
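    # Extract the ORF sequence using the start and stop positions
    start, stop, strand, description = orfs[orf_index]
    orf_sequence = Seq.Seq(sequence[start:stop])

    # ORFs on the reverse strand must be reverse-complemented before translation
    # (assumes the coordinates are reported relative to the forward strand)
    if strand == "-":
        orf_sequence = orf_sequence.reverse_complement()

    # Translate the ORF sequence into a protein sequence
    protein_sequence = str(orf_sequence.translate())

    return protein_sequence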

<p>For the first detected ORF:</p>


In [None]:
protein_seq = extract_and_translate_orf(seq, orfs)
print(f"Translated Protein Sequence: {protein_seq}")
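
<p>To show what extraction and translation involve under the hood, here is a dependency-free sketch using the standard genetic code. The helper names are hypothetical, and the handling of the reverse strand assumes that ORF coordinates are given relative to the forward strand; in practice the Biopython <code style="color: inherit">Seq</code> methods used above do the same job.</p>

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore trailing bases that do not fill a codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def translate_orf(sequence: str, start: int, stop: int, strand: str = "+") -> str:
    """Slice the ORF, reverse-complement it if it lies on the '-' strand, translate."""
    orf = sequence[start:stop]
    if strand == "-":
        orf = reverse_complement(orf)
    return translate(orf)


# Toy examples: the same ORF (ATG GCC TAA) on the forward and on the reverse strand
print(translate_orf("CCATGGCCTAACC", 2, 11))       # MA*
print(translate_orf("GGTTAGGCCATGG", 2, 11, "-"))  # MA*
```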

<p>After translating a synthetic DNA sequence into a protein sequence, the next step is to analyze it for potential biological function. This involves identifying known protein domains or motifs that may indicate specific functions, such as binding sites or active regions, for example with InterProScan. The biological plausibility of the translated sequence can be checked by comparing it to known protein sequences in databases such as UniProt or GenBank, for example with DIAMOND. Examining the conservation of the sequence across different species can also provide insights into its functional importance. If the sequence shows potential for biological function, it can serve as a starting point for further research, including structural modeling to predict its three-dimensional structure, functional assays to test its activity, and evolutionary studies to understand its origins and adaptations. Together, these analyses improve our understanding of the generated sequences and point to possible applications in synthetic biology and genetic engineering.</p>
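<p>Before submitting a translated sequence to external tools, a few quick checks can flag obviously implausible candidates. A minimal sketch with a hypothetical helper and a made-up protein string; these checks are heuristics only and say nothing about actual function.</p>

```python
from collections import Counter


def summarise_protein(protein: str) -> dict:
    """Basic sanity checks on a translated sequence ('*' marks a stop codon)."""
    body = protein.rstrip("*")  # drop a terminal stop, if present
    return {
        "length": len(body),
        "starts_with_met": body.startswith("M"),
        "internal_stops": body.count("*"),
        "top_residues": Counter(body.replace("*", "")).most_common(3),
    }


# Made-up example, not an output of the model
print(summarise_protein("MKKLA*"))
```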
<h1 id="conclusion">Conclusion</h1>
<p>Throughout this tutorial, we have explored the process of generating synthetic DNA sequences using a DNA Language Model (LLM), focusing on creating sequences that mimic the complexity and functionality of natural yeast genomes. We began by building a sequence generator, leveraging pre-trained models to produce synthetic DNA sequences with controlled variability.</p>
<p>We compared these synthetic sequences to real yeast genomes, using k-mer counts and Principal Component Analysis (PCA) to visualize and assess their similarities and differences. This comparative analysis showed how well our generated sequences aligned with their natural counterparts, highlighting both the strengths and areas for improvement in our approach.</p>
<p>To ensure the novelty of our generated sequences, we conducted BLAST searches, confirming that many of our synthetic sequences were indeed unique and did not closely match existing sequences in biological databases. This step was crucial in validating the potential of our model to produce truly innovative DNA sequences.</p>
<p>Expanding our exploration, we delved into the realm of synthetic gene generation. By detecting Open Reading Frames (ORFs) within our synthetic sequences, we identified potential protein-coding regions that could be translated into amino acid sequences. This process allowed us to analyze the biological relevance and potential functions of the proteins encoded by our synthetic genes, opening avenues for further research and application.</p>
<p>In conclusion, this tutorial has demonstrated the power and potential of DNA LLMs in generating synthetic DNA sequences and exploring synthetic gene generation. By combining advanced computational techniques with biological insights, we have shown how to create novel sequences that not only mimic natural genomes but also offer new possibilities for synthetic biology and genetic engineering. As we continue to refine these methods, the future holds exciting prospects for innovation and discovery in the field of genomics.</p>


# Key Points

- DNA Language Models (LLMs) are effective tools for generating synthetic DNA sequences that mimic natural genomes, offering a powerful approach for exploring and innovating in synthetic biology.
- Adjusting parameters like temperature allows for controlling the variability of generated sequences, enabling the creation of both biologically plausible and novel DNA sequences.
- Comparing synthetic sequences with real genomes and using tools like BLAST to check for novelty are essential steps in validating the uniqueness and potential biological relevance of generated sequences.
- Detecting and translating Open Reading Frames (ORFs) in synthetic sequences provides insights into potential protein functions, paving the way for further research in genetic engineering and synthetic biology applications.

# Congratulations on successfully completing this tutorial!

Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-sequence-generation/tutorial.html#feedback) and check there for further resources!
