Clean and manage Sanger sequences from raw files to aligned consensus
Author(s): Coline Royaux
Overview
Questions:
- How to clean Sanger sequencing files?
Objectives:
- Learn how to manage sequencing files (AB1, FASTQ, FASTA)
- Learn how to clean your Sanger sequences in an automated and reproducible way
Time estimation: 1 hour
Published: Jan 8, 2024
Last modification: Mar 5, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00383
Revision: 2
The objective of this tutorial is to learn how to clean and manage AB1 data files freshly obtained from Sanger sequencing. Sanger sequencing targets a specific region using short single-stranded DNA fragments called primers, which delimit the ends of the targeted marker. Usually, one gets two .ab1 files for each sample, representing the sense (forward) and the antisense (reverse) strands of DNA.
Here, we’ll be using raw data from “AOPEP variants as a novel cause of recessive dystonia: Generalized dystonia and dystonia-parkinsonism” 2022. In this article, two DNA markers are investigated: CHD8 (Chromodomain-helicase-DNA-binding protein 8) and AOPEP (Aminopeptidase O Putative). We’ll focus on the CHD8 sequences, but you can try applying the same steps to the AOPEP sequences to practice after the tutorial!
In the first section of the tutorial, we’ll prepare the primer data by:
- selecting the right primer sequences with their identifiers;
- removing any gaps included in the sequences;
- and computing the reverse-complement sequence for the antisense primer only.
In the second section of the tutorial, we’ll prepare the Sanger sequence data by:
- extracting the ab1 files of the sequence of interest (CHD8) and separating sense and antisense sequences into two distinct data collections;
- converting ab1 files to FASTQ so that they can be used by the following tools;
- trimming low-quality ends of the sequences;
- computing the reverse-complement for the antisense sequences only;
- aligning sense and antisense sequences;
- obtaining a consensus sequence (the result of matching the nucleotides of the sense and antisense sequences) for each of the three samples.
In the third section of the tutorial, primers and all consensus sequences are finally merged into a single file to be aligned and verified.
Consider a double-stranded DNA molecule with the following sequences:
When sequencing, each strand of DNA is read separately in the 5’-3’ orientation. Hence, in the sequence files, the strands are provided as:
To get the antisense sequence in its original orientation, the reverse sequence is computed:
To align the sense and antisense sequences, the complement of the reversed antisense sequence is computed:
The two sequences can now be aligned:
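The reverse and complement operations described above can be sketched in a few lines of Python. This is only an illustration: the 9-nucleotide sequence is hypothetical and the function names are assumptions, not part of any Galaxy tool.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq: str) -> str:
    """Return the sequence read in the opposite direction."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Return the complementary nucleotide for every position."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Reverse the sequence, then complement it."""
    return complement(reverse(seq))


# Hypothetical sense strand, written 5'-3'.
sense = "ATGCCGTTA"
# The antisense strand as it would appear in a sequence file (also 5'-3').
antisense_read = reverse_complement(sense)
# Reverse-complementing the antisense read makes it comparable to the sense strand.
aligned_antisense = reverse_complement(antisense_read)
print(sense, antisense_read, aligned_antisense)
```

Applying the reverse-complement twice returns the starting sequence, which is why the antisense read can be aligned to the sense read after this step.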
Agenda
In this tutorial, we will cover:
- Get data
- Prepare primer data
- Prepare sequence data
- Unzip data files
- Filter collection to separate sense and antisense sequence files
- Convert AB1 sequence files to FASTQ and trim low-quality ends
- Compute reverse complement sequence for antisense (reverse) sequences only
- Merge corresponding sense and antisense sequences into single files
- Convert FASTQ files to FASTA
- Align sequences and retrieve consensus for each sequence
- Manage primers and sequences
- Conclusion
- AOPEP Sanger files
Get data
The authors of “AOPEP variants as a novel cause of recessive dystonia: Generalized dystonia and dystonia-parkinsonism” 2022 have openly shared their raw AB1 files on Zenodo.
Hands-on: Data Upload
- Create a new history for this tutorial
- Import the files from Zenodo:
https://zenodo.org/records/7104640/files/AOPEP_and_CHD8_sequences_20220907.zip
To import the files:
- Copy the link location
- Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
- Paste the link(s) into the text field
- Change Type (set all): from “Auto-detect” to zip
- Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
- “Select history”: the history you want to import the data to (or create a new one)
- Click on Import
- Create the primer FASTA file. Copy:
>Forward_CHD8
GAGGTGAAAGAATCATAAATTGG
>Reverse_CHD8
CCCTGTGTACAAATAGCTTTTGT
>Forward_AOPEP
TCATGGTTCCAGGCAGAGTTATT
>Reverse_AOPEP
TGCTGTGACAAGCCAACCAATGG
- Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
- Select Paste/Fetch Data
- Paste into the text field
- Change Type (set all): from “Auto-detect” to fasta
- Change the name from “New File” to “Primer file”
- Click Start
Note that these primer sequences were invented for the purpose of the tutorial; they are not the sequences used in the publication.
Prepare primer data
Separate and format primers files
Primers must be separated into distinct files because sense (forward) and antisense (reverse) primers won’t be subjected to the same formatting.
Hands-on: Create separate files for each primer
- Filter FASTA (Galaxy version 2.3) with the following parameters:
- param-file “FASTA sequences”: Primer file
- “Criteria for filtering on the headers”: Regular expression on the headers
- “Regular expression pattern the header should match”: Reverse_CHD8
- Add tags “#Primer” and “#Reverse”
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
- Click on the dataset to expand it
- Click on Add Tags galaxy-tags
- Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
- Press Enter
- Check that the tag appears below the dataset name
Tags beginning with # are special! They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):
- a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
- dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
- datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
- datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.
More information is in a dedicated #nametag tutorial.
- Expand one of the output datasets of the tool (by clicking on it)
- Click re-run galaxy-refresh the tool
This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.
- Filter FASTA (Galaxy version 2.3) with the following parameters:
- param-file “FASTA sequences”: Primer file
- “Criteria for filtering on the headers”: Regular expression on the headers
- “Regular expression pattern the header should match”: Forward_CHD8
- Add tags “#Primer” and “#Forward”
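To make the header-based selection concrete, here is a minimal Python sketch of the same idea: parse the primer FASTA text and keep only the records whose header matches a regular expression. The helper names `parse_fasta` and `filter_fasta` are hypothetical, and the sketch does not reproduce the Filter FASTA tool itself.

```python
import re

# The primer file pasted during the data upload step.
PRIMERS = """>Forward_CHD8
GAGGTGAAAGAATCATAAATTGG
>Reverse_CHD8
CCCTGTGTACAAATAGCTTTTGT
>Forward_AOPEP
TCATGGTTCCAGGCAGAGTTATT
>Reverse_AOPEP
TGCTGTGACAAGCCAACCAATGG
"""


def parse_fasta(text: str) -> dict:
    """Return a {header: sequence} dictionary from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def filter_fasta(records: dict, pattern: str) -> dict:
    """Keep only the records whose header matches the regular expression."""
    regex = re.compile(pattern)
    return {h: s for h, s in records.items() if regex.search(h)}


reverse_primer = filter_fasta(parse_fasta(PRIMERS), "Reverse_CHD8")
forward_primer = filter_fasta(parse_fasta(PRIMERS), "Forward_CHD8")
print(reverse_primer)
print(forward_primer)
```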
Hands-on: Remove potential gaps from primers
- Degap.seqs (Galaxy version 1.39.5.0) with the following parameters:
- param-files “fasta - Dataset”: Two Filter FASTA outputs (outputs of Filter FASTA tool)
To select multiple datasets:
- Click on param-files Multiple datasets
- Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
In the previous hands-on, removing potential gaps (- in the FASTA files) is a precaution: there are no gaps in our primer file. However, it is important to remove gaps at this point in case you are using different data; otherwise, some steps of the tutorial could fail (e.g. alignment).
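As a rough illustration of what removing gaps means, the Python sketch below strips the - character from sequence lines while leaving header lines untouched. The function name `degap_fasta` and the example record are hypothetical, and the actual Degap.seqs tool may handle additional gap symbols.

```python
def degap_fasta(text: str) -> str:
    """Remove '-' gap characters from sequence lines; headers are kept as-is."""
    cleaned = []
    for line in text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # header lines may legitimately contain '-'
        else:
            cleaned.append(line.replace("-", ""))
    return "\n".join(cleaned)


# Hypothetical record with gaps in the sequence and a '-' in the header.
example = ">primer-1\nGAGG--TGAAAG-AATC"
print(degap_fasta(example))
```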
The following hands-on applies only to the sequence of the antisense (reverse) primer.
Hands-on: Compute Reverse-Complement of the antisense (reverse) primer
- Reverse-Complement (Galaxy version 1.0.2+galaxy0) the sequence of the antisense (reverse) primer with the following parameters:
- param-file “Input file in FASTA or FASTQ format”: Degap.seqs #Reverse FASTA output (output of Degap.seqs tool)
See the introduction for an explanation of the reverse complement.
Prepare sequence data
Unzip data files
Hands-on: Unzip
- Unzip (Galaxy version 6.0+galaxy0) with the following parameters:
- param-file “input_file”: AOPEP_and_CHD8_sequences_20220907.zip?download=1
- “Extract single file”: All files
Question: How many files are there in the ZIP archive?
12 (if you have a different number of files, something likely went wrong)
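Outside Galaxy, the same sanity check can be done with Python's standard `zipfile` module. The sketch below is an assumption-laden illustration: `list_zip_members` is a hypothetical helper, and the tiny in-memory archive with two made-up file names only stands in for the real download.

```python
import io
import zipfile


def list_zip_members(zip_bytes: bytes) -> list:
    """Return the names of the files (not directories) stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


# Build a small stand-in archive in memory (file names are hypothetical).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample1_CHD8_F.ab1", b"placeholder")
    archive.writestr("sample1_CHD8_R.ab1", b"placeholder")

members = list_zip_members(buffer.getvalue())
print(len(members), members)
```

With the real archive, the length of the returned list is the number to compare against the expected 12 files.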
From now on, we’ll be working a lot on data collections:
- Click on param-collection Dataset collection in front of the input parameter you want to supply the collection to.
- Select the collection you want to use from the list
Filter collection to separate sense and antisense sequence files
As with the primers, sense and antisense sequences will undergo slightly different procedures, so they must be separated into distinct data collections.
Hands-on: Filter
- Extract element identifiers (Galaxy version 0.0.2) with the following parameters:
- param-collection “Dataset collection”: output collection (output of Unzip tool)
- Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
- param-file “Select lines from”: output (output of Extract element identifiers tool)
- In “Check”:
- param-repeat “Insert Check”
- “Find Regex”: ^[A-Za-z0-9_-]+F$
- “Replacement”: (leave empty)
- param-repeat “Insert Check”
- “Find Regex”: ^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$
- “Replacement”: (leave empty)
- Tag output with “#Reverse”
- Expand one of the output datasets of the tool (by clicking on it)
- Click re-run galaxy-refresh the tool
This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.
- Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
- param-file “Select lines from”: output (output of Extract element identifiers tool)
- In “Check”:
- param-repeat “Insert Check”
- “Find Regex”: ^[A-Za-z0-9_-]+R$
- “Replacement”: (leave empty)
- param-repeat “Insert Check”
- “Find Regex”: ^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$
- “Replacement”: (leave empty)
- Tag output with “#Forward”
- Filter collection with the following parameters:
- param-collection “Input Collection”: output collection (output of Unzip tool)
- “How should the elements to remove be determined?”: Remove if identifiers are ABSENT from file
- param-files “Filter out identifiers absent from”: #Forward files list and #Reverse files list (outputs of Regex Find And Replace tool)
- Tag the (filtered) outputs with “#Forward” and “#Reverse”
Comment: What's happening in this section?
- First step: extracting the list of file names in the data collection.
- Second step: removing file names ending with “F” or containing “AOPEP”, which creates a list of antisense (reverse) sequence files of the marker CHD8.
- Third step: removing file names ending with “R” or containing “AOPEP”, which creates a list of sense (forward) sequence files of the marker CHD8.
- Fourth step: selecting files in the collection, which creates two distinct collections, with sense (forward) sequence files on one hand and antisense (reverse) sequence files on the other.
For the second and third steps, we used regular expressions (Regex):
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression    Matches
abc                   an occurrence of abc within your data
(abc|def)             abc or def
[abc]                 a single character which is either a, b, or c
[^abc]                a character that is NOT a, b, nor c
[a-z]                 any lowercase letter
[a-zA-Z]              any letter (upper or lower case)
[0-9]                 numbers 0-9
\d                    any digit (same as [0-9])
\D                    any non-digit character
\w                    any alphanumeric character
\W                    any non-alphanumeric character
\s                    any whitespace
\S                    any non-whitespace character
.                     any character
\.                    a literal . character
{x,y}                 between x and y repetitions
^                     the beginning of the line
$                     the end of the line
Note: characters such as *, ?, ., + etc. have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.
Examples
Regular expression    Matches
\d{4}                 4 digits (e.g. a year)
chr\d{1,2}            chr followed by 1 or 2 digits
.*abc$                anything with abc at the end of the line
^$                    empty line
^>.*                  line starting with > (e.g. Fasta header)
^[^>].*               line not starting with > (e.g. Fasta sequence)
Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2 etc. for the first and second captured values. If you want to refer to the whole match, use &.
Regular expression      Input           Captures
chr(\d{1,2})            chr14           \1 = 14
(\d{2}) July (\d{4})    24 July 1984    \1 = 24, \2 = 1984
An expression like s/find/replacement/g indicates a replacement expression. This will search (s) for any occurrence of find, and replace it with replacement. It will do this globally (g), which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifiers such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
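In the expressions used in this section, [A-Za-z0-9_-] matches any single character from A to Z, a to z, 0 to 9, _ or -, and the following + means that such a character occurs one or more times. Combined with ^ and $, the expression ^[A-Za-z0-9_-]+F$ therefore matches an entire identifier that ends with F.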
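The effect of the two pairs of expressions can be checked with a short Python sketch. The identifiers below are hypothetical stand-ins for the real file names, and `keep_identifiers` is an assumed helper that blanks every matching name and then drops the blanks, which simplifies what the Galaxy tools do across two separate steps.

```python
import re

# Hypothetical element identifiers; the real archive uses its own file names.
identifiers = ["S1_CHD8_F", "S1_CHD8_R", "S1_AOPEP_F", "S1_AOPEP_R"]

ENDS_WITH_F = r"^[A-Za-z0-9_-]+F$"
ENDS_WITH_R = r"^[A-Za-z0-9_-]+R$"
CONTAINS_AOPEP = r"^[A-Za-z0-9_-]+AOPEP[A-Za-z0-9_-]+$"


def keep_identifiers(names, patterns):
    """Replace every name matching a pattern with an empty string, then drop empties."""
    for pattern in patterns:
        names = [re.sub(pattern, "", name) for name in names]
    return [name for name in names if name]


# Removing forward and AOPEP names leaves the CHD8 reverse files.
reverse_list = keep_identifiers(identifiers, [ENDS_WITH_F, CONTAINS_AOPEP])
# Removing reverse and AOPEP names leaves the CHD8 forward files.
forward_list = keep_identifiers(identifiers, [ENDS_WITH_R, CONTAINS_AOPEP])
print(reverse_list, forward_list)
```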
Convert AB1 sequence files to FASTQ and trim low-quality ends
In Sanger sequencing, the ends of a read tend to be unreliable (each nucleotide has a quality score reflecting how much it can be trusted). It is important to remove these sections so that wrong nucleotides aren’t introduced into the sequences.
Hands-on: AB1 to FASTQ files and trim low quality ends
Do these steps twice! We have sense (forward) and antisense (reverse) sequence data collections; run these steps on each of the two “(filtered)” data collections. Re-running the tool can help:
- Expand one of the output datasets of the tool (by clicking on it)
- Click re-run galaxy-refresh the tool
This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.
- ab1 to FASTQ converter (Galaxy version 1.20.0) with the following parameters:
- param-collection “Input ab1 file”: (filtered) output collection (output of Filter collection tool)
- “Do you want trim ends according to quality scores ?”: No, use full sequences.
In this tool, it is possible to trim low-quality ends during the conversion, but the parametrization is less precise.
- seqtk_trimfq (Galaxy version 1.3.1) with the following parameters:
- param-collection “Input FASTA/Q file”: output collection (output of ab1 to FASTQ converter tool)
- “Mode for trimming FASTQ File”: Quality
- “Maximally trim down to INT bp”: 0
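To give an intuition for quality-based end trimming, the sketch below implements a simplified Mott-style trimming in Python: each base scores a cutoff minus its error probability, and the highest-scoring contiguous stretch is kept. This is an illustration only, not seqtk's implementation; the function name `mott_trim`, the 0.05 cutoff, and the example read are assumptions.

```python
def mott_trim(seq, quals, cutoff=0.05):
    """Keep the contiguous segment maximising the sum of (cutoff - error probability).

    seq   : nucleotide string
    quals : list of Phred quality scores, one per nucleotide
    """
    best_sum, best_span = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        error_probability = 10 ** (-q / 10)
        run_sum += cutoff - error_probability
        if run_sum <= 0:
            # The running segment is no longer worth keeping: restart after this base.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, i + 1)
    start, end = best_span
    return seq[start:end], quals[start:end]


# Hypothetical read with poor quality at both ends.
read = "ACGTACGTAC"
phred = [2, 2, 30, 30, 30, 30, 30, 30, 3, 2]
trimmed_seq, trimmed_quals = mott_trim(read, phred)
print(trimmed_seq, trimmed_quals)
```

Low-quality bases at both ends contribute negative scores, so they fall outside the best segment, while a single mediocre base in the middle of a good stretch can survive.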
Compute reverse complement sequence for antisense (reverse) sequences only
See the introduction for an explanation of the reverse complement.
Hands-on: Reverse complement
- FASTQ Groomer (Galaxy version 1.1.5) with the following parameters:
- param-collection “File to groom”: #Reverse output collection (output of seqtk_trimfq tool)
- “Advanced Options”: Show Advanced Options
- “Summarize input data”: Do not Summarize Input (faster)
Comment: What is this step?
It is a necessary step to get the right input format for the following Reverse-Complement tool.
- Reverse-Complement (Galaxy version 1.0.2+galaxy0) with the following parameters:
- param-collection “Input file in FASTA or FASTQ format”: #Reverse output collection (output of FASTQ Groomer tool)
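For a FASTQ record, reverse-complementing involves the quality string as well: the nucleotides are reversed and complemented, and the quality characters are reversed so that each score stays attached to its base. A minimal Python sketch, with a hypothetical function name and a made-up record:

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_fastq(seq: str, qual: str):
    """Reverse-complement the bases and reverse the per-base quality string."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have the same length")
    return seq.translate(COMPLEMENT)[::-1], qual[::-1]


# Hypothetical antisense read: four bases with their quality characters.
new_seq, new_qual = reverse_complement_fastq("AACG", "IIH#")
print(new_seq, new_qual)
```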
Merge corresponding sense and antisense sequences into single files
Hands-on: Sort collections
Do this step twice! The sense (forward) and antisense (reverse) sequence collections must be in the same order so that each sense sequence is merged with its matching antisense sequence. Re-running the tool can help:
- Expand one of the output datasets of the tool (by clicking on it)
- Click re-run galaxy-refresh the tool
This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.
- Sort collection with the following parameters:
- param-collection “Input Collection”: Collection (output of seqtk_trimfq tool for the sense sequences, then output of Reverse-Complement tool for the antisense sequences)
- “Sort type”: alphabetical
Hands-on: Merge sense (forward) and antisense (reverse) sequence files
- seqtk_mergepe (Galaxy version 1.3.1) with the following parameters:
- param-collection “Input FASTA/Q file #1”: output (output of Sort collection tool)
- param-collection “Input FASTA/Q file #2”: output (output of Sort collection tool)
Check that there are two sequences in each of the three files of the newly created collection.
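The reason for sorting both collections before merging is that elements are paired by position. The Python sketch below illustrates this pairing logic under stated assumptions: the identifiers are hypothetical, and `pair_by_order` is an assumed helper rather than what Sort collection and seqtk_mergepe run internally.

```python
def pair_by_order(forward_ids, reverse_ids):
    """Sort both identifier lists alphabetically and pair them position by position."""
    forward_sorted = sorted(forward_ids)
    reverse_sorted = sorted(reverse_ids)
    if len(forward_sorted) != len(reverse_sorted):
        raise ValueError("both collections must contain the same number of elements")
    return list(zip(forward_sorted, reverse_sorted))


# Hypothetical identifiers arriving in different orders in the two collections.
forward_ids = ["S2_CHD8_F", "S1_CHD8_F", "S3_CHD8_F"]
reverse_ids = ["S3_CHD8_R", "S1_CHD8_R", "S2_CHD8_R"]
pairs = pair_by_order(forward_ids, reverse_ids)
print(pairs)
```

Without the sort, the first forward file would be merged with the first reverse file regardless of the sample they belong to.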
Convert FASTQ files to FASTA
Hands-on: FASTQ to FASTA
- FASTQ Groomer (Galaxy version 1.1.5) with the following parameters:
- param-collection “File to groom”: default (output of seqtk_mergepe tool)
- “Advanced Options”: Show Advanced Options
- “Summarize input data”: Do not Summarize Input (faster)
Comment: What is this step?
It is a necessary step to get the right input format for the following FASTQ to FASTA tool.
- FASTQ to FASTA (Galaxy version 1.0.2+galaxy2) with the following parameters:
- param-collection “FASTQ file to convert”: output collection (output of FASTQ Groomer tool)
- “Discard sequences with unknown (N) bases”: no
- “Rename sequence names in output file (reduces file size)”: no
- “Compress output FASTA”: No
Comment: information
If this step doesn’t work, one can try the FASTQ to tabular tool followed by the tabular to FASTA tool instead.
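Conceptually, converting FASTQ to FASTA keeps the identifier and the nucleotides of each record and discards the quality line. A minimal Python sketch, assuming simple four-line FASTQ records and using a hypothetical function name:

```python
def fastq_to_fasta(text: str) -> str:
    """Convert four-line FASTQ records into FASTA records (qualities are dropped)."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta_lines = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, _quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record starting at line %d" % (i + 1))
        fasta_lines.append(">" + header[1:])
        fasta_lines.append(sequence)
    return "\n".join(fasta_lines) + "\n"


# Two hypothetical reads, as they might appear after merging sense and antisense.
example = "@sample1_F\nACGT\n+\nIIII\n@sample1_R\nGG\n+\n@@\n"
print(fastq_to_fasta(example))
```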
Align sequences and retrieve consensus for each sequence
Hands-on: Align and consensus
- Align sequences (Galaxy version 1.9.1.0) with the following parameters:
- param-collection “Input fasta file”: output collection (output of FASTQ to FASTA tool)
- “Method for aligning sequences”: clustalw
- “Minimum percent sequence identity to closest blast hit to include sequence in alignment”: 0.1
- Consensus sequence from aligned FASTA (Galaxy version 1.0.0) with the following parameters:
- param-collection “Input fasta file with at least two sequences”: aligned_sequences (output of Align sequences tool)
- Add tag “#Consensus”
- Merge.files (Galaxy version 1.39.5.0) with the following parameters:
- “Merge”: fasta files
- param-collection “inputs - fasta”: output collection (output of Consensus sequence from aligned FASTA tool)
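To illustrate what a consensus of two aligned sequences can look like, here is a minimal Python sketch. The rules are assumptions chosen for the example (keep agreeing bases, fill a gap with the other strand's base, write N on disagreement) and may differ from those applied by the Consensus sequence from aligned FASTA tool; the function name and sequences are hypothetical.

```python
def pairwise_consensus(sense: str, antisense: str) -> str:
    """Build a consensus from two aligned sequences of equal length.

    Assumed rules: identical bases are kept, a gap is filled with the base
    from the other sequence, and disagreements are written as N.
    """
    if len(sense) != len(antisense):
        raise ValueError("aligned sequences must have the same length")
    consensus = []
    for a, b in zip(sense.upper(), antisense.upper()):
        if a == b:
            consensus.append(a)
        elif a == "-":
            consensus.append(b)
        elif b == "-":
            consensus.append(a)
        else:
            consensus.append("N")
    # Columns where both sequences have a gap carry no information.
    return "".join(consensus).replace("-", "")


# Hypothetical aligned sense and reverse-complemented antisense reads.
print(pairwise_consensus("ACGT-A", "AC-TTA"))
print(pairwise_consensus("ACGT", "ACCT"))
```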
Manage primers and sequences
Merge and align consensus sequences file and primer files
Hands-on: Merge and format consensus sequences + primers file
- Merge.files (Galaxy version 1.39.5.0) with the following parameters:
- “Merge”: fasta files
- param-files “inputs - fasta”: consensus sequences (output of Merge.files tool), Reverse primer (output of Reverse-Complement tool), Forward primer (output of Degap.seqs tool)
To select multiple datasets:
- Click on param-files Multiple datasets
- Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
- Remove tags “#Forward” and “#Reverse”
- Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
- param-file “Select lines from”: output (output of Merge.files tool)
- In “Check”:
- param-repeat “Insert Check”
- “Find Regex”: ([A-Z-])>
- “Replacement”: \1\n>
Comment: What's going on in this second step?
Sometimes, the Merge.files tool doesn’t keep the linefeed between the files; this step corrects that and produces a properly formatted FASTA file.
For the second step, we used regular expressions (Regex):
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression : Matches
- abc : an occurrence of abc within your data
- (abc|def) : abc or def
- [abc] : a single character which is either a, b, or c
- [^abc] : a character that is NOT a, b, nor c
- [a-z] : any lowercase letter
- [a-zA-Z] : any letter (upper or lower case)
- [0-9] : numbers 0-9
- \d : any digit (same as [0-9])
- \D : any non-digit character
- \w : any alphanumeric character
- \W : any non-alphanumeric character
- \s : any whitespace
- \S : any non-whitespace character
- . : any character
- \. : a literal dot
- {x,y} : between x and y repetitions
- ^ : the beginning of the line
- $ : the end of the line
Note: you see that characters such as *, ?, ., + etc. have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.
Examples
Regular expression : Matches
- \d{4} : 4 digits (e.g. a year)
- chr\d{1,2} : chr followed by 1 or 2 digits
- .*abc$ : anything with abc at the end of the line
- ^$ : empty line
- ^>.* : line starting with > (e.g. FASTA header)
- ^[^>].* : line not starting with > (e.g. FASTA sequence)
Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2 etc. for the first and second captured values. If you want to refer to the whole match, use &.
Regular expression : Input : Captures
- chr(\d{1,2}) : chr14 : \1 = 14
- (\d{2}) July (\d{4}) : 24 July 1984 : \1 = 24, \2 = 1984
An expression like s/find/replacement/g indicates a replacement expression: this will search (s) for any occurrence of find, and replace it with replacement. It will do this globally (g), which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifiers such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
Here, [A-Z-] matches any character from A to Z or a hyphen (-), \1 repeats the text captured between parentheses in the “Find Regex” field, and \n is a line feed.
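To see what this find-and-replace does outside Galaxy, here is a minimal Python sketch. It assumes Python's re module behaves like the Galaxy tool for this simple pattern, and the record names and sequences are made up for illustration.

```python
import re

# Toy merged file (made-up records): the header of the second record is glued
# to the end of the first sequence because a line feed was lost in the merge.
merged = ">consensus_1\nACGT-ACGT>primer_forward\nACGTAC\n"

# Same find pattern and replacement as in the hands-on step:
# a sequence character followed by ">" gets a line feed inserted between them.
fixed = re.sub(r"([A-Z-])>", r"\1\n>", merged)

print(fixed)
```

The first header is left untouched because no sequence character precedes its ">".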
Once you have the consensus sequences, check whether they contain any ambiguous nucleotides. An ambiguous nucleotide means that different nucleotides were found at the same position in the sense and antisense sequences, so some checks are needed. The ambiguity codes for two possible bases are:
- Y = C or T
- R = A or G
- W = A or T
- S = G or C
- K = T or G
- M = C or A
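The codes listed above can also be searched for programmatically. The following is a small Python sketch, not part of the Galaxy workflow: the function name find_ambiguities and the toy sequence are hypothetical.

```python
# Ambiguity codes listed above, mapped to the two bases each one stands for.
IUPAC_AMBIGUOUS = {
    "Y": "CT", "R": "AG", "W": "AT",
    "S": "GC", "K": "TG", "M": "CA",
}

def find_ambiguities(sequence):
    """Return (1-based position, code) for every ambiguous base in a sequence."""
    return [
        (index + 1, base)
        for index, base in enumerate(sequence.upper())
        if base in IUPAC_AMBIGUOUS
    ]

# Toy sequence (made up) with a Y at position 5 and a W at position 10.
toy_sequence = "ACGTYACGTWAC"
print(find_ambiguities(toy_sequence))
```

Positions are reported starting from 1 so that they can be compared with the positions shown in the alignment viewer.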
Hands-on: Look for ambiguous nucleotides
Click on output of Regex Find and Replace tool in the history to expand it
Click on galaxy-barchart Visualize
Select Multiple Sequence Alignment
Set the color scheme to Clustal; ambiguous nucleotides are highlighted in dark blue
There are two nucleotide positions to check: Y at 121 in sequence consensus_B05_CHD8-III6brother-18 and W at 286 in sequence consensus_05_CHD8-III6mother-18.
You need to go back to your FASTQ sequences to understand the origin of the ambiguity.
- Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
  - param-file “Select lines from”: #Consensus #Primer output (output of Regex Find And Replace tool)
  - In “Check”:
    - param-repeat “Insert Check”
      - “Find Regex”: ^[ACTG]+([ACTG]{20}Y)[ACTG]+$
      - “Replacement”: \1
    - param-repeat “Insert Check”
      - “Find Regex”: ^[ACTG]+([ACTG]{20}W)[ACTG]+$
      - “Replacement”: \1
Comment: What's going on in this step?
We want to retrieve the 20 nucleotides before the ambiguities.
We use regular expressions (Regex):
Here, [ACTG] matches any one of the four unambiguous nucleotides. It is followed either by +, meaning “at least once in the character chain”, or by {20}, meaning “20 times”.
In the output of this tool we get:
- the 20 nucleotides before the Y at position 121 in sequence consensus_B05_CHD8-III6brother-18: CAGGCACGATGTCATCGAAT
- the 20 nucleotides before the W at position 286 in sequence consensus_05_CHD8-III6mother-18: AGTCCTCTTAGTTTATAGAT
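The same extraction can be sketched in Python with the pattern from the hands-on step. The helper name context_before is hypothetical, and the flanking bases added around the fragment in the toy sequence are made up.

```python
import re

def context_before(sequence, code, length=20):
    """Return the `length` unambiguous bases preceding `code`, plus the code itself.

    Returns None when the line does not match, for example when fewer than
    `length` + 1 bases precede the code or when another ambiguity is present.
    """
    pattern = rf"^[ACTG]+([ACTG]{{{length}}}{code})[ACTG]+$"
    match = re.match(pattern, sequence)
    return match.group(1) if match else None

# Toy line: made-up flanks around the fragment reported above.
toy = "GG" + "CAGGCACGATGTCATCGAAT" + "Y" + "CCAA"
print(context_before(toy, "Y"))
```

Because the captured group includes the ambiguity code, the replacement \1 keeps the 20 nucleotides followed by the Y (or W) itself.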
- FASTQ masker (Galaxy version 1.1.5) with the following parameters:
  - param-collection “File to mask”: #Forward #Reverse collection (output of FASTQ groomer tool)
  - “Mask input with”: Lowercase
  - “Quality score”: 10
This tool displays low-quality bases in lowercase, which makes potential errors easier to spot.
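As a rough illustration of lowercase masking, here is a Python sketch. It is not the tool's implementation: the function name mask_low_quality is hypothetical, and it assumes Sanger-encoded qualities (Phred + 33) and a "less than or equal to the threshold" comparison.

```python
def mask_low_quality(bases, quality_string, threshold=10, offset=33):
    """Lowercase every base whose Phred score is at or below `threshold`.

    Assumptions: Phred+33 (Sanger) quality encoding and a "<=" comparison.
    """
    masked = []
    for base, symbol in zip(bases, quality_string):
        score = ord(symbol) - offset  # decode the quality character
        masked.append(base.lower() if score <= threshold else base.upper())
    return "".join(masked)

# Toy read (made up): "I" encodes Phred 40, "+" Phred 10 and "&" Phred 5.
print(mask_low_quality("ACGTAC", "II+&II"))
```

Keeping the base itself and only changing its case preserves the sequence while flagging the calls that deserve a second look.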
Open galaxy-eye the B05_CHD8-III6brother-18 output of the FASTQ masker tool and search (Ctrl+F) for CAGGCACGATGTCATCGAAT. In the sense sequence (ID ending with 18F), this fragment is followed by a low-quality c, whereas in the antisense sequence it is followed by a T of decent quality. Additionally, when looking into the galaxy-eye #Consensus #Primer output of the Regex Find And Replace tool, we can see that the two other consensus sequences (consensus_05_CHD8-III6mother-18 and consensus_07_CHD8-III6-18) have a T at this same position. The nucleotide at position 121 in sequence consensus_B05_CHD8-III6brother-18 is therefore most likely a T.
Open galaxy-eye the 05_CHD8-III6mother-18 outputs of the FASTQ masker tool and search (Ctrl+F) for AGTCCTCTTAGTTTATAGAT. In the antisense sequence (ID ending with 18R), this fragment is followed by a low-quality t, whereas in the sense sequence it is followed by an A of decent quality. Additionally, when looking into the galaxy-eye #Consensus #Primer output of the Regex Find And Replace tool, we can see that the two other consensus sequences (consensus_B05_CHD8-III6brother-18 and consensus_07_CHD8-III6-18) have an A at this same position. The nucleotide at position 286 in sequence consensus_05_CHD8-III6mother-18 is therefore most likely an A.
You can now correct them by clicking on the output of the Regex Find And Replace tool in the history to expand it
Click on galaxy-barchart Visualize
Select Editor and:
- manually replace the Y with T in consensus_B05_CHD8-III6brother-18
- manually replace the W with A in consensus_05_CHD8-III6mother-18
and click on export
Now you can align your sequences with the primers. It is then common to cut the sequences between the primers to keep the right fragment of each sequence.
Hands-on: Align sequences and primers
- Align sequences (Galaxy version 1.9.1.0) with the following parameters:
  - param-file “Input fasta file”: out_file1 Regex Find And Replace (modified)
  - “Method for aligning sequences”: mafft
  - “Minimum percent sequence identity to closest blast hit to include sequence in alignment”: 0.1
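Cutting the aligned sequences between the primers is not part of this hands-on, but the idea can be sketched in Python. Everything below is an assumption made for illustration: the alignment is held as a dictionary of equal-length gapped strings, both primers are already in the same orientation as the consensus sequences, and all names and sequences are made up.

```python
def non_gap_span(aligned_row):
    """Return (first, last) column indices that hold a base in an aligned row."""
    columns = [i for i, char in enumerate(aligned_row) if char not in "-."]
    return columns[0], columns[-1]

def trim_between_primers(alignment, forward_name, reverse_name):
    """Keep, for each non-primer row, the columns lying between the two primers."""
    _, forward_end = non_gap_span(alignment[forward_name])
    reverse_start, _ = non_gap_span(alignment[reverse_name])
    return {
        name: row[forward_end + 1:reverse_start]
        for name, row in alignment.items()
        if name not in (forward_name, reverse_name)
    }

# Toy alignment (made up), every row has the same length.
toy_alignment = {
    "forward_primer": "ACGT----------",
    "reverse_primer": "----------TTGA",
    "consensus_1":    "ACGTGGCATCTTGA",
    "consensus_2":    "ACGTGG-ATCTTGA",
}
trimmed = trim_between_primers(toy_alignment, "forward_primer", "reverse_primer")
print(trimmed)
```

Working on alignment columns rather than on raw sequence positions keeps the trimmed fragments aligned with each other, gaps included.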
Check that your sequences belong to the right taxonomic group by running a BLAST search against the NCBI database.
Hands-on: NCBI BLAST
- NCBI BLAST+ blastn (Galaxy version 2.10.1+galaxy2) with the following parameters:
  - param-file “Nucleotide query sequence(s)”: out_file1 (output of Regex Find And Replace tool)
  - “Subject database/sequences”: Locally installed BLAST database
    - “Nucleotide BLAST database”: select the most recent nt_ database
  - “Output format”: Tabular (select which columns)
    - “Standard columns”: qseqid, pident, mismatch and gapopen
    - “Extended columns”: gaps and salltitles
    - “Other identifier columns”: saccver
  - “Advanced Options”: Show Advanced Options
    - “Maximum hits to consider/show”: 10
    - “Restrict search of database to a given set of ID’s”: No restriction, search the entire database
Question: What species do the sequences we cleaned belong to?
Homo sapiens
Performing such checks is good practice: it makes sure the sequencing went as planned and that your samples haven’t been contaminated.
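If you download the tabular BLAST output, the best hit per query can be pulled out with a few lines of Python. This is only a sketch: the column order in COLUMNS is an assumption and must be adjusted to match your own file, the function name best_hits is hypothetical, and the two rows are made-up toy data, not real BLAST results.

```python
import csv
import io

# Assumed column order (hypothetical): adjust it to match your own output.
COLUMNS = ["qseqid", "pident", "mismatch", "gapopen", "gaps", "salltitles", "saccver"]

def best_hits(tabular_text):
    """Return, for each query, the hit with the highest percent identity."""
    best = {}
    reader = csv.DictReader(
        io.StringIO(tabular_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # hit titles may contain quote characters
    )
    for row in reader:
        row["pident"] = float(row["pident"])
        current = best.get(row["qseqid"])
        if current is None or row["pident"] > current["pident"]:
            best[row["qseqid"]] = row
    return best

# Made-up rows for illustration only.
toy_output = (
    "query_1\t99.5\t1\t0\t0\tHomo sapiens toy record A\tACC0001\n"
    "query_1\t97.0\t5\t1\t2\tPan troglodytes toy record B\tACC0002\n"
)
hits = best_hits(toy_output)
print(hits["query_1"]["salltitles"])
```

Reading the title of the best hit for each query is a quick way to confirm the expected species before going further.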
Conclusion
We successfully cleaned AB1 sequence files!
AOPEP Sanger files
A history following the same steps for the AOPEP marker files is available: Clean AOPEP sequences