Converting between common single cell data formats
Author(s) | Julia Jakiela, Morgan Howells, Wendi Bacon |
Editor(s) | Helena Rasche |
Tester(s) | Pavankumar Videm, Mehmet Tekman |
Reviewers |
Overview
Questions:
- What are the most popular single cell datatypes?
- What if the format of my files is different than that used in the tutorial I want to follow?
- Where should I start the analysis, depending on the format of my data?
- How do I ingest data into Galaxy?
- How do I convert datasets between formats?
Objectives:
- You will identify different single cell file formats.
- You will import single cell data into Galaxy using different methods.
- You will manipulate the metadata and matrix files.
- You will perform conversions between the most common single cell formats.
- You will downsample FASTQ files.
Time estimation: 1 hour
Published: Feb 13, 2024
Last modification: Apr 10, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00418
Revision: 2
You finally decided to analyse some single cell data. You got your files, either from the lab or from publicly available sources, opened the first tutorial available on the Galaxy Training Network and… you hit the wall! The format of your files is not compatible with the one used in the tutorial! Have you been there? This tutorial was created to help you overcome that problem and ensure data interoperability in single cell analysis. Once you get your data into Galaxy in the right format, that’s already 50% of the way to success. Additionally, by using format conversion, you will be able to use the different packages presented in tutorials that may require different datatypes.
Single cell datatypes
To start with, here are the most common formats and datatypes that you might come across if you work with single cell data:
- Tabular - simply using TSV, CSV or TXT formats to store expression matrix as well as cell and gene metadata.
- MTX - it’s just a sparse matrix format with genes on the rows and cells on the columns as output by Cell Ranger.
- HDF5 - Hierarchical Data Format - can store datasets and groups. A dataset is a multidimensional array of data elements, together with supporting metadata. A group is a structure for organizing objects in an HDF5 file. This format allows for storing both the count matrices and all metadata in a single file rather than having separate features, barcodes and matrix files.
- AnnData objects - anndata is a Python package for handling annotated data matrices. In Galaxy, you’ll see AnnData objects in h5ad format, which is based on the standard HDF5 (h5) format. There are lots of Python tools that work with this format, such as Scanpy, MUON, Cell Oracle, SquidPy, etc.
- Loom - it is simply an HDF5 file that contains specific groups containing the main matrix as well as row and column attributes and can be read by any language supporting HDF5. Loompy has been released as a Python API to interact with loom files, and loomR is its implementation in R.
- Zarr - a Python package providing an implementation of compressed, chunked, N-dimensional arrays, designed for use in parallel computing. The Zarr file format offers powerful compression options, supports multiple data store backends, and can read/write your NumPy arrays.
- Seurat objects - a representation of single-cell expression data for R. In Galaxy you might see them in rdata format.
- Single Cell Experiment (SCE) object - defines an S4 class for storing data from single-cell experiments and provides a more formalized approach towards construction and accession of data. The S4 system is one of R’s systems for object-oriented programming. In Galaxy you might see SCE objects in rdata format.
- CellDataSet (CDS) object - the main class used by Monocle to hold single cell expression data. In Galaxy you might see CDS objects in rdata format.
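To make the MTX layout from the list above more concrete, here is a minimal sketch in plain Python that parses a tiny Matrix Market file in coordinate format. The file content is invented for illustration, and `parse_mtx` is a hypothetical helper, not part of any tool used in this tutorial. Real Cell Ranger output is typically compressed and comes with separate barcodes and features files.

```python
# Parse a tiny, made-up Matrix Market (MTX) coordinate file by hand.
mtx_text = """%%MatrixMarket matrix coordinate integer general
3 2 3
1 1 5
3 1 2
2 2 7
"""

def parse_mtx(text):
    """Return (n_rows, n_cols, {(row, col): value}) using 0-based indices."""
    # Lines starting with '%' are the banner or comments.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("%")]
    # First data line: number of rows, columns and stored (non-zero) entries.
    n_rows, n_cols, n_entries = (int(x) for x in lines[0].split())
    entries = {}
    for ln in lines[1:1 + n_entries]:
        row, col, value = ln.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)  # MTX is 1-based
    return n_rows, n_cols, entries

n_genes, n_cells, counts = parse_mtx(mtx_text)
# Expand to a dense genes x cells table; entries that are not stored are zeros.
dense = [[counts.get((g, c), 0) for c in range(n_cells)] for g in range(n_genes)]
print(dense)
```

Only the three non-zero values are stored in the file, which is why this format is convenient for single cell count matrices, where most entries are zero.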
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
- Open your Galaxy server
- Click on the curriculum icon on the top menu; this will open the GTN inside Galaxy.
- Navigate to your tutorial
- Tool names in tutorials will be blue buttons that open the correct tool for you
- Note: this does not work for all tutorials (yet)
- You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
- We’ve had some issues with Tutorial mode on Safari for Mac users.
- Try a different browser if you aren’t seeing the button.
Data import
As you can see above, there are multiple ways to store single cell data. Therefore, there are also many ways to get that data! Before any format conversion, we need to import the data. In our tutorials we often use Zenodo links, but you can also upload files directly from your computer. There are also publicly available resources which you can easily access through public atlases, such as the Single Cell Expression Atlas or the Human Cell Atlas data portal. We created a dedicated tutorial to show how to use those atlases to retrieve data. But today we’re here to focus on data conversion!
SCEasy Tool
In the Galaxy ToolShed there is a wonderful tool called SCEasy ( Galaxy version 0.0.7+galaxy2), which allows you to convert between common single cell formats, such as:
- AnnData to CellDataSet (CDS)
- AnnData to Seurat
- Loom to AnnData
- Loom to SingleCellExperiment (SCE)
- SingleCellExperiment (SCE) to AnnData
- SingleCellExperiment (SCE) to Loom
- Seurat to AnnData
- Seurat to SingleCellExperiment (SCE)
Warning: Two SCEasy tools
As of the writing of this tutorial, the updated SCEasy tool is called SCEasy Converter ( Galaxy version 0.0.7+galaxy2) and it’s only available on usegalaxy.eu. The second tool is called SCEasy convert ( Galaxy version 0.0.5+galaxy1); it works on usegalaxy.org, but has limited conversion options. Both tools should be visible on singlecell.usegalaxy.eu if you look for them in the search box.
In this tutorial you will see multiple examples of SCEasy Tool in action. However, sometimes it is useful to know how to do this conversion manually or at least to know how it all works and better understand the structure of the files. Therefore, we also have examples showing how to convert objects manually or prepare the input files for some of our single cell tutorials and workflows.
Seurat -> AnnData
As mentioned above, there are two SCEasy tools currently available. For conversion to AnnData, the latest tool, SCEasy Converter ( Galaxy version 0.0.7+galaxy2), converts Seurat into the latest version of the AnnData object. However, it is currently not possible to use that output with the downstream EBI single-cell tools because they only support older versions of AnnData. If you need to use those tools, the solution is to use the second SCEasy tool, SCEasy convert ( Galaxy version 0.0.5+galaxy1), which has limited conversion options but can generate AnnData files compatible with the EBI single-cell tools that we are going to use next in the workflow. Our Seurat starting file was generated by pulling data from the Single Cell Expression Atlas and transforming it into the desired format, as shown in the data import tutorial. It will be our toy dataset.
Hands-on: Get toy data
- Create a new history for this tutorial
Import the Seurat object from Zenodo
https://zenodo.org/records/10397653/files/Seurat_object.rdata
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Alternatively, you can import the history in which we created the Seurat object: Input history
- Open the link to the shared history
- Click on the Import this history button on the top left
- Enter a title for the new history
- Click on Copy History
- Rename galaxy-pencil the dataset
Seurat object
Check that the datatype is
rdata
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
rdata
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Hands-on: Choose Your Own Tutorial
This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Pick the path you want to follow from the options below.
Choose below if you just want to convert your object quickly or see how it all happens behind the scenes!
Quick one tool method
Hands-on: SCEasy Converter
- SCEasy Converter ( Galaxy version 0.0.7+galaxy2) with the following parameters:
- “Convert From / To”:
Seurat to AnnData
- param-file “Input object in rds,rdata format”:
Seurat object
- Rename galaxy-pencil the output:
AnnData object
And that’s it! Please note that the output file is the newest AnnData version and is not compatible with the EBI tools used in the Filter, Plot, Explore workflow.
Convert to AnnData object compatible with Filter, Plot, Explore workflow
Here, we use the other SCEasy tool, which generates an older version of AnnData that is compatible with the EBI single-cell tools that we are going to use next in the workflow.
Hands-on: Seurat to AnnData with SCEasy convert
- SCEasy convert ( Galaxy version 0.0.5+galaxy1) with the following parameters:
- “Direction of conversion”:
Seurat to AnnData
- param-file “Input object in Seurat RDS format”:
Seurat object
(if the dataset does not show up in the corresponding input field, just drag the dataset from the history panel and drop it into the input field)
- “Name of the assay to be transferred”:
RNA
- “Data type of the assay to be transferred”:
data
- Rename galaxy-pencil the output
Converted EBI-compatible AnnData file
Check that the datatype is
h5ad
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
h5ad
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Now, if we make a small modification to the metadata, we can use the generated dataset as an input file in the workflow we created in the previous tutorial! Let’s see how it works:
Hands-on: Modify AnnData object
- AnnData Operations ( Galaxy version 1.8.1+galaxy92)
- Make sure you are using version 1.8.1+galaxy92 of the tool (change by clicking on tool-versions Versions button)
- Set the following parameters:
- param-file In “Input object in hdf5 AnnData format”:
Converted EBI-compatible AnnData file
- In “Change field names in AnnData var”:
- param-repeat “Insert Change field names in AnnData var”
- “Original name”:
name
- “New name”:
Symbol
- “Gene symbols field in AnnData”:
Symbol
- In “Flag genes that start with these names”:
- param-repeat “Insert Flag genes that start with these names”
- “Starts with”:
mt-
- “Var name”:
mito
- Rename galaxy-pencil output
AnnData for Filter, Plot, Explore workflow
The object is ready to start its journey through the Filter, Plot, Explore workflow. All thanks to one conversion tool - how awesome is that!
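Conceptually, the AnnData Operations step above does two things to the gene (var) table: it renames one column and adds a flag for genes whose symbol starts with a given prefix. The following is a minimal sketch of that idea on an invented three-gene table, using plain Python dictionaries. The helper names `rename_field` and `flag_prefix` are hypothetical and this is not the tool's own implementation.

```python
# Invented gene (var) table; each dict describes one gene.
var = [
    {"gene_id": "g1", "name": "mt-Co1"},
    {"gene_id": "g2", "name": "Sox17"},
    {"gene_id": "g3", "name": "mt-Cytb"},
]

def rename_field(rows, old, new):
    """Return rows with the key `old` renamed to `new`, keeping column order."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]

def flag_prefix(rows, field, prefix, flag_name):
    """Add a boolean column that is True when row[field] starts with `prefix`."""
    return [{**row, flag_name: row[field].startswith(prefix)} for row in rows]

# Mirror the parameters used above: name -> Symbol, then flag 'mt-' genes as 'mito'.
var = rename_field(var, "name", "Symbol")
var = flag_prefix(var, "Symbol", "mt-", "mito")
print(var)
```

Note that the prefix match in this sketch is case-sensitive, so a dataset with upper-case mitochondrial gene symbols would need a different prefix.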
AnnData -> Seurat
Let’s get an AnnData object that we can further work on. It’s the object used in many tutorials, so check it out if you’re curious.
Hands-on: Get toy data
- Create a new history for this tutorial
Import the AnnData object from Zenodo
If you do this tutorial just for learning purposes, you can download the downsampled dataset which will be much quicker to process:
https://zenodo.org/record/10391629/files/Downsampled_annotated_AnnData.h5ad
If you want to use the full dataset used in the other single-cell case study tutorials, here it is! Please note, it will take much longer to process it, so we will only show the conversions on the downsampled objects.
https://zenodo.org/record/7053673/files/Mito-counted_AnnData
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
- Rename galaxy-pencil the datasets
Downsampled AnnData object
Check that the datatype is
h5ad
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
h5ad
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Hands-on: Choose Your Own Tutorial
This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Pick the path you want to follow from the options below.
Choose below if you just want to convert your object quickly or see how it all happens behind the scenes!
Quick one step method
Hands-on: SCEasy Converter
- SCEasy Converter ( Galaxy version 0.0.7+galaxy2) with the following parameters:
- “Convert From / To”:
AnnData to Seurat
- param-file “Input object in h5ad,h5 format”:
Downsampled AnnData object
- Rename galaxy-pencil the output:
Seurat object
Manual conversion
Most of our manual conversions involve extracting tables from different data objects and importing them into the target object. First, we will extract observations (cell metadata) and the full matrix from our AnnData.
Hands-on: Inspect AnnData
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Downsampled AnnData object
- “What to inspect?”:
Key-indexed observations annotation (obs)
Rename galaxy-pencil the output
Observations
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Downsampled AnnData object
- “What to inspect?”:
The full data matrix
- Rename galaxy-pencil the output
Matrix
Question: What are the rows, and what are the columns in the retrieved Matrix?
If you just click on the Matrix dataset, you will see a preview showing barcodes in the first column, and genes in the first row.
However, the next tool we need expects a matrix wherein the genes are listed in the first column and the barcodes are listed in the first row. Therefore, we need to transpose the current matrix.
Hands-on: Transpose the matrix
- Transpose ( Galaxy version 1.8+galaxy0) with the following parameters:
- param-file “Input tabular dataset”:
Matrix
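The transposition above simply swaps rows and columns, so that genes end up in the first column and barcodes in the first row. A minimal sketch with the Python standard library, on an invented two-cell table (the exact header layout of the real export is an assumption here):

```python
import csv
import io

# Invented cells x genes table (tab-separated), with barcodes in the first column.
cells_by_genes = "\tGeneA\tGeneB\tGeneC\nAAAC\t1\t0\t3\nTTTG\t0\t2\t0\n"

rows = list(csv.reader(io.StringIO(cells_by_genes), delimiter="\t"))
# zip(*rows) turns every column into a row, header row included.
transposed = [list(col) for col in zip(*rows)]

# Write the genes x cells table back out as tab-separated text.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(transposed)
genes_by_cells = out.getvalue()
print(genes_by_cells)
```

For a full-size matrix this in-memory approach would be slow and memory-hungry, which is one reason to rely on the Galaxy tool instead.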
And now we are ready to input that data into the DropletUtils tool, which will separate this matrix into the cells, genes, and matrix tabular files needed to build a Seurat object.
Hands-on: DropletUtils
- DropletUtils ( Galaxy version 1.10.0+galaxy2) with the following parameters:
- “Format for the input matrix”:
Tabular
- param-file “Count Data”: output of Transpose tool
- “Operation”:
Filter for Barcodes
- “Method”:
DefaultDrops
- “Expected Number of Cells”:
338
- “Upper Quantile”:
1.0
- “Lower Proportion”:
0.0
- “Format for output matrices”:
Bundled (barcodes.tsv, genes.tsv, matrix.mtx)
- “Random Seed”:
100
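To illustrate what the bundled output selected above contains, the sketch below splits an invented genes x cells table into a barcode list, a gene list and sparse matrix triplets in Matrix Market style. It is only an illustration of the file layout: it does not perform the barcode filtering that DropletUtils applies, and all values are made up.

```python
# Invented genes x cells table (header row = barcodes, first column = genes).
table = [
    ["", "AAAC", "TTTG"],
    ["GeneA", 1, 0],
    ["GeneB", 0, 2],
    ["GeneC", 3, 0],
]

barcodes = table[0][1:]                 # would go into barcodes.tsv
genes = [row[0] for row in table[1:]]   # would go into genes.tsv

# Keep only non-zero entries as 1-based (gene, cell, value) triplets.
triplets = [
    (g + 1, c + 1, value)
    for g, row in enumerate(table[1:])
    for c, value in enumerate(row[1:])
    if value != 0
]

# Assemble the text of matrix.mtx: banner, size line, then one line per triplet.
mtx_lines = [
    "%%MatrixMarket matrix coordinate integer general",
    f"{len(genes)} {len(barcodes)} {len(triplets)}",
]
mtx_lines += [f"{g} {c} {v}" for g, c, v in triplets]
matrix_mtx = "\n".join(mtx_lines) + "\n"
print(matrix_mtx)
```

The row and column numbers in the triplets refer to positions in the gene and barcode lists, which is why the three files always have to travel together.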
Finally, let’s combine those files that we have just generated and turn them into the Seurat object!
Hands-on: Create Seurat object
- Seurat Read10x ( Galaxy version 3.2.3+galaxy0) with the following parameters:
- “Choose the format of the input”:
10X-type MTX
- param-file “Expression matrix in sparse matrix format (.mtx)”:
DropletUtils 10X Matrices
- “Gene table”:
DropletUtils 10X Genes
- “Barcode/cell table”:
DropletUtils 10X Barcodes
- param-file “Cell Metadata”:
Observations
- “Choose the format of the output”:
RDS with a Seurat object
- Rename galaxy-pencil the output
Converted Seurat object
As usual, you can check the example history and the dedicated workflow.
AnnData -> SingleCellExperiment (SCE)
We will work on the same AnnData object so if you create a new history for this exercise, you can either get this file from Zenodo again or just copy this dataset from the previous history.
Hands-on: Get toy data
- Create a new history for this section
Import the files from Zenodo
https://zenodo.org/record/10391629/files/Downsampled_annotated_AnnData.h5ad
There are 3 ways to copy datasets between histories
From the original history
- Click on the galaxy-gear icon which is on the top of the list of datasets in the history panel
- Click on Copy Datasets
Select the desired files
Give a relevant name to the “New history”
- Validate by ‘Copy History Items’
- Click on the new history name in the green box that has just appeared to switch to this history
Using the galaxy-columns Show Histories Side-by-Side
- Click on the galaxy-dropdown dropdown arrow top right of the history panel (History options)
- Click on galaxy-columns Show Histories Side-by-Side
- If your target history is not present
- Click on ‘Select histories’
- Click on your target history
- Validate by ‘Change Selected’
- Drag the dataset to copy from its original history
- Drop it in the target history
From the target history
- Click on User in the top bar
- Click on Datasets
- Search for the dataset to copy
- Click on its name
- Click on Copy to current History
Hands-on: Choose Your Own Tutorial
This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Pick the path you want to follow from the options below.
Choose below if you just want to convert your object quickly or see how it all happens behind the scenes!
Quick single tool method
Hands-on: SCEasy Converter
- SCEasy Converter ( Galaxy version 0.0.7+galaxy2) with the following parameters:
- “Convert From / To”:
AnnData to Seurat
- param-file “Input object in h5ad,h5 format”:
Downsampled AnnData object
- Rename galaxy-pencil the output:
Seurat object
- We will use SCEasy again! SCEasy Converter ( Galaxy version 0.0.7+galaxy2) with the following parameters:
- “Convert From / To”:
Seurat to SingleCellexperiment
- param-file “Input object in rds,rdata format”:
Seurat object
- Rename galaxy-pencil the output:
SCE object
Convert manually
First, we will extract observations and the full matrix from our AnnData.
If you are following this entire tutorial (rather than using only the specific section you need!), you can stay in your previous history and just reuse outputs to build different single cell objects!
Hands-on: Inspect AnnData
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Downsampled AnnData object
- “What to inspect?”:
Key-indexed observations annotation (obs)
Rename galaxy-pencil the output
Observations
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Downsampled AnnData object
- “What to inspect?”:
The full data matrix
- Rename galaxy-pencil the output
Matrix
Question: What are the rows, and what are the columns in the retrieved Matrix?
If you just click on the Matrix dataset, you will see a preview showing barcodes in the first column and genes in the first row.
However, the next tool we need expects a matrix wherein the genes are listed in the first column and the barcodes are listed in the first row. Therefore, we need to transpose the current matrix.
Hands-on: Transpose the matrix
- Transpose ( Galaxy version 1.8+galaxy0) with the following parameters:
- param-file “Input tabular dataset”:
Matrix
And now we are ready to input that data into the DropletUtils tool.
Hands-on: DropletUtils
- DropletUtils ( Galaxy version 1.10.0+galaxy2) with the following parameters:
- “Format for the input matrix”:
Tabular
- param-file “Count Data”: output of Transpose tool
- “Operation”:
Filter for Barcodes
- “Method”:
DefaultDrops
- “Expected Number of Cells”:
338
- “Upper Quantile”:
1.0
- “Lower Proportion”:
0.0
- “Format for output matrices”:
Bundled (barcodes.tsv, genes.tsv, matrix.mtx)
- “Random Seed”:
100
Finally, let’s combine those files that we have just generated and turn them into the SingleCellExperiment!
Hands-on: Create SCE object
- DropletUtils Read10x ( Galaxy version 1.0.4+galaxy0) with the following parameters:
- param-file “Expression matrix in sparse matrix format (.mtx)”:
DropletUtils 10X Matrices
- param-file “Gene table”:
DropletUtils 10X Genes
- param-file “Barcode/cell table”:
DropletUtils 10X Barcodes
- “Should metadata file be added?”: param-toggle
Yes
- param-file “Metadata file”:
Observations
- “Cell ID column”:
index
- Rename galaxy-pencil the output
Converted SCE object
As usual, you can check the example history and the dedicated workflow.
AnnData -> Cell Data Set (CDS)
The Cell Data Set (CDS) format is usually used when working with a package called Monocle3 (cole-trapnell-lab). Below we show two methods for transforming AnnData into a CDS object, one of which creates an input file for the Monocle3 tutorial.
Hands-on: Choose Your Own Tutorial
This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Pick the path you want to follow from the options below.
You can choose whether you want just to transform AnnData to CDS or to create a CDS input file for the Monocle3 tutorial and proceed with the downstream analysis described in that tutorial. Please note that, depending on your dataset, you might need to use the first method, which uses both annotated and unprocessed matrices. If you did some pre-processing on your AnnData object, you might need to choose the first method anyway: Monocle3 performs its own pre-processing, so we also need an unprocessed expression matrix alongside the annotated (pre-processed) AnnData. That method is more detailed and specific, while the general one just shows the main principle of the conversion.
CDS input for Monocle3 tutorial (use for pre-processed data)
The dedicated tutorial shows how to perform trajectory analysis using Monocle3, which is the next step in the single-cell case study tutorial series, right after the pre-processing tutorial and analysing the metadata. To keep the continuity of the series, we will continue to work on the case study data from a mouse model of fetal growth restriction (Bacon et al. 2018; see the study in Single Cell Expression Atlas and the project submission). After successfully completing this section with the mentioned dataset, you can use it directly in the Monocle3 tutorial workflow. If you work on your own data, you might also need to follow this method since it shows how to deal with already pre-processed datasets. Since Monocle3 performs its own pre-processing, you will need both annotated and unprocessed matrices.
Get data
Monocle3 works great with annotated data, so we will make use of our annotated AnnData object, generated in the previous tutorial. We will also need a ‘clean’ expression matrix, extracted from the AnnData object just before we started the processing. You have two options for uploading these datasets. Importing via history is often faster.
Hands-on: Option 1: Data upload - Import history
Import history from: input history
- Open the link to the shared history
- Click on the Import this history button on the top left
- Enter a title for the new history
- Click on Copy History
Rename galaxy-pencil the history to your name of choice.
Hands-on: Option 2: Data upload - Add to history
- Create a new history for this tutorial
Import the AnnData objects from Zenodo
https://zenodo.org/records/7078524/files/AnnData_before_processing.h5ad
https://zenodo.org/records/7078524/files/Annotated_AnnData.h5ad
Check that the datatype is
h5ad
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
h5ad
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Extracting annotations
To run Monocle, we need cell metadata, gene metadata, and an expression matrix file of genes by cells. (In theory, the expression matrix alone could do, but then we wouldn’t have all those useful annotations that we worked on so hard in the previous tutorials!). In order to get these files, we will extract the gene and cell annotations from our AnnData object.
Question: How many lines do you expect to be in the gene and cell metadata files?
If you click on the step with uploaded annotated AnnData file, you will see on a small preview that this object has 8605 observations and 15395 variables, so we expect to get a cell metadata file with 8605 lines and gene metadata file with 15395 lines (without headers of course!).
Hands-on: Extracting annotations
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Annotated_AnnData
- “What to inspect?”:
Key-indexed observations annotation (obs)
Rename galaxy-pencil the observations annotation
Extracted cell annotations (obs)
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Annotated_AnnData
- “What to inspect?”:
Key-indexed annotation of variables/features (var)
- Rename galaxy-pencil the annotation of variables
Extracted gene annotations (var)
Quick and easy, isn’t it? However, we need to make some minor changes before we can input these files into the Monocle toolsuite.
Cell metadata
Our current dataset is not just T-cells: as you might remember from the last tutorial, we identified a cluster of macrophages as well. This might be a problem, because the trajectory algorithm will try to find relationships between all the cells (even if they are not necessarily related!), rather than only the T-cells that we are interested in. We need to remove those unwanted cell types to make the analysis more accurate.
The Manipulate AnnData tool allows you to filter observations or variables, and that would be the most obvious way to remove those cells. However, given that we don’t need an AnnData object, it’s a lot quicker to edit a table rather than manipulate an AnnData object. Ultimately, we need cell metadata, gene metadata and expression matrix files that have macrophages removed, and that have the correct metadata that Monocle looks for. With some table manipulation, we’ll end up with three separate files, ready to be passed onto Monocle3.
Question: Where is the information about cell types stored?
We have already extracted the cell annotations file - in one of the columns you can find the information about cell type, assigned to each cell.
Click on the Extracted cell annotations (obs) file to see a small preview window. This shows you that the column containing the cell types is number 22. We’ll need that to filter out unwanted cell types!
Warning: Check the column number!
If you are working on a different dataset, the number of the ‘cell_type’ column might be different, so make sure you check it on a preview and use the correct number!
Hands-on: Filter out macrophages
- Filter with the following parameters:
- param-file “Filter”:
Extracted cell annotations (obs)
- “With following condition”:
c22!='Macrophages'
- “Number of header lines to skip”:
1
- That’s it - our cell annotation file is ready for Monocle! Let’s rename it accordingly.
Rename galaxy-pencil the output:
Cells input data for Monocle3
- c22 means column no. 22 - that’s the column with cell types, and it will be filtered for the macrophages
- != means ‘not equal to’ - we want to keep the cell types which ARE NOT macrophages
It might happen that during clustering you’ll find another cell type that you want to get rid of for the trajectory analysis. Then simply re-run this tool on the already filtered file and change ‘Macrophages’ to another unwanted cell type.
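The Filter step above keeps every row whose cell type column is not 'Macrophages', while leaving the header line untouched. Here is a minimal sketch of the same logic in plain Python. The table is invented and much narrower than the real one, so the cell type sits in column 3 instead of column 22, and `filter_rows` is a hypothetical helper.

```python
import csv
import io

# Invented cell metadata; the cell type is in column 3 to keep the example small.
obs_tsv = (
    "index\tn_genes\tcell_type\n"
    "AAAC\t900\tT-cell\n"
    "CCGT\t1200\tMacrophages\n"
    "TTTG\t850\tT-cell\n"
)

def filter_rows(text, column, unwanted, header_lines=1):
    """Keep header lines plus every row whose 1-based `column` differs from `unwanted`."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[:header_lines], rows[header_lines:]
    return header + [r for r in body if r[column - 1] != unwanted]

kept = filter_rows(obs_tsv, column=3, unwanted="Macrophages")
for row in kept:
    print("\t".join(row))
```

Calling `filter_rows` again on the result with a different `unwanted` value corresponds to re-running the tool on the already filtered file.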
Gene annotations
Sometimes certain functionalities require a specific indication of where the data should be taken from. Monocle3 tools expect that the genes column is named ‘gene_short_name’. Let’s check what the name of that column is in our dataset currently.
Question
- Where can you check the header of a column containing gene names?
- What is the name of this column?
- Our extracted gene annotations file! Either by clicking on the eye icon or by having a look at the small preview window.
- In our dataset the gene names are stored in a column called ‘Symbol’ - we need to change that!
Let’s click on the Extracted gene annotations (var) file to see a small preview. We can see that the gene names are in the third column with a header Symbol. Keep that in mind - we’ll use that in a second!
Hands-on: Changing the column name
- Column Regex Find And Replace ( Galaxy version 1.0.2) with the following parameters:
- param-file “Select cells from”:
Extracted gene annotations (var)
- “using column”:
c3 or Column: 3
- In “Check”:
- param-repeat “Insert Check”
- “Find Regex”:
Symbol
- “Replacement”:
gene_short_name
Check that the datatype is
tabular
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
tabular
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
- Voila! That’s the gene input for Monocle! Just a quick rename…
- Rename galaxy-pencil the output:
Genes input data for Monocle3
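The renaming step above is a find-and-replace restricted to one column. A minimal sketch in plain Python, on an invented gene table with the symbol in the third column (`replace_in_column` is a hypothetical helper, not the tool's code):

```python
import re

# Invented gene annotation table; the gene symbol is in the third column.
gene_tsv = "index\tID\tSymbol\nENSG_1\tENSG_1\tGeneA\nENSG_2\tENSG_2\tGeneB\n"

def replace_in_column(text, column, pattern, replacement):
    """Apply a regex substitution to one 1-based column of tab-separated text."""
    out = []
    for line in text.splitlines():
        fields = line.split("\t")
        fields[column - 1] = re.sub(pattern, replacement, fields[column - 1])
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"

renamed = replace_in_column(gene_tsv, 3, r"^Symbol$", "gene_short_name")
print(renamed)
```

The anchored pattern `^Symbol$` is a choice made for this sketch, so that a gene whose name merely contains the word would be left alone. The tutorial step itself uses the plain pattern `Symbol`.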
Expression matrix
Last, but not least! And in fact, the most important! The expression matrix contains all the values representing the expression level of a particular gene in a cell. This is why in theory the expression matrix is the only input file required by Monocle3. Without annotation files the CDS data can still be generated - it will be quite bare and rather unhelpful for interpretation, but it’s possible to process. So, the values in the expression matrix are just numbers. But do you remember that we have already done some processing such as normalisation and the calculation of principal components in the AnnData object in the previous tutorial? That affected our expression matrix. Preprocessing is one of the steps in the Monocle3 workflow, so we want to make sure that the calculations are done on a ‘clean’ expression matrix. If we apply too many operations on our raw data, it will be too ‘deformed’ to be reliable. The point of the analysis is to use algorithms that make the enormous amount of data understandable in order to draw meaningful conclusions in accordance with biology. So how do we do that?
Question
- How many cells and genes are there in the AnnData_before_processing file?
- How many lines are there in Cells input data for Monocle3?
- How many lines are there in Genes input data for Monocle3?
You can answer all the questions just by clicking on the given file and looking at the preview window.
- [n_obs x n_vars] = 31178 x 35734, so there are 31178 cells and 35734 genes.
- 8570 lines, including a header, which makes 8569 cells.
- 15396 lines, including a header, which makes 15395 genes.
As you can see, there are way more genes and cells in the unprocessed AnnData file, so the expression matrix is much bigger than we need it to be. If the genes and cells we prepared for Monocle3 are not the same as in the expression matrix, Monocle3 will crash. Therefore, we have to filter that big, clean matrix and adjust it to our already prepared genes and cells files. But first, let’s extract the matrix from the unprocessed AnnData object.
Hands-on: Extracting matrix
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
AnnData_before_processing
- “What to inspect?”:
The full data matrix
- Rename galaxy-pencil the output:
Unprocessed expression matrix
If you have a look at the preview of Unprocessed expression matrix
, you’ll see that the first column contains the cell barcodes, while the first row - the gene IDs. We would like to keep only the values corresponding to the cells and genes that are included in Cells input data for Monocle3
and Genes input data for Monocle3
. How do we do it? First, we compare the cell barcodes from Cells input data for Monocle3
to those in Unprocessed expression matrix
and ask Galaxy to keep the values of the matrix for which the barcodes in both files are the same. Then, we’ll do the same for gene IDs. We will cut the first columns from Cells input data for Monocle3
and Genes input data for Monocle3
to be able to compare those columns side by side with the matrix file.
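To make the idea of this filtering concrete, here is a minimal Python sketch of the same logic on a toy table. It is only an illustration, not what the Galaxy tools run internally: the function name `filter_rows_by_ids`, the toy barcodes and gene names, and the assumption that both files are tab-separated with a single header line are all made up for this example.

```python
import csv
import io


def filter_rows_by_ids(matrix_tsv: str, ids_tsv: str) -> str:
    """Keep the header plus every matrix row whose first field is in ids_tsv."""
    # Collect the IDs from the first column, skipping the header line.
    id_rows = list(csv.reader(io.StringIO(ids_tsv), delimiter="\t"))
    wanted = {row[0] for row in id_rows[1:] if row}

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    reader = csv.reader(io.StringIO(matrix_tsv), delimiter="\t")
    for line_number, row in enumerate(reader):
        # Line 0 is the header (gene IDs); later rows are kept only on a match.
        if line_number == 0 or (row and row[0] in wanted):
            writer.writerow(row)
    return out.getvalue()


# Hypothetical toy data: three cells, two genes; two barcodes survive.
matrix = "index\tGeneA\tGeneB\nAAAC\t0\t3\nCCGT\t1\t0\nGGTA\t5\t2\n"
cell_ids = "index\nAAAC\nGGTA\n"
filtered = filter_rows_by_ids(matrix, cell_ids)
print(filtered)
```

Because the sketch writes only the matrix rows, it does not produce a repeated ID column; the rows also stay in the order of the matrix rather than the order of the ID file.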
Hands-on: Cutting out the columns
- Cut with the following parameters:
- “Cut columns”:
c1
- param-file “From”:
Cells input data for Monocle3
- Rename galaxy-pencil the output:
Cells IDs
- Cut with the following parameters:
- “Cut columns”:
c1
- param-file “From”:
Genes input data for Monocle3
- Rename galaxy-pencil the output:
Genes IDs
Hands-on: Filter matrix values by cell barcodes
- Join two Datasets with the following parameters:
- param-file “Join”:
Cells IDs
- “using column”:
c1
or Column: 1
- param-file “with”:
Unprocessed expression matrix
- “and column”:
c1
or Column: 1
- “Keep lines of first input that do not join with second input”:
Yes
- “Keep lines of first input that are incomplete”:
Yes
- “Fill empty columns”:
No
- “Keep the header lines”:
Yes
- Rename galaxy-pencil the output:
Pre-filtered matrix (by cells)
Look at the preview of the output file. First of all, you can see that there are 8570 lines (8569 cells) instead of the 31178 cells that were present in the matrix. That’s exactly what we wanted to achieve: now we have raw information for the T-cells that we have filtered. However, the join left us with a matrix whose first and second columns are the same, so let’s get rid of one of those!
Hands-on: Remove duplicate column (cells IDs)
- Advanced Cut ( Galaxy version 1.1.0) with the following parameters:
- param-file “File to cut”:
Pre-filtered matrix (by cells)
- “Operation”:
Discard
- “Cut by”:
fields
- “List of Fields”:
c1
- Rename galaxy-pencil the output:
Filtered matrix (by cells)
Now we will perform the same steps, but for gene IDs. But gene IDs are currently in the first row, so we need to transpose the matrix, and from there we can repeat the same steps as above for Gene IDs.
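As a rough illustration of what transposing does to the table, the following Python sketch flips a small tab-separated matrix so that gene IDs move from the first row to the first column. The function name `transpose_tsv` and the toy values are hypothetical, and the sketch assumes every row has the same number of fields.

```python
import csv
import io


def transpose_tsv(text: str) -> str:
    """Swap rows and columns of a tab-separated table held in memory."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # zip(*rows) yields the first field of every row, then the second, etc.
    writer.writerows(zip(*rows))
    return out.getvalue()


# Hypothetical toy matrix: cells as rows, genes as columns.
cells_by_genes = "index\tGeneA\tGeneB\nAAAC\t0\t3\nGGTA\t5\t2\n"
genes_by_cells = transpose_tsv(cells_by_genes)
print(genes_by_cells)
```

After the transpose, each line starts with a gene ID, so the same first-column filtering used for the cell barcodes can be applied to genes.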
Hands-on: Filter matrix by gene IDs
- Transpose ( Galaxy version 1.1.0+galaxy2) with the following parameters:
- param-file “Input tabular dataset”:
Filtered matrix (by cells)
- The matrix is now ready to be filtered by gene IDs!
- Join two Datasets with the following parameters:
- param-file “Join”:
Genes IDs
- “using column”:
c1
or Column: 1
- param-file “with”: output of Transpose tool
- “and column”:
c1
or Column: 1
- “Keep lines of first input that do not join with second input”:
Yes
- “Keep lines of first input that are incomplete”:
Yes
- “Fill empty columns”:
No
- “Keep the header lines”:
Yes
- Advanced Cut ( Galaxy version 1.1.0) with the following parameters:
- param-file “File to cut”: output of Join two Datasets tool
- “Operation”:
Discard
- “Cut by”:
fields
- “List of Fields”:
c1
- Monocle3 requires a matrix in which rows are genes and columns are cells. That is what we’ve got, so there is no need to transpose the matrix again. The expression matrix is ready! Let’s just rename it…
- Rename galaxy-pencil the output:
Expression matrix for Monocle3
Finally! We have prepared all the files to pass on to the Monocle3 workflow!
Creating CDS object
Monocle3 turns the expression matrix, cell and gene annotations into an object called cell_data_set (CDS), which holds single-cell expression data.
Here is what Monocle3 documentation says about the required three input files:
- expression_matrix: a numeric matrix of expression values, where rows are genes, and columns are cells. Must have the same number of columns as the cell_metadata has rows and the same number of rows as the gene_metadata has rows.
- cell_metadata: a data frame, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
- gene_metadata: a data frame, where rows are features (e.g. genes), and columns are gene attributes, such as biotype, gc content, etc. One of its columns should be named “gene_short_name”, which represents the gene symbol or simple name (generally used for plotting) for each gene.
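The requirements above can be checked before creating the CDS object. The sketch below is one possible sanity check in plain Python, not part of Monocle3 or Galaxy: `check_cds_inputs` is a hypothetical helper, and it assumes that all three tables are tab-separated, have a header line, and carry the IDs in their first column. It is also stricter than the documentation quoted above, because it expects the IDs to appear in the same order in the matrix and in the metadata.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def check_cds_inputs(matrix_tsv, cells_tsv, genes_tsv):
    """Return a list of problems; an empty list means the tables line up."""
    matrix = read_tsv(matrix_tsv)
    cells = read_tsv(cells_tsv)
    genes = read_tsv(genes_tsv)

    matrix_cells = matrix[0][1:]                   # header minus corner field
    matrix_genes = [row[0] for row in matrix[1:]]  # first column minus header
    cell_ids = [row[0] for row in cells[1:]]
    gene_ids = [row[0] for row in genes[1:]]

    problems = []
    if matrix_cells != cell_ids:
        problems.append("matrix columns do not match cell metadata rows")
    if matrix_genes != gene_ids:
        problems.append("matrix rows do not match gene metadata rows")
    if "gene_short_name" not in genes[0]:
        problems.append("gene metadata has no gene_short_name column")
    return problems


# Hypothetical toy inputs: two genes (rows) by two cells (columns).
matrix = "index\tAAAC\tGGTA\nGeneA\t0\t5\nGeneB\t3\t2\n"
cells = "index\tcell_type\nAAAC\tDP\nGGTA\tDN\n"
genes = "index\tgene_short_name\nGeneA\tCd4\nGeneB\tCd8a\n"
print(check_cds_inputs(matrix, cells, genes))
```

An empty list means the three toy tables are consistent; each mismatch adds one message describing what to fix.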
Hands-on: Create CDS object
You can provide the expression matrix as a TSV, CSV, MTX or RDS file, and the gene and cell metadata as TSV, CSV or RDS files. In our case all three files are tabular, so we will set the format to TSV.
- Monocle3 create ( Galaxy version 0.1.4+galaxy2) with the following parameters:
- param-file “Expression matrix, genes as rows, cells as columns. Required input. Provide as TSV, CSV or RDS.”:
Expression matrix for Monocle3
- “Format of expression matrix”:
TSV
- param-file “Per-cell annotation, optional. Row names must match the column names of the expression matrix. Provide as TSV, CSV or RDS.”:
Cells input data for Monocle3
- “Format of cell metadata”:
TSV
- param-file “Per-gene annotation, optional. Row names must match the row names of the expression matrix. Provide as TSV, CSV or RDS.”:
Genes input data for Monocle3
- “Format of gene annotation”:
TSV
- Rename galaxy-pencil the output:
CDS input for Monocle3 tutorial
It was quite a long conversion, but we did it! If you’re interested, the “Tip” below describes how we could possibly speed up the process using alternative tools.
Generally, in coding there are different ways to achieve the same result. Similarly, in Galaxy you can use different tools and the outcome will be the same. For example, we could have used some tools on our starting AnnData object to remove macrophages and rename the column header, and afterwards extracted the observations and variables. It is good practice to use methods from dedicated libraries (such as Scanpy below) rather than manipulating the rows and columns of text files. The approach shown above works well, but it might be problematic with much bigger datasets. In that case, you can use the following route:
- Scanpy FilterCells ( Galaxy version 1.8.1+galaxy93) on annotated Anndata to remove macrophages.
- AnnData Operations ( Galaxy version 1.8.1+galaxy92) to rename the gene symbol column from Symbol to gene_short_name.
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) to extract genes and cells (to be used for Monocle).
- Follow further steps to filter unprocessed matrix.
You’ve probably used the mentioned tools before, so challenge yourself and try to replicate the process using the tools above!
You might want to compare your results with this control history, or check out the workflow. And now you have your input file ready to start the Monocle3 tutorial!
General AnnData to CDS conversion
In fact, you can do this conversion in just one step: check the Tip below!
As mentioned in the previous section, the SCEasy tool can do this conversion in a single step. Let’s try that out!
Hands-on: SCEasy Converter
- SCEasy Converter ( Galaxy version 0.0.7+galaxy2) with the following parameters:
- “Convert From / To”:
AnnData to CellDataSet
- param-file “Input object in h5ad,h5 format”:
Downsampled AnnData object
- Rename galaxy-pencil the output:
CDS object
Below we will also show the manual conversion, just to make you familiar with the structure of the files and the operations needed. We will continue working on the previously used dataset, so you can copy it from your history or download it from Zenodo.
Hands-on: Get toy data again
Create a new history for this tutorial
You can either copy the previously used dataset from your history:
There are 3 ways to copy datasets between histories
From the original history
- Click on the galaxy-gear icon which is on the top of the list of datasets in the history panel
- Click on Copy Datasets
Select the desired files
Give a relevant name to the “New history”
- Validate by ‘Copy History Items’
- Click on the new history name in the green box that has just appeared to switch to this history
Using the galaxy-columns Show Histories Side-by-Side
- Click on the galaxy-dropdown dropdown arrow top right of the history panel (History options)
- Click on galaxy-columns Show Histories Side-by-Side
- If your target history is not present
- Click on ‘Select histories’
- Click on your target history
- Validate by ‘Change Selected’
- Drag the dataset to copy from its original history
- Drop it in the target history
From the target history
- Click on User in the top bar
- Click on Datasets
- Search for the dataset to copy
- Click on its name
- Click on Copy to current History
Or, alternatively, download the dataset from Zenodo
https://zenodo.org/record/10391629/files/Downsampled_annotated_AnnData.h5ad
- Rename galaxy-pencil the datasets
Downsampled AnnData object
Check that the datatype is
h5ad
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
h5ad
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Now we just need to extract information about cells, genes and an expression matrix.
Hands-on: Inspect AnnData
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Downsampled AnnData object
- “What to inspect?”:
Key-indexed observations annotation (obs)
Rename galaxy-pencil the output
Cell barcodes (obs)
.
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Downsampled AnnData object
- “What to inspect?”:
Key-indexed annotation of variables/features (var)
Rename galaxy-pencil the output
Genes (var)
.
- Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
- param-file “Annotated data matrix”:
Downsampled AnnData object
- “What to inspect?”:
The full data matrix
Rename galaxy-pencil the output
Expression matrix
.
Hold on here! As mentioned, if you’re converting your files to CDS, you’ll probably be working with Monocle. There is one function in downstream analysis in Monocle that requires a specific name of the column containing gene symbols, and that is gene_short_name
. If you run the analysis with Galaxy tools, you won’t be able to change that name after you create the CDS file, so it is best to rename it at this stage. There is no harm in doing this, and it might save you some time and frustration later on. You only need to check which column contains the gene symbols and what its header is. You can check that in the preview window, simply by clicking on the Genes (var)
dataset. In our case, that’s column 3 and its name is Symbol
. Let’s change that!
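For readers who prefer to see the renaming as code, here is a small Python sketch of the same idea on a toy gene table. The helper `rename_header` and the toy gene IDs are hypothetical; the sketch assumes a tab-separated table whose first line is the header.

```python
def rename_header(tsv_text: str, old: str, new: str) -> str:
    """Rename one column header, leaving every data line untouched."""
    # Split off the first line; `sep` is empty if the text has no newline.
    header, sep, body = tsv_text.partition("\n")
    fields = [new if field == old else field for field in header.split("\t")]
    return "\t".join(fields) + sep + body


# Hypothetical toy gene table with the symbols in column 3.
genes = "index\tID\tSymbol\nENSMUSG01\tENSMUSG01\tCd4\n"
renamed = rename_header(genes, "Symbol", "gene_short_name")
print(renamed)
```

The sketch edits only the header line, so a data row that happened to contain the text `Symbol` would be left alone.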
Hands-on: Changing the column name
- Column Regex Find And Replace ( Galaxy version 1.0.2) with the following parameters:
- param-file “Select cells from”:
Genes (var)
- “using column”:
c3
or Column: 3
- In “Check”:
- param-repeat “Insert Check”
- “Find Regex”:
Symbol
- “Replacement”:
gene_short_name
Check that the datatype is
tabular
. If not, change it.
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
tabular
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
- Rename galaxy-pencil the output:
Genes renamed
We’re almost there, but there is one last modification we have to make: transpose the matrix so that genes are rows and cells are columns.
Hands-on: Transpose the matrix
- Transpose ( Galaxy version 1.8+galaxy1) with the following parameters:
- param-file “Input tabular dataset”:
Expression matrix
And the final step is to create the CDS file using the Monocle3 create tool!
Hands-on: Create Cell Data Set
- Monocle3 create ( Galaxy version 0.1.4+galaxy2) with the following parameters:
- param-file “Expression matrix, genes as rows, cells as columns. Required input. Provide as TSV, CSV or RDS.”: output of Transpose tool
- “Format of expression matrix”:
TSV
- param-file “Per-cell annotation, optional. Row names must match the column names of the expression matrix. Provide as TSV, CSV or RDS.”:
Cell barcodes (obs)
- “Format of cell metadata”:
TSV
- param-file “Per-gene annotation, optional. Row names must match the row names of the expression matrix. Provide as TSV, CSV or RDS.”:
Genes renamed
- “Format of gene annotation”:
TSV
- Rename galaxy-pencil the output:
CDS Monocle file
As usual, you can check the example history and the dedicated workflow (it doesn’t include the step on renaming the column header though).
Downsampling FASTQ files
Sometimes, it is useful to work on smaller subsets of data (especially for teaching / learning purposes). Here is an example of how you can downsample your FASTQ files. First, let’s get some toy data. We just need two FASTQ files - one containing barcodes, the other with transcripts.
Hands-on: Get toy data
- Create a new history for this section “Downsampling FASTQ Files”
Import the files from Zenodo
https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_1.fq-400k.fastq
https://zenodo.org/record/4574153/files/SLX-7632.TAAGGCGA.N701.s_1.r_2.fq-400k.fastq
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Funnily enough, those files are already downsampled, so you won’t have to wait too long to download them. We are not going to analyse that data anyway; it’s just for demonstration purposes. Quickly check which file contains barcodes and which file contains transcripts. If you click on the two datasets, you will see that one has shorter sequences, while the other has longer ones. It’s quite straightforward to deduce that the shorter sequences are barcodes.
Hands-on: Rename the files
Rename file
s_1.r_1
as Barcodes
Rename file
s_1.r_2
as Transcripts
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
Now we will convert the FASTQ files to tabular:
Hands-on: FASTQ to tabular
- FASTQ to Tabular ( Galaxy version 1.1.5) with the following parameters:
- param-file “FASTQ file to convert”: param-files Select multiple files:
Barcodes
and Transcripts
- Rename galaxy-pencil the datasets
Barcodes tabular
and Transcripts tabular
Now let’s select the number of reads we would like to keep. It’s totally up to you; we chose 100000
here.
Hands-on: Downsampling
- Select last ( Galaxy version 1.1.0) with the following parameters:
- param-file “Text file”: param-files Select multiple files:
Barcodes tabular
and Transcripts tabular
- “Operation”:
Keep last lines
- “Number of lines”:
100000
- Rename galaxy-pencil the datasets
Barcodes cut
and Transcripts cut
All done, now we just need to go back to FASTQ from Tabular again!
Hands-on: Tabular to FASTQ
- Tabular to FASTQ ( Galaxy version 1.1.5) with the following parameters:
- param-file “Tabular file to convert”: param-files Select multiple files:
Barcodes cut
and Transcripts cut
(outputs of Select last tool)
- “Identifier column”:
c1
or Column 1
- “Sequence column”:
c2
or Column 2
- “Quality column”:
c3
or Column 3
- Rename galaxy-pencil the datasets
Downsampled barcode read
and Downsampled transcript read
And that’s all! Your downsampled data is ready to use. You can check your answers in this example history or if you want to accelerate this process, feel free to use the workflow next time!
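The three steps above (FASTQ to tabular, keep the last lines, tabular back to FASTQ) can also be sketched in a few lines of Python. This is only an illustration of the idea and not what the Galaxy tools execute: `last_n_records` is a hypothetical helper, the reads are toy data, and the sketch assumes the simple FASTQ layout of exactly four lines per read.

```python
from collections import deque


def last_n_records(fastq_text: str, n: int) -> str:
    """Return the last n reads of a FASTQ file given as a string."""
    lines = fastq_text.splitlines()
    if len(lines) % 4:
        raise ValueError("expected exactly 4 lines per FASTQ record")
    # Group the lines into records: identifier, sequence, '+', qualities.
    records = (lines[i:i + 4] for i in range(0, len(lines), 4))
    # A deque with maxlen=n silently discards everything but the last n.
    kept = deque(records, maxlen=n)
    return "".join("\n".join(record) + "\n" for record in kept)


# Hypothetical toy reads; both files list the reads in the same order.
barcodes = "@r1\nAAAA\n+\nIIII\n@r2\nCCCC\n+\nIIII\n@r3\nGGGG\n+\nIIII\n"
transcripts = (
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nTTTTACGT\n+\nIIIIIIII\n"
    "@r3\nGGGGACGT\n+\nIIIIIIII\n"
)
barcodes_small = last_n_records(barcodes, 2)
transcripts_small = last_n_records(transcripts, 2)
print(barcodes_small)
```

Taking the same number of reads from the end of both files keeps each barcode paired with its transcript, provided the two files list the reads in the same order.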