In this tutorial we will use OpenRefine tool to clean occurrence records retrieved from GBIF.
This tutorial is based on the GBIF Laure Russell 2021 “Biodiversity data mobilization course” accessible on GBIF website on Use case 1 “Herbarium Specimens” using file from Exercise 3C, at the bottom of the page, reachable through this direct URL.
Retrieve data from GBIF “Biodiversity data mobilization course”
Hands On: Data import and project creation
Create a new history for this tutorial and give it a name (example: “OpenRefine tutorial”) for you to find it again later if needed.
To create a new history simply click the new-history icon at the top of the history panel:
Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
Type the new name
Click on Save
To cancel renaming, click the galaxy-undo “Cancel” button
If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:
Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
Faceting is a feature that will allow us to get a big picture overview of the data, and to filter down to just the subset of rows
that we want to change or view in bulk. It facilitates the use and analysis of data and can be done with cells containing any
kind of text, numbers and dates.
Hands On: Hands-on: Faceting and mass editing
Go to column “kingdom”, and then click on the column menu and follow the route to Text facet.
On the left a window with the name of the column will appear, that is the facet.
Click on count to sort by count, then click on name to sort alphabetically
Fix the spelling mistakes (Plante -> Plantae). Place the cursor over the text in the window and click on edit, then fix the error in the text box, and to save click on apply.
Go to “Country col.” column and click on column menu and perform a Text Facet
On a quick view, the country appears to be spelled correctly, but the facet shows three different values due to the extra spaces at the end of the text.
Fix the error from the column menu on “Country col.” column, following the route Edit Cells > Common transforms > Trim leading and trailing whitespace.
You will see a notification message “Text transform on 38 cells in column Country col.: value.trim()” and the possibility to come back on previous state clicking on the undo hyperlink.
Now check the facet window; only one value will remain.
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression
Matches
abc
an occurrence of abc within your data
(abc|def)
abcordef
[abc]
a single character which is either a, b, or c
[^abc]
a character that is NOT a, b, nor c
[a-z]
any lowercase letter
[a-zA-Z]
any letter (upper or lower case)
[0-9]
numbers 0-9
\d
any digit (same as [0-9])
\D
any non-digit character
\w
any alphanumeric character
\W
any non-alphanumeric character
\s
any whitespace
\S
any non-whitespace character
.
any character
\.
{x,y}
between x and y repetitions
^
the beginning of the line
$
the end of the line
Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.
Examples
Regular expression
matches
\d{4}
4 digits (e.g. a year)
chr\d{1,2}
chr followed by 1 or 2 digits
.*abc$
anything with abc at the end of the line
^$
empty line
^>.*
Line starting with > (e.g. Fasta header)
^[^>].*
Line not starting with > (e.g. Fasta sequence)
Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement, we do this using capture groups (...), which we can refer to using \1, \2 etc for the first and second captured values. If you want to refer to the whole match, use &.
Regular expression
Input
Captures
chr(\d{1,2})
chr14
\1 = 14
(\d{2}) July (\d{4})
24 July 1984
\1 = 24, \2 = 1984
An expression like s/find/replacement/g indicates a replacement expression, this will search (s) for any occurrence of find, and replace it with replacement. It will do this globally (g) which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifier such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip:RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip:Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip:Cyrilex is a visual regular expression tester.
Keep in mind that the correct counties are: “Flores”, “La Libertad”, “Melchor de Mencos”, “San Andres” and “San Jose”.
On the top right of the facet window click on Cluster, a new window will appear.
Click on the Cluster button from this new window.
Now you can see information about the clusters:
“Cluster size”: the number of different versions that the clustering algorithm believes to be the same.
“Row count”: the number of records with any of the cluster values.
“Values in cluster”: the actual values that the algorithm believes to be the same. There is also the number of records with each particular value, and the possibility to browse the contents of the cluster in a different tab.
“Merge?”: check if values are to be merged into a single standard value.
“New cell value”: the value to be applied to every record in the cluster. By default, it is the value with most records. You can also click on any value to apply that to the New cell value.
Comment
If you want to know more about clustering have a look at the manual
Click on Select All and then on Merge Selected & close
You will see a notification message “Mass edit 119 cells in column County”.
To fix the remaining counties go again to Cluster in the facet window of Count.
In the Cluster and edit window, go to Keying Function, then select ngram-fingerprint
set “1” as the value in n-Gram Size.
Press the Cluster button, you normally see a cluster about “San Andres” of size “4”.
Click on Select All and then on Merge Selected & close, you will see a notification message “Mass edit 360 cells in column County”.
Hands On: hands_on Hands-on: Exporting cleaned file into your Galaxy history.
On the upper right corner click on Export and select Galaxy exporter.
A notification message as “Dataset has been exported to Galaxy, please close this tab” is displayed.
You normally have your resulting data file exported on your Galaxy history as “openrefine-Galaxt file.tsv” dataset.
Comment
You can also download directly file on your computer choosing Export and Custom tabular exporter, allowing you to select notably columns and rows you want to download.
Reconciliation matches the information in one of your columns to an outside database. This is particularly helpful when it
comes to name validation, as it proves the name you have exists somewhere else. This is a really useful service, but can be
time consuming. In this case we will go through the process with only three records using the API from GBIF.
Hands On: hands_on Hands-on: Higher taxonomy.
Go to “Collector” column, then make a Text facet. Select the collector “Elsa P”.
Under “Full name”, click on column menu and then Edit column > Add column by fetching URLs…
call the new column “Api_name”
Change the Throttle Delay to 250 and paste the expression http://api.gbif.org/v1/species/match?verbose=true&name="+escape(value,'url')
Click ok and wait, this might take some time depending on internet connection and the number of taxa.
Go to “Api_name”, click on column menu and then Edit column > Add column based on this column....
Call the new column “higherClassification” and paste the expression:
You will see the Kingdom, Phylum, Class, Order and family of each taxon.
Under “higherClassification” follow the route Edit column > Split into several columns…, leave the initial settings.
Now you know how to obtain the taxonomic categories of a given taxon if this is available in the GBIF API. Column names can be edited in Edit column > Rename this column.
For the purpose of the original GBIF workshop, the columns created in this exercise (Higher taxonomy) must be deleted.
Under All, which is the first column, go to Edit columns > Re-order / remove columns….
Here you learned how to use OpenRefine tool from Galaxy platform to clean Biodiversity data. This tutorial notably allowed you to apply some basic but powerfull functionnalities of OpenRefine to clean your data.
You've Finished the Tutorial
Please also consider filling out the Feedback Form as well!
Key points
OpenRefine is a powerful tool, with many functionalities to check, clean and enrich your data
Frequently Asked Questions
Have questions about this tutorial? Have a look at the available FAQ pages and support channels
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{ecology-openrefine_gbif,
author = "Yvan Le Bras and Sophie Pamerlon and Laura Russell",
title = "Cleaning GBIF data using OpenRefine (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/ecology/tutorials/openrefine_gbif/tutorial.html}",
note = "[Online; accessed Thu Feb 20 2025]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol}
}
Funding
These individuals or organisations provided funding support for the development of this resource