Cleaning GBIF data for use in Ecology

Author(s): Yvan Le Bras, Simon Benateau
Reviewers: Helena Rasche, Saskia Hiltemann, Yvan Le Bras
Overview
Questions:
  • How can I get ecological data from GBIF?

  • How do I check and clean the data from GBIF?

  • Which ecoinformatics techniques are important to know for this type of data?

Objectives:
  • Get occurrence data on a species

  • Visualize the data to understand them

  • Clean GBIF dataset for further analyses

Time estimation: 30 minutes
Published: Oct 28, 2022
Last modification: Jun 27, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00129
Revision: 5

GBIF (Global Biodiversity Information Facility, www.gbif.org) is certainly the most remarkable biodiversity data aggregator worldwide, giving access to more than 1 billion records across all taxonomic groups. The data provided through this portal are highly valuable for research. However, some issues exist concerning data heterogeneity, as the records are obtained through various collection methods and from various sources.

In this tutorial we will propose a way to clean occurrence records retrieved from GBIF.

This tutorial is based on the rOpenSci Zizka tutorial.

Agenda

In this tutorial, we will cover:

  1. Retrieve data from GBIF
    1. Get data
    2. Where do the records come from?
    3. Filtering data based on the data origin
    4. Have a look at the number of counts per record
    5. Filtering data on individual counts
    6. Have a look at the age of records
    7. Filtering data based on the age of records
    8. Taxonomic investigation
    9. Filtering
    10. Sub-step with OGR2ogr
    11. Visualize your data on a GIS oriented visualization
  2. Conclusion

Retrieve data from GBIF

Get data

Hands-on: Data upload
  1. Create a new history for this tutorial

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Import the files from GBIF: Get species occurrences data tool with the following parameters:
    • param-file “Scientific name of the species”: write the scientific name of something you are interested in, for example Loligo vulgaris
    • “Data source to get data from”: Global Biodiversity Information Facility : GBIF
    • “Number of records to return”: 999999 (a deliberately large value, to make sure all available records are returned)
    Comment

    The spocc Galaxy tool allows you to search species occurrences across a single data source or many (GBIF, eBird, iNaturalist, EcoEngine, VertNet, BISON). Changing the number of records to return allows you to retrieve all occurrences or a limited number of them. Specifying more than one data source will change how the output dataset is formatted. (If you prefer to retrieve occurrences programmatically outside Galaxy, see the sketch after this hands-on box.)

  3. Check the datatype galaxy-pencil, it should be tabular

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select tabular from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  4. Add tags galaxy-tags to the dataset
    • make them propagating tags (tags starting with #)
    • make a tag corresponding to the species (#LoligoVulgaris for example here)
    • and another tag mentioning the data source (#GBIF for example here).

    Tagging datasets like this is good practice in Galaxy, and will help you (1) find content of particular interest (using the filtering option on the history search form, for example) and (2) quickly see (notably thanks to the propagated tags) which dataset is associated with which content.

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.
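The tutorial itself uses the spocc Galaxy tool for this retrieval step. For readers who want to reproduce it programmatically, here is a minimal sketch using the Python pygbif package, an alternative client for the GBIF API (the paging logic shown is an assumption about how you might collect all records; it is not what the Galaxy tool does internally):

```python
# Minimal sketch: fetch all GBIF occurrence records for a species with pygbif
# (pip install pygbif). The GBIF API returns at most 300 records per request,
# so we page through results using the offset parameter.
from pygbif import occurrences as occ

records = []
offset = 0
while True:
    page = occ.search(scientificName="Loligo vulgaris", limit=300, offset=offset)
    records.extend(page["results"])
    if page["endOfRecords"] or not page["results"]:
        break
    offset += len(page["results"])

print(f"Retrieved {len(records)} occurrence records")
```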

Where do the records come from?

Here we propose to investigate the content of the dataset, looking notably at the “basisOfRecord” attribute, to learn more about the heterogeneity related to data collection origin.

Hands-on: "basisOfRecord" filtering
  1. Count tool with the following parameters:
    • param-file “from dataset”: output (output of Get species occurrences data tool)
    • “Count occurrences of values in column(s)”: c[17]
    Comment

    This tool is one of the important “classical” Galaxy tools that allow you to better synthesize the information content of your data. Here we apply it to the 17th column (corresponding to the basisOfRecord attribute), but don’t hesitate to investigate other attributes!
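For readers curious about what the Count tool computes, here is a minimal Python sketch of the same value counting on a tabular file (the file name is hypothetical; 0-based index 16 corresponds to the 17th column used above):

```python
# Minimal sketch of the Count tool: tally the values found in column 17
# (index 16, basisOfRecord) of a tab-separated occurrence file.
from collections import Counter

with open("occurrences.tabular") as f:
    f.readline()  # skip the header line
    counts = Counter(line.rstrip("\n").split("\t")[16] for line in f)

for value, n in counts.most_common():
    print(f"{n}\t{value}")
```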

Question
  1. How many different types of data collection origin are there?
  2. What is your assumption regarding this heterogeneity?
  1. 5
  2. Each basisOfRecord type is related to a different collection method, and thus to different data quality.

Filtering data based on the data origin

Hands-on: Filter data on basisOfRecord GBIF attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: output (output of Get species occurrences data tool)
    • “With following condition”: c17=='HUMAN_OBSERVATION' or c17=='OBSERVATION' or c17=='PRESERVED_SPECIMEN'
    • “Number of header lines to skip”: 1

    Question
    1. How many records are kept and what is the percentage of filtered data?
    2. Why are we keeping only these 3 types of data collection origin?
    1. 470 records are kept; 8.79% of the records were dropped.
    2. These data collection origins are considered the most relevant.
  2. Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #basisOfRecord

    Datasets can be tagged as described above; see the tagging instructions and the explanation of name tags in the previous section.
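As an aside, the condition passed to the Filter tool above is plain Python syntax. A minimal sketch of the equivalent row filtering outside Galaxy (file names are hypothetical):

```python
# Minimal sketch of the Filter step: keep only rows whose 17th column
# (index 16, basisOfRecord) matches one of the three retained values.
KEEP = {"HUMAN_OBSERVATION", "OBSERVATION", "PRESERVED_SPECIMEN"}

with open("occurrences.tabular") as fin, open("filtered.tabular", "w") as fout:
    fout.write(fin.readline())  # copy the header line (1 header line skipped)
    for line in fin:
        if line.rstrip("\n").split("\t")[16] in KEEP:
            fout.write(line)
```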

Have a look at the number of counts per record

Here we propose to have a look at the number of counts per record, to see whether some records might contain errors.

Hands-on: Summary statistics of count
  1. Summary Statistics tool with the following parameters:
    • param-file “Summary statistics on”: out_file1 (output of Filter tool)
    • “Column or expression”: c72
  2. Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #individualCount
Question
  1. What are the minimum and maximum individual counts?
  1. From 1 to 100
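Here is a minimal sketch of what the Summary Statistics step computes on column 72 (individualCount); the file name is hypothetical, and we assume empty cells are skipped:

```python
# Minimal sketch of the Summary Statistics step on column 72 (index 71,
# individualCount). Empty cells are ignored here, which is also why such
# records disappear in the numeric filter of the next step.
import statistics

values = []
with open("filtered.tabular") as f:
    f.readline()  # skip the header line
    for line in f:
        cell = line.rstrip("\n").split("\t")[71]
        if cell:  # skip records with no individualCount value
            values.append(float(cell))

print(min(values), max(values), statistics.mean(values), statistics.median(values))
```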

Filtering data on individual counts

Hands-on: Filter data on individualCount GBIF attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: out_file1 (output of Filter tool)
    • “With following condition”: c72>0 and c72<99
    • “Number of header lines to skip”: 1
Question
  1. How many records are kept and what is the percentage of filtered data?
  2. How can you explain this result?
  3. Which propagating tag can you propose to add here?
  1. 50 records are kept; 89.29% of the records were dropped.
  2. A large proportion of the records were dropped because many records lack any value in the individualCount field.
  3. As in the previous “count” step, you are dealing with the individualCount column, so you can add an #individualCount tag to the output dataset, for example.
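The high drop rate in this step is explained by how the Filter tool handles rows it cannot evaluate: rows where the condition fails to evaluate are discarded. A sketch of plausible equivalent logic, consistent with the observation above that records without an individualCount value were dropped:

```python
def keep(row):
    """Mimic the Filter condition c72>0 and c72<99 on one split row."""
    try:
        c72 = float(row[71])          # column 72, individualCount
    except (ValueError, IndexError):  # empty or missing cell -> row is dropped
        return False
    return 0 < c72 < 99
```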

Have a look at the age of records

Here we propose to have a look at the age of the records, through the `year` GBIF attribute, to know whether there are ancient data we may prefer not to consider.

Hands-on: Summary statistics of year
  1. Summary Statistics tool with the following parameters:
    • param-file “Summary statistics on”: out_file1 (output of Filter tool)
    • “Column or expression”: c41
  2. Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #ageOfRecord
Question
  1. What are the years of the oldest and the most recent records?
  2. Why might it be of interest to treat ancient and recent records differently?
  1. From 1903 to 2018
  2. We can assume ancient records were not made in the same way as recent ones, so keeping ancient records can increase the heterogeneity of our dataset.

Filtering data based on the age of records

Hands-on: Filter data on ageOfRecord GBIF attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: out_file1 (output of Filter tool)
    • “With following condition”: c41>1945
    • “Number of header lines to skip”: 1

    Question
    1. How many records are kept and what is the percentage of filtered data?
    2. Why are we keeping only data from 1945?
    1. 44 records are kept; 11.76% of the records were dropped.
    2. This arbitrary date allows us to keep only fairly recent records, but you can specify another year.
  2. Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #ageOfRecord

    Datasets can be tagged as described above; see the tagging instructions and the explanation of name tags in the previous sections.

Taxonomic investigation

Hands-on: Investigate the taxonomic coverage, at the family level
  1. Count tool with the following parameters:
    • param-file “from dataset”: out_file1 (output of Filter tool)
    • “Count occurrences of values in column(s)”: c[31]
    Comment

    This column allows us to look at the different families associated with the records. Normally, since we are looking at a single species, we should obtain only one family.

Filtering

Hands-on: Filter data on family attribute
  1. Filter tool with the following parameters:
    • param-file “Filter”: out_file1 (output of Filter tool)
    • “With following condition”: c31=='Loliginidae'
    • “Number of header lines to skip”: 1
    Comment

    Here we select only records with the family of interest, Loliginidae.

Question
  1. Is the filtering of interest here?
  2. Why can keeping this step be of interest?
  1. No, because 100% of the records are kept.
  2. Because this is an important step to take into account in this kind of GBIF data treatment; if your goal is to create your own workflow to use on other species, it can be of interest to keep this step.
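Before converting to a GIS format, here is a recap sketch chaining every cleaning criterion applied so far in a single pass. File names are hypothetical, and the column indices are the 0-based equivalents of the columns used in the Galaxy steps above (c17 → 16, c72 → 71, c41 → 40, c31 → 30):

```python
# Recap sketch: apply all the filters from this tutorial in one pass over
# the tab-separated occurrence file.
KEEP_BASIS = {"HUMAN_OBSERVATION", "OBSERVATION", "PRESERVED_SPECIMEN"}

def is_clean(row):
    try:
        return (row[16] in KEEP_BASIS              # c17, basisOfRecord
                and 0 < float(row[71]) < 99        # c72, individualCount
                and float(row[40]) > 1945          # c41, year
                and row[30] == "Loliginidae")      # c31, family
    except (ValueError, IndexError):               # empty or malformed cell
        return False

with open("occurrences.tabular") as fin, open("clean.tabular", "w") as fout:
    fout.write(fin.readline())  # keep the header line
    for line in fin:
        if is_clean(line.rstrip("\n").split("\t")):
            fout.write(line)
```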

Sub-step with OGR2ogr

Hands-on: Convert the occurrence dataset to a GIS one for visualization
  1. OGR2ogr tool with the following parameters:
    • param-file “Gdal supported input file”: out_file1 (output of Filter tool)
    • “Conversion format”: GEOJSON
    • “Specify advanced parameters”: Yes, see full parameter list.
      • In “Add an input dataset open option”:
        • param-repeat “Insert Add an input dataset open option”
          • “Input dataset open option”: X_POSSIBLE_NAMES=longitude
        • param-repeat “Insert Add an input dataset open option”
          • “Input dataset open option”: Y_POSSIBLE_NAMES=latitude
Question
  1. Did you have access to the standard output and error of the original R script?
  2. What kind of information can you retrieve here from the standard output and/or error?
  1. Yes, of course ;) A preview of the stdout is visible when clicking on the history output dataset, and the full report is accessible through the information button, then stdout or stderr (here you can see warnings on the stderr).
  2. The stderr shows several warnings related to the automatic variable name mapping from GBIF to OGR, plus information about the truncation applied to a particularly long GeoJSON value.
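To demystify what OGR2ogr produces here, a minimal sketch building the same kind of GeoJSON FeatureCollection directly with Python's json module. It relies on the longitude/latitude column names declared via X_POSSIBLE_NAMES and Y_POSSIBLE_NAMES in the tool form; the file names are hypothetical:

```python
# Minimal sketch: convert a tab-separated occurrence table into a GeoJSON
# FeatureCollection with one Point feature per record.
import json

features = []
with open("clean.tabular") as f:
    header = f.readline().rstrip("\n").split("\t")
    lon_i = header.index("longitude")  # column name matching X_POSSIBLE_NAMES
    lat_i = header.index("latitude")   # column name matching Y_POSSIBLE_NAMES
    for line in f:
        row = line.rstrip("\n").split("\t")
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [float(row[lon_i]), float(row[lat_i])]},
            "properties": dict(zip(header, row)),
        })

with open("occurrences.geojson", "w") as out:
    json.dump({"type": "FeatureCollection", "features": features}, out)
```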

Visualize your data on a GIS oriented visualization

From your GeoJSON Galaxy history dataset, you can launch GIS visualization.

Hands-on: Launch OpenLayers to visualize a map with your filtered records
  1. Click on the Visualize tab on the upper menu and select Create Visualization
  2. Click on the OpenLayers icon
  3. Select the GeoJSON file from your history
  4. Click on Create Visualization
  5. Select OpenLayers
Question
  1. You don’t see OpenLayers? Do you know why?

1. If you don’t see OpenLayers but other visualization types like Cytoscape, this means your datatype is json, not geojson. You have to change the datatype manually before visualizing the dataset.

Conclusion

In this tutorial we learned how to get occurrence records from GBIF, along with several steps to filter these data so they are ready for analysis. So now, let’s go for the show!