Single Cell Publication - Data Analysis

Author(s)	Helena Rasche

Overview
Time estimation: 1 hour

Level: Advanced

Published: Nov 7, 2024

Last modification: Nov 7, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00469

Revision: 1

Extracting data from the GTN’s Git history isn’t that difficult, but it requires some internal knowledge of how the GTN’s Jekyll-based codebase works. Here we’ll document what we’ve done!

Agenda

In this tutorial, we will cover:

Which PRs are Single Cell PRs?

Classifying PR Content

Contributors over time

All contributions over time

We need our imports and metadata (all merged GitHub PRs), plus the list of all historical names of single-cell tutorial folders.

The author recommends running this code in a ‘Jekyll Console’ context. Jekyll does not natively have support for a console, but there is an open Pull Request to add it. We recommend you install this yourself to most easily run the following code. You can do that by:

Hands On: Installing Jekyll's Console
View the open Pull Request to add it

Download lib/jekyll/commands/console.rb to somewhere on your computer.
Find out where jekyll is installed:
Code In: Bash
gem which jekyll
Code Out
/home/user/galaxy/training-material/.direnv/ruby/gems/jekyll-4.3.3/lib/jekyll.rb
That means the commands should be in:
Code In: Bash
ls /home/user/galaxy/training-material/.direnv/ruby/gems/jekyll-4.3.3/lib/jekyll/commands/
Code Out
build.rb  clean.rb  doctor.rb  help.rb  new.rb  new_theme.rb  serve  serve.rb
Copy the console.rb you downloaded to that folder:
cp ~/Downloads/console.rb $(dirname $(gem which jekyll))/jekyll/commands/
This is written as a command so you can copy and paste it, which reduces the chance of error.
Launch the console with:
jekyll console

Which PRs are Single Cell PRs?

require 'yaml'
data = YAML.load_file('metadata/github.yml')

Let’s fetch data from within the GTN’s infrastructure. You can see documentation for some of our APIs in the RDoc. E.g. here is how we document TopicFilter.list_materials_structured.

# Obtain all single cell materials
mats = TopicFilter
 .list_materials_structured(site, 'single-cell')
 .map { |k, v| v['materials'] }
 .flatten
 .uniq { |x| x['id'] }

# And flatten them into a useful list of old folder names
@sc = mats.map{|x| x['ref_tutorials'].map{|t| t['redirect_from']} + x['ref_slides'].map{|t| t['redirect_from']}}
    .flatten.uniq
    .reject{|x| x=~ /\/short\//} # /short/ is a folder of redirects
    .map{|x| x.split('/')[1..-2].join('/')} # Remove the filename.
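
To make the path handling concrete, here is a minimal sketch of the filtering and trimming steps applied to made-up redirect_from values. The three paths are hypothetical and only illustrate the shape of the data; uniq is applied last here so the duplicate folder collapses, which differs slightly from the order above.

```ruby
# Hypothetical redirect_from values; the real ones come from the materials' metadata.
redirects = [
  '/topics/transcriptomics/tutorials/scrna-preprocessing/tutorial',
  '/topics/transcriptomics/tutorials/scrna-preprocessing/slides',
  '/short/single-cell/scrna-preprocessing',
]

old_folders = redirects
  .reject { |x| x =~ /\/short\// }              # drop the /short/ redirects
  .map { |x| x.split('/')[1..-2].join('/') }    # drop the leading '' and the filename
  .uniq

puts old_folders
```

Splitting on '/' yields an empty first element because the path starts with a slash, which is why the slice starts at index 1.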

The site object is currently required for calculating this list of structured materials. This is only available in the Jekyll console, so you’ll need to run this command in a Jekyll console. All of the subsequent steps can be run in a normal Ruby environment, but you might as well keep running it in the Jekyll console anyway.

Let’s go ahead and patch Array to let us calculate a mean, because laziness is great, actually.

class Array
  def mean
    return self.sum / (0.0 + self.length)
  end
end
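
As a quick sanity check, the sketch below repeats the patch and applies it to a list of 0/1 flags and to an empty list. The empty case matters later: it produces NaN instead of raising an error.

```ruby
class Array
  def mean
    sum / (0.0 + length)
  end
end

share = [1, 0, 1, 1].mean   # three of four entries are 1
empty = [].mean             # 0 / 0.0, which is NaN rather than an error

puts share
puts empty.nan?
```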

Here’s how we’ll define what is or isn’t a single cell tutorial, based on URL:

def is_sc(path)
  p = path.gsub(/-ES/, '_ES').gsub(/-CAT/, '_CAT')

  if p =~ /\/single-cell\// then
    return 1
  end

  # If anything in @sc is a prefix for p, then it's a single-cell file.
  if @sc.any?{|sc| p.start_with?(sc)} then
    return 1
  end
  
  return 0
end
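
Here is a small, self-contained sketch of how this behaves. The single entry in @sc and the two non-single-cell paths are hypothetical stand-ins; in the real analysis @sc is built from the redirect_from metadata.

```ruby
# Hypothetical old folder name, for illustration only.
@sc = ['topics/transcriptomics/tutorials/scrna-preprocessing']

def is_sc(path)
  p = path.gsub(/-ES/, '_ES').gsub(/-CAT/, '_CAT')
  return 1 if p =~ /\/single-cell\//
  return 1 if @sc.any? { |sc| p.start_with?(sc) }
  0
end

in_topic  = is_sc('topics/single-cell/tutorials/GO-enrichment/tutorial.md')
old_home  = is_sc('topics/transcriptomics/tutorials/scrna-preprocessing/tutorial.md')
unrelated = is_sc('topics/admin/tutorials/ansible/tutorial.md')

puts [in_topic, old_home, unrelated].inspect
```

Returning 1 or 0 rather than true or false is what lets us average the results per PR in the next step.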

Now let’s obtain everything that IS a single cell PR. Here we define everything with 50% or more of the files being “single cell” files, as a single cell PR. How did we arrive at 50%? We made some plots and spot-checked individual results to see what made sense.

sc_prs = data
  .reject{|num, pr| pr['author']['login'] == 'github-actions'} # Remove all automation
  .map{|num, pr| 
    [
      num,
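      pr['files']
        # Skip topic-level images, unless their path mentions scrna
        .reject{|f| f['path'].split('/')[2] == 'images' && f['path'] !~ /scrna/ }
        # Skip site-wide assets
        .reject{|f| f['path'] =~ /^assets/}
        # 1 for single cell files, 0 otherwise, so the mean is the single cell share
        .map{|f| is_sc(f['path'])}
        .mean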
    ]
  }
  .reject{|num, sc| sc < 0.5}
  # Reject NaN
  .select{|num, sc| sc == sc}
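
The two filters at the end are easy to misread, so here is a toy sketch of just that part. The PR numbers and flags are invented for illustration.

```ruby
# Toy data: PR number => one 0/1 flag per changed file (hypothetical values).
toy = {
  101 => [1, 1, 0],  # 2 of 3 files are single cell
  102 => [0, 0, 1],  # 1 of 3 files
  103 => [],         # every file was filtered out, so the mean is 0 / 0.0
}

kept = toy
  .map { |num, flags| [num, flags.sum / (0.0 + flags.length)] }
  .reject { |_num, share| share < 0.5 }    # NaN < 0.5 is false, so NaN survives here
  .select { |_num, share| share == share } # NaN is never equal to itself

puts kept.map(&:first).inspect
```

This is why the NaN check has to be a separate step: a comparison against the threshold alone would let PRs with no remaining files slip through.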

Output number 1 done!

File.open('dist.txt', 'w') do |f|
  f.puts "num\tdist"
  sc_prs.each do |num, sc|
    f.puts "#{num}\t#{sc}"
  end
end

Classifying PR Content

Let’s write a classifier for each file type, to enhance our statistics:

def classify(path)
  if path =~ /tutorial[A-Z_]*?\.md/ then
    return 'tutorial'
  elsif path =~ /slides.*\.html/ then
    return 'slides'
  elsif path =~ /faqs.*md/ || path =~ /^snippets/ then
    return 'faq'
  elsif path =~ /metadata.yaml/ || path == 'CONTRIBUTORS.yaml' || path =~ /index.md$/ || path =~ /README.md$/ then
    return 'metadata'
  elsif path =~ /\/workflows\// then
    return 'workflows'
  elsif path =~ /data-(library|manager)/ then
    return 'data-library'
  elsif path =~ /.bib$/ then
    return 'bibliography'
  elsif path =~ /\/images\// then
    return 'image'
  elsif path =~ /tutorials\/.*md/ then
    return 'tutorial'
  elsif path =~ /_plugins/ || path =~ /^bin/ || path =~ /_layouts/ || path =~ /_include/ || path == '_config.yml' || path =~ /assets/ || path =~ /Gemfile/ || path =~/shared/ then
    return 'framework'
  elsif path =~ /metadata\/.*.yaml/ then
    return 'metadata'
  elsif path =~ /metadata\/.*.csv/ || path =~ /Dockerfile/ then
    return 'ignore'
  elsif path =~ /^news/ then
    return 'news'
  end

  # Raise an exception so that we catch the case where we haven't defined a
  # classification rule for a file yet.
  raise "No classification rule for #{path}"
end
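
Because the first matching rule wins, the order of the branches matters. The same idea can be sketched as an ordered table of patterns. This is an alternative layout using only a small subset of the rules, not the code used for the analysis, and RULES and classify_sketch are names made up for this example.

```ruby
# Ordered [pattern, label] pairs; a subset of the rules above, for illustration.
RULES = [
  [/tutorial[A-Z_]*?\.md/, 'tutorial'],
  [/slides.*\.html/,       'slides'],
  [/\/workflows\//,        'workflows'],
  [/\/images\//,           'image'],
]

def classify_sketch(path)
  rule = RULES.find { |pattern, _label| path =~ pattern }
  # Fail loudly when no rule matches, as the full classifier does.
  raise "No classification rule for #{path}" if rule.nil?
  rule[1]
end

tuto_class   = classify_sketch('topics/single-cell/tutorials/GO-enrichment/tutorial.md')
slides_class = classify_sketch('topics/single-cell/tutorials/scrna-case-cell-annotation/slides.html')

puts tuto_class
puts slides_class
```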

And now we’ll classify each of the single cell pull requests by their file type:

results = []
results << [
  "num", "path", "class", "additions", "deletions", "createdAt", "mergedAt"
]
sc_prs.each do |num, _|
  data[num]['files'].reject{|f| f['path'] =~ /test-data/}.each do |f|
    results << [
      num, f['path'], 
      classify(f['path']),
      f['additions'], f['deletions'],
      data[num]['createdAt'], data[num]['mergedAt']
    ]
  end
end

Output number 2!

# Save to sc.tsv
File.open('sc.tsv', 'w') do |f|
  results.each do |r|
    f.puts r.join("\t")
  end
end

Preview of that data:

num	path	class	additions	deletions	createdAt	mergedAt
5484	topics/single-cell/faqs/single_cell_omics.md	faq	3	3	2024-10-29T12:24:51Z	2024-10-29T12:47:23Z
5473	topics/single-cell/tutorials/alevin-commandline/tutorial.md	tutorial	48	47	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case-jupyter_basic-pipeline/tutorial.md	tutorial	3	2	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_FilterPlotandExploreRStudio/tutorial.md	tutorial	1	0	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_FilterPlotandExplore_SeuratTools/tutorial.md	tutorial	1	0	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_JUPYTER-trajectories/tutorial.md	tutorial	6	5	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_alevin-combine-datasets/tutorial.md	tutorial	4	4	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_alevin/tutorial.md	tutorial	3	2	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.md	tutorial	1	0	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.md	tutorial	1	0	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md	tutorial	1	0	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5473	topics/single-cell/tutorials/scrna-case_trajectories/tutorial.md	tutorial	1	0	2024-10-25T11:24:40Z	2024-10-28T12:59:16Z
5447	topics/single-cell/tutorials/GO-enrichment/tutorial.md	tutorial	7	3	2024-10-11T11:22:34Z	2024-10-11T13:36:22Z
5445	topics/single-cell/tutorials/scrna-case-cell-annotation/slides.html	slides	1	1	2024-10-11T10:52:08Z	2024-10-12T15:44:00Z
5443	topics/single-cell/faqs/single_cell_omics.md	faq	4	2	2024-10-11T10:16:18Z	2024-10-12T15:43:24Z
5416	topics/single-cell/metadata.yaml	metadata	4	2	2024-10-07T22:26:36Z	2024-10-12T15:35:11Z
5416	topics/single-cell/tutorials/GO-enrichment/tutorial.md	tutorial	1	0	2024-10-07T22:26:36Z	2024-10-12T15:35:11Z
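
As a quick illustration of what this table supports, the sketch below sums the added lines per class for five rows copied from the preview (only the num, class, and additions columns are kept).

```ruby
# Columns: num, class, additions. Values copied from the preview above.
rows = [
  [5484, 'faq',      3],
  [5473, 'tutorial', 48],
  [5445, 'slides',   1],
  [5416, 'metadata', 4],
  [5416, 'tutorial', 1],
]

added = Hash.new(0)
rows.each { |_num, klass, additions| added[klass] += additions }

puts added
```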

Contributors over time

Additionally, we want to figure out how our contributions and contributors changed over time.

require 'yaml'
require 'date'
require 'pp'

Here are the current classifications of contributors:

KEYS = %w[authorship editing testing ux infrastructure translation data]

Let’s get all the single cell tutorials:

tutorials = Dir.glob("topics/single-cell/tutorials/*/tutorial.md")

We’ll want to get all the timepoints from 2019 (when single cell was added) to 2025, so we’ll set up an empty data structure for this.

NB: This is NOT the best solution. It is a simple brute-force approach, and the runtime is fine.

timepoints = []
(2019..2025).each do |year|
  (1..12).each do |month|
    s = "#{year}-#{month}-01T00:00:00Z"
    timepoints << [
      s,
      DateTime.parse(s).to_time.to_i
    ]
  end
end
contribs_over_time = timepoints.map{|n, t| [t, KEYS.map{|k| [k, []]}.to_h]}.to_h
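
To check the shape of this structure, here is a minimal sketch that builds an equivalent one with Time.utc instead of parsing date strings. That swap is a choice made for this example, and ROLE_KEYS and empty_roles are illustrative names.

```ruby
ROLE_KEYS = %w[authorship editing testing ux infrastructure translation data]

# The first second of every month from 2019 through 2025, as Unix timestamps.
months = (2019..2025).flat_map do |year|
  (1..12).map { |month| Time.utc(year, month, 1).to_i }
end

# One fresh, empty list per role per month (the arrays are never shared).
empty_roles = months.map { |t| [t, ROLE_KEYS.map { |k| [k, []] }.to_h] }.to_h

puts empty_roles.size
puts Time.at(empty_roles.keys.first).utc.strftime('%Y-%m-%d')
```

Seven years of twelve months gives 84 keys, each mapping to a hash with one empty list per role.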

OK, let’s get the history of each tutorial:

tutorials.each do |tutorial|
  # if tutorial !~ /bulk/
  #   next
  # end

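  # One record per commit: "GTN_GTN <hash> <unix timestamp>", followed by the
  # file's path at that commit (--follow tracks the file across renames).
  git=`git log --follow --name-only --format="GTN_GTN %H %at" #{tutorial}`
  commits = git.split("GTN_GTN ")
  commits.reject!{|c| c.empty?}
  commits.map!{|c|
    # Collapse newlines so res[0] is "<hash> <timestamp>" and res[1] is the path
    res = c.gsub(/\n+/, "\t").split(/\t/)
    if res.size > 2
      puts "ERROR: #{res}"
    end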

    hash = res[0].split(' ')[0]
    time = res[0].split(' ')[1].to_i

    f = res[1]
    contents_at_time = `git show #{hash}:#{f}`
    begin
      contents_meta = YAML.load(contents_at_time)
    rescue
      next
    end

    if contents_meta.nil?
      next
    end

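    # Front matter with a flat "contributors" list carries no roles, so count
    # everyone on it as an author; otherwise use the per-role "contributions".
    if contents_meta.key?("contributors")
      c = {
        'authorship' => contents_meta["contributors"],
      }
    else
      c = contents_meta["contributions"]
    end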

    squashed_i = DateTime.parse(Time.at(time).strftime("%Y-%m-01T00:00:00Z")).to_time.to_i

    {
      :hash => hash,
      :time => time,
      :date => Time.at(time),
      :sqsh => squashed_i, # The time rounded to the month
      :path => res[1],
      :role => c
    }
  }

  # For every commit
  commits.reverse.compact.each do |c|
    KEYS.each do |k|
      # For every role
      if c[:role].key?(k)
        # add to contribs now and at every time point in the future
        now_and_future_keys = contribs_over_time.keys.select{|t| t >= c[:sqsh] }
        now_and_future_keys.each do |t|
          contribs_over_time[t][k] << c[:role][k]
          contribs_over_time[t][k].flatten!
          contribs_over_time[t][k].uniq!
        end
      end
    end
  end
end
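
The "now and at every time point in the future" step is what makes the counts cumulative. Here is a toy sketch of just that logic, with three months, two roles, and invented contributors (alice and bob are hypothetical).

```ruby
roles  = %w[authorship editing]
months = [Time.utc(2019, 1, 1), Time.utc(2019, 2, 1), Time.utc(2019, 3, 1)].map(&:to_i)
timeline = months.map { |t| [t, roles.map { |k| [k, []] }.to_h] }.to_h

# Hypothetical commits, already rounded down to their month (:sqsh).
commits = [
  { sqsh: months[1], role: { 'authorship' => ['alice'] } },
  { sqsh: months[2], role: { 'authorship' => ['alice'], 'editing' => ['bob'] } },
]

commits.each do |c|
  roles.each do |k|
    next unless c[:role].key?(k)
    # Record the contributors in this month and in every later month.
    timeline.keys.select { |t| t >= c[:sqsh] }.each do |t|
      timeline[t][k] = (timeline[t][k] + c[:role][k]).uniq
    end
  end
end

timeline.each { |t, r| puts "#{Time.at(t).utc.strftime('%Y-%m')} #{r}" }
```

January stays empty, alice appears from February onwards, and bob only from March.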

See, terrible, but it works!

contribs_over_time.reject!{|k, v| v.values.all?{|vv| vv.empty?}}
pp contribs_over_time

File.open("sc-roles.tsv", "w") do |f|
  f.puts "date\tarea\tcount\tcontributors"

  KEYS.each do |k|
    contribs_over_time.each do |date, roles|
      contribs = roles[k]
      # One row per (month, role). puts ends every row with a newline, so the
      # last row of one role no longer runs into the first row of the next.
      f.puts "#{Time.at(date).strftime('%Y-%m-01')}\t#{k}\t#{contribs.count}\t#{contribs.join(',')}"
    end
  end
end

All contributions over time

The plot of all new items over time was requested, and since the git log approach is so deeply ugly and slow, let’s do something slightly different.

Specifically, we were asked for the dates on which single cell tutorials, FAQs, videos, news, and events were added.

The GTN produces RSS feeds of all new content, as well as feeds subsetted by their area (e.g. the single cell feed or the single cell grouped by month) so we can use that to get a list of all new single cell content over time.

require 'net/http'
require 'date'
require './_plugins/util'
require 'uri'

# Ruby's stdlib for doing web requests is ... not great.
def request(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  req_options = {
    use_ssl: uri.scheme == 'https',
  }
  Net::HTTP.start(uri.hostname, uri.port, req_options) do |http|
    http.request(request)
  end
end

Let’s look through the data we get back:

feed = request('https://training.galaxyproject.org/training-material/feeds/single-cell-month.xml').body
# The body of the feed has structure but it's slightly inconvenient.
# The URLs however...
# <a href="https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-preprocessing/workflows/scrna_mp_celseq.html?utm_source=matrix&amp;utm_medium=newsbot&amp;utm_campaign=matrix-news">CelSeq2: Multi Batch (mm10) (February 22, 2019)</a>
#
# That has everything and a date, can just figure out what 'type' of thing it was by the folder/name.

def classify(url)
  if url =~ /tutorial(_[A-Z_]*)?\.html/
    return 'tutorial'
  elsif url.include? '/events/'
    return 'event'
  elsif url =~ /slides(_[A-Z_]*)?\.html/
    return 'slide'
  elsif url.include? '/faqs/'
    return 'faq'
  elsif url.include? '/workflows/'
    return 'workflow'
  elsif url.include? '/news/'
    return 'news'
  end

  # Fail loudly, naming the URL, if nothing above matched.
  raise "No classification rule for #{url}"
end

out = []

feed.scan(/<a href="([^"]+)">([^<]+)<\/a>/).each do |url, title|
  next if url.include? 'gtn-standards-rss.html'
  date = title.match(/\(([^)]+)\)$/)[1]
  date = Date.parse(date).to_s
  out << [date, classify(url)]
end
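
To see what the two regular expressions pull out, here is a minimal sketch run against the sample link quoted in the comments above.

```ruby
require 'date'

# The sample anchor quoted earlier in this section.
sample = '<a href="https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-preprocessing/workflows/scrna_mp_celseq.html?utm_source=matrix&amp;utm_medium=newsbot&amp;utm_campaign=matrix-news">CelSeq2: Multi Batch (mm10) (February 22, 2019)</a>'

url, title = sample.scan(/<a href="([^"]+)">([^<]+)<\/a>/).first

# The $ anchor skips "(mm10)" and picks the parenthesised text at the end of the title.
raw_date = title.match(/\(([^)]+)\)$/)[1]
iso_date = Date.parse(raw_date).to_s

puts raw_date
puts iso_date
puts url.include?('/workflows/')
```

Converting to an ISO date string means the rows can later be sorted as plain strings.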

And that’s almost done! Let’s add in the video data as well:

# Let's add in videos, they're stored in the tutorials. Thankfully these have a
# recording date we can use.
tutos = Dir.glob("topics/single-cell/tutorials/*/tutorial.md")
tutos.each do |tuto|
  meta = safe_load_yaml(tuto)
  next if meta['recordings'].nil? || meta['recordings'].empty?

  meta['recordings'].each do |rec|
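    # to_s so this date sorts as a string alongside the feed dates, even if
    # the YAML loader returned it as a Date object
    out << [rec['date'].to_s, 'video']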
  end
end

And write it out:

File.open("single-cell-over-time.tsv", "w") do |f|
  f.puts "date\ttype"
  out.sort_by{|d, t| d}.each do |date, type|
    f.puts "#{date}\t#{type}"
  end
end

With that, I think we’re done! Ready to plot.

Let us know and we can perhaps generalise this dataset such that you could more easily download it for your own community.

Citing this Tutorial

Helena Rasche, Single Cell Publication - Data Analysis (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/contributing/tutorials/meta-analysis-data/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
