Don’t get into useless internet arguments, kids

2017-09-02 2017-10-02 Šimon Podhajský 30 Comments

Here's the product of that one time when I wanted to try out ggplot2 on a controversial dataset about prejudice in the EU, got dragged into an argument about methodology, and ended up learning about sjPlot/sjMisc and tmap in the process of writing a couple thousand words in RMarkdown.

Go read it. It has pretty graphs and ruins your faith in humanity.

"How prejudiced are we really? One more look at that 2015 Eurobarometer"

Talks from SIPS 2017

2017-08-05 2017-10-02 Šimon Podhajský 21 Comments

The annual meeting of the Society for the Improvement of Psychological Science was amazing! You should see the OSF conference page if you missed out on it, and you should see the following presentations if you missed out on me:

The tidy-data portion of David Condon's and my Preparing and Curating Data for Sharing. (Here's an OSF page for the whole workshop.)
My lightning talk about automated testing of legacy research software, featuring my experience of setting up an automated integration test for PsychTaskFramework with Robot Framework and Sikuli.

Porting repositories between GitHub servers with octokit.rb

2017-06-18 Šimon Podhajský 33 Comments

The project I'm working on, PsychTaskFramework, was initially developed on Yale's private git instance. This made perfect sense at the time: the project had no non-Yale collaborators and git.yale.edu is ~~easily~~ accessible to anyone with a Yale ID. And if we need to move later, no big deal, right? GitHub has to have a simple mechanism of porting repositories.

git repositories, yes. GitHub repositories - with labels, milestones, issues, pull requests, and comments? Not so much. The official GitHub support response was that I should avail myself of the API. So I did.

Note 1: My approach makes some avoidable compromises that I note below. Other shortcomings, however, are inherent to the process. The main one is the loss of all GitHub event metadata: all actions and events will appear to have been done at the time of the upload, by the uploading user account. (Luckily, this doesn't apply to git metadata.)

Note 2: I use Ruby and octokit.rb, but this approach should generalize easily to other languages for which Octokit is available.

Existing solutions

I'm not the first to run into this problem. Here are the three alternative solutions that I found most easily. None of them ports milestones or pull requests, and none is actively maintained, but they might just get the job done, or at least form the backbone of your own solution.

If your repositories are not confined to the internal network, you might like github-issue-mover.
github-issue-migrate is an easily extensible Ruby class.
github-issue-import is a configurable tool written in Python. It makes certain choices about indicating issue state, e.g. ports closed issues as open issues that start with the word "[CLOSED]". It doesn't guarantee issue / milestone number equality, but if you're porting to an empty repository and you've never deleted a milestone, that might not be a problem.

Since I wanted to preserve milestones and pull requests -- and, in most regards, to essentially make a carbon copy of the original repository -- I had to roll my own. Here's how I did it. (If you're impatient, here are the scripts as gists.)

Step 1: Copy the commits, branches, and the wiki

This one is easy, because each git repository is a full copy. Just initialize a GitHub repo and push the bare Enterprise repository to it. GitHub has a step-by-step approach here; it includes moving the wiki, too.

Step 2: Get personal access tokens for both systems

For password-less authentication, go to Settings > Developer settings > Personal access tokens (/settings/tokens on each GitHub instance) and generate one. I was liberal with the scopes I allowed the tokens to have; the repo scope should be sufficient, but I haven't tested it.

You will want to revoke these tokens after you're done.

Alternatively, you can use any of the other forms of authentication that Octokit works with.
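
Whichever tokens you generate, it helps to keep them out of the script itself. Here is a minimal sketch that reads them from environment variables; the helper name fetch_token is hypothetical, and the variable names simply mirror the ENTERPRISE_TOKEN and GITHUB_TOKEN constants the scripts below expect.

```ruby
# Read a personal access token from the environment instead of hard-coding it.
# `fetch_token` is a hypothetical helper, not part of Octokit.
def fetch_token(name, env = ENV)
  token = env[name].to_s.strip
  raise KeyError, "Set #{name} before running the script" if token.empty?
  token
end

# Usage (assuming both variables were exported in the shell beforehand):
#   ENTERPRISE_TOKEN = fetch_token('ENTERPRISE_TOKEN')
#   GITHUB_TOKEN     = fetch_token('GITHUB_TOKEN')
```

Failing early on a missing token is a design choice: it is easier to debug than an authentication error halfway through the upload.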

Step 3: Retrieve every GitHub object from the source repo

(Technically, you could retrieve each object in Step 4, as needed. I wanted to investigate the structure of the retrieved objects, though, and do it offline.)

This is the more straightforward part. Download labels, milestones, issues, pull requests, and comments; do so in the order in which they were created. This will make things a little easier later.

require 'octokit'
require 'json'

# Part 1: Extract issues & everything else from the source repo
## Setup
Octokit.configure do |c|
  c.api_endpoint = 'https://git.yale.edu/api/v3/'
  c.auto_paginate = true
end
# set ENTERPRISE_TOKEN prior to this line
yalegit = Octokit::Client.new(:access_token => ENTERPRISE_TOKEN)
repoName = 'levylab/RNA_PTB_task'

## Action
opts = {:state => :all, :sort => :created, :direction => :asc}
labels = yalegit.labels(repoName, {:state => :all})
issuesAndPRs = yalegit.issues(repoName, opts)
pulls = yalegit.pull_requests(repoName, opts)
milestones = yalegit.milestones(repoName, opts)
comments = yalegit.issues_comments(repoName, opts)

## Intermediate save
# Returned objects are Sawyer resources; we need
# `sawyer_resource.map(&:to_h)` to serialize them.
File.open('labels.json', 'w') do |f|
  f.write(labels.map(&:to_h).to_json)
end
# (...and so on for every element)
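
The "and so on" can be folded into one small helper. This is only a sketch: save_collection is a hypothetical name, and the file names are chosen to match the ones the upload script reads in Step 4.

```ruby
require 'json'

# Serialize one collection of API results to "<name>.json".
# Each element only needs to respond to `to_h`, which Sawyer resources
# and plain hashes both do.
def save_collection(name, records, dir: '.')
  path = File.join(dir, "#{name}.json")
  File.write(path, records.map(&:to_h).to_json)
  path
end

# Usage, with the variables retrieved above:
#   { 'labels' => labels, 'milestones' => milestones,
#     'issuesAndPRs' => issuesAndPRs, 'pulls' => pulls,
#     'comments' => comments }.each do |name, records|
#     save_collection(name, records)
#   end
```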

Why did we name a variable issuesAndPRs and then also retrieve pull requests? The Issues API treats pull requests as if they were issues. The Pull Request API obtains additional information that will be useful later.

Step 4: Push to the target repo -- in good order

This is where things get a little tricky. Here's why.

GitHub disallows you from deleting issues. To preserve links to issue numbers, you need to add the issues in the right order.
GitHub does allow you to delete a milestone, but it will only re-use its number if no newer milestone has been created since. Consequently, you will need to create placeholder milestones wherever the source repo's milestone numbering has gaps.
GitHub doesn't allow you to set the numerical identifier for an object.
GitHub only allows links to objects that already exist. Consequently, we need to make sure that if we create an issue with a label, the label's already there.

The order we're going with is labels-milestones-issues-pulls-comments. Don't forget to adjust Octokit configuration for the target GitHub server:

require 'octokit'
require 'json'

# Part 3: Upload everything to target repo on GitHub
## Setup
Octokit.configure do |c|
  c.api_endpoint = 'https://api.github.com/'
  c.auto_paginate = true
end
# set GITHUB_TOKEN prior to this line
github = Octokit::Client.new(:access_token => GITHUB_TOKEN)
repo = 'YaleDecisionNeuro/PsychTaskFramework'

Labels

The main gotcha here is that GitHub has some default labels, which your source repository may or may not be partially using. If it is, we'll upload them, and if it isn't, they shouldn't be there anyway, so let's remove them:

github.labels(repo).each do |l|
  github.delete_label!(repo, l[:name])
end

In no particular order, read and upload your original labels:

labels = JSON.parse(File.read('labels.json'), {symbolize_names: true})
labels.each do |l|
  begin
    github.add_label(repo, l[:name], l[:color])
    puts "Added #{l[:name]} - ##{l[:color]}"
  rescue Octokit::UnprocessableEntity
    # The label already exists on the target; only then update its color.
    # Any other error is left to propagate instead of being swallowed.
    puts "#{l[:name]} already exists, updating:"
    github.update_label(repo, l[:name], {color: l[:color]})
  end
end

Milestones

As explained above, GitHub insists on numbering milestones by itself, but also allows milestone deletions. So we just need to pay attention to any milestones that are missing in our original data.

milestones = JSON.parse(File.read('milestones.json'), {symbolize_names: true}).sort_by {|m| m[:number]}
current_milestone = 0
fake_milestones = []
milestones.each do |m|
  current_milestone = current_milestone + 1
  while m[:number] > current_milestone
    github.create_milestone(repo, "fake #{current_milestone}")
    fake_milestones << current_milestone
    current_milestone = current_milestone + 1
  end
  github.create_milestone(repo, m[:title], {state: m[:state], description: m[:description]})
end

After that, it's trivial to remove the placeholders:

fake_milestones.each do |fake|
  github.delete_milestone(repo, fake)
end
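
If you'd like to check in advance how many placeholders the loop above will create, the gaps can be computed from the saved milestone numbers alone. A small sketch, assuming an empty target repo whose numbering starts at 1; placeholder_numbers is a hypothetical helper name.

```ruby
# Given the milestone numbers present in the source repo, list the numbers
# that would need a temporary placeholder so that the real milestones keep
# their original numbers on an empty target repo.
def placeholder_numbers(existing_numbers)
  return [] if existing_numbers.empty?
  (1..existing_numbers.max).to_a - existing_numbers
end

placeholder_numbers([1, 3, 6]) # => [2, 4, 5]

# Usage with the parsed file from above:
#   placeholder_numbers(milestones.map { |m| m[:number] })
```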

Issues, PRs, and comments

We'll do all of issues, pull requests and comments in a single loop through the issues.

(This strikes some compromises that are harder to defend. The most complete approach, at least with the objects we'd retrieved thus far, would take separate passes for issue / PR creation, adding comments in the right order, and closing the issues if appropriate. The Octokit comment object does not include a direct reference to the issue number, though, and while extracting it is trivial, I just wanted to be done.)
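
For the record, here is roughly what that extraction could look like. It's a sketch that assumes each saved comment carries an :issue_url ending in the issue number, as the comments retrieved in Step 3 do; both helper names are hypothetical.

```ruby
# Pull the issue number off the end of a comment's :issue_url, e.g.
# "https://api.github.com/repos/OWNER/REPO/issues/12" gives 12.
def issue_number_for(comment)
  comment[:issue_url].split('/').last.to_i
end

# Group comments by the issue they belong to, oldest first within each issue.
def comments_by_issue(comments)
  comments.sort_by { |c| c[:id] }.group_by { |c| issue_number_for(c) }
end
```

With a lookup like this, a separate comment pass would not need to compare URLs against every issue in turn.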

First, we'll load the files, extract useful identifiers, and create the issue. Since issues are also auto-numbered but cannot be deleted, we'll also guard against the possibility of duplicating issues we had already added:

issuesAndPRs = JSON.parse(File.read('issuesAndPRs.json'), {symbolize_names: true}).sort_by { |p| p[:number] }
pulls = JSON.parse(File.read('pulls.json'), {symbolize_names: true}).sort_by { |p| p[:number] }
comments = JSON.parse(File.read('comments.json'), {symbolize_names: true}).sort_by { |p| p[:id] }

# In case uploading was interrupted, note the uploaded issues
issues_uploaded = github.issues(repo, {state: :all, sort: :created, direction: :desc})

issuesAndPRs.each do |i|
  # Extract identifiers from the issue
  # Skip existing issues
  issue_number = i[:number]
  unless issues_uploaded.empty?
    last_issue_id = issues_uploaded[0][:number]
    if issue_number <= last_issue_id
      next
    end
  end

  issue_url = i[:url]
  issue_labels = i[:labels].map { |l| l[:name] }
  # Issues without a milestone have a nil :milestone
  issue_milestone = i[:milestone] ? i[:milestone][:number] : nil

  # Create issue
  sleep(3) # to avoid rate limiting
  github.create_issue(repo, i[:title], i[:body], {milestone: issue_milestone, labels: issue_labels})
end

But instead of closing the loop and going to the next issue, we'll do three more things. First, if the original issue was actually a pull request, we'll convert it into a PR or at least note the origin:

if i.key?(:pull_request)
  current_pull = pulls.select { |p| p[:number] == issue_number }[0]
  base = current_pull[:base][:ref]
  head = current_pull[:head][:ref]
  if i[:state] == "open"
    github.create_pull_request_for_issue(repo, base, head, issue_number)
  else
    merge_commit_sha = current_pull[:merge_commit_sha]
    base_sha = current_pull[:base][:sha]
    head_sha = current_pull[:head][:sha]
    pull_note = "**Migration note**: This was a pull request to merge "
    pull_note << "`#{head}` at #{head_sha} into `#{base}` at #{base_sha}. "
    pull_note << "It was merged in #{merge_commit_sha}.\n\n"
    # to_s guards against pull requests with an empty (nil) body
    new_body = pull_note + current_pull[:body].to_s
    github.update_issue(repo, issue_number, { body: new_body })
  end
end

Second, we'll add the original comments to the issue:

comments.select { |c| c[:issue_url] == issue_url }.each do |c|
  github.add_comment(repo, issue_number, c[:body])
end

Finally, we'll close the issue if appropriate:

if i[:state] != 'open'
  github.close_issue(repo, issue_number)
end

This is a little confusing, so I'm noting again that the upload script is also available as a gist.

Step 5: Start working with the new copy of the repository

Add remotes to your working copies. Lock or remove the existing issues. Hang a big banner saying "Work has moved to a new location." Set up a post-receive hook that will automatically re-push commits to their new home.

Omissions, shortcomings, compromises

I was going for a good-enough facsimile, not the perfect replica. Here's what I skipped, and how you could preserve it if you cared to.

I didn't preserve complex issue timelines -- multiple closings and re-openings, changes of labels and milestones, and the like. You could retrieve the events and the comments via source.issue_timeline(repo, issueNumber), sort by :created_at, and add them to the target repository in the right order using the requisite API command. (In fact, you could retrieve everything via source.repository_events(repo) and then use the strategy pattern to walk the entire repo history. If I were making a fully general solution, that's what I'd go for.)
I haven't ported merged pull requests. For the GitHub API to create a pull request from an issue, there needs to be a difference between the base and head refs. Since the merge definitionally removed this difference, the API will refuse the conversion. To get around this, you'd have to find a way to "replay" the commits along with the repository events. Leaving a quick note about the historical origin of the repository seemed like a reasonable compromise.
In comments and issue descriptions, GitHub automagically creates links to existing issues. Automagic issue linking doesn't happen if the issue doesn't exist yet. You can get most of this by adding comments in the order in which they appeared, but even that can occasionally fail -- e.g. if you edited the checklist in the issue OP to link to a relevant issue created later. (You can hack this by iterating through all target issues and comments and making an invisible change like adding a space.)
Hard links point to the wrong location, which is to the source repo (e.g. the README.md linking to a wiki page, or a comment pointing to the canonical URL of a particular file at a particular commit). A content filter that replaces source URL with target URL before it pushes milestones, issues / PRs, and issue comments would be a clean way of fixing this.
There's no issue locking, because I hadn't locked any issues. It is trivial to add, though: check the boolean i[:locked].
Reactions to comments are lost. I'm not sure it would make sense for the uploader to add them.
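
The content filter for hard links mentioned above could be as small as the following sketch. The helper name rewrite_links is hypothetical, and the two web URLs are assumptions derived from the repository names used earlier; it does plain substring replacement and nothing smarter.

```ruby
# Replace links to the source repository with links to the target one.
# The block form of gsub keeps the replacement string literal.
def rewrite_links(text, source_base, target_base)
  return text if text.nil?
  text.gsub(source_base) { target_base }
end

# Assumed web URLs of the two repositories; substitute your own.
SOURCE_BASE = 'https://git.yale.edu/levylab/RNA_PTB_task'
TARGET_BASE = 'https://github.com/YaleDecisionNeuro/PsychTaskFramework'

# Usage: run every body through the filter before uploading it, e.g.
#   github.add_comment(repo, issue_number,
#                      rewrite_links(c[:body], SOURCE_BASE, TARGET_BASE))
```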

Adventures with Qualtrics, part 2: exporting the latest response via API

2017-06-04 Šimon Podhajský 20 Comments

(In Part 1, I wrote about the role of Piped Text and building a custom web service that Qualtrics will recognize.)

For the feature I was trying to implement in December, I needed to evaluate a batch of responses the subject answered earlier in the survey. Luckily, Qualtrics has an API that allows for response export! While the documentation has an example of a response export workflow, I found their per-format export pages more informative. Here's the CSV export documentation page. Still, I ran into some issues that merit documenting.

Requesting a single response? You can't

Since one of the embedded fields that Qualtrics creates is ResponseID, can't we just pass that and let our external service use it to grab our current participant's set of responses? Sadly, no. Qualtrics doesn't allow you to query at the level of a response, only at the level of a survey. (There is an optional lastResponseId parameter in the export query, but that will only get you the responses recorded after the one with that ID, i.e. after the response you're calling the service from. This could be useful if we were building a dataset incrementally, but in my case, I needed the data almost immediately.)

Instead, I assign the subject a unique ID early in the survey. This can be either pre-assigned or generated in the survey - perhaps with the random number generator web service I mentioned above. I pass this ID to my web service, which will use it to pick out the right response.

But we can't select on any response-level variable. This means that to limit our queries, we'll have to do some guessing. If we're sure that there are no race conditions -- i.e. only one person ever takes the survey at a time -- we can use limit = 1 to only get the last response. Alternatively, if you know that the external service will be called immediately after the participant fills out the survey, you can use startDate set to a few hours before current time. (NB: the parameter value takes ISO-8601 format.)

The Nitty Gritty

Now, let's look at an example of the inquiry logic. In the abstract, there are three steps: get the response, unzip it, and load it into an appropriate data structure.

# Excerpt from a Sinatra helper function
response_zip = getResponseFromQualtrics()
response_string = unzip(response_zip)
csv_table = rawToTable(response_string)

Step 1: Get the data

Getting the data is a two-step process. First, I request a CSV file from Qualtrics and wait until it's ready. Second, I download it.

Instead of implementing the handshake myself, I took advantage of the qualtrics_api Ruby gem made by Yurui Zhang. (There's also sunkev's qualtrics gem, which I haven't tried.)

def getResponseFromQualtrics
  start_time = getStartTime(settings.prior_hours)

  QualtricsAPI.configure do |config|
    config.api_token = settings.token
  end

  survey = QualtricsAPI.surveys[settings.survey]
  export_service = survey.export_responses({start_date: start_time})
  export = export_service.start

  while not export.completed?
    sleep(5)
    export.status
  end

  require 'open-uri'
  # URI.open (Ruby 2.5+) replaces the bare open, which no longer fetches URLs as of Ruby 3.0
  return URI.open(export.file_url, "X-API-TOKEN" => settings.token).read
end

def getStartTime(hours_offset)
  require 'time'
  start_time = Time.now.utc - (60 * 60 * hours_offset)
  return start_time.iso8601
end

(These are Sinatra helpers. settings is a Sinatra-wide global that reads in secrets specified in the environment and various other configuration. The dotenv gem is excellent for secret storage in development; as for production, here's how to set secrets on Heroku.)

Steps 2 & 3: Unzip and convert

unzip is just rubyzip; no magic there. There is a bit of a trick to getting a compressed stream to a CSV with headers, though. That's because some of the Ruby CSV methods can only deal with files, not streams.

def rawToTable(response_string)
  require 'csv'
  response_csv = CSV.new(response_string, headers: true)
  response_csv = response_csv.read
  response_csv.delete_if do |row|
    # Remove the row with descriptions & internal IDs
    /^R_/ !~ row['ResponseID'] 
  end
  return response_csv
end

And done!

After this, I select the row that contains the subject ID I had passed in the Qualtrics redirect, pick a choice and evaluate it, and visualize it with an assist from the wonderful animate.css library at an endpoint created by Sinatra and deployed to Heroku. Unlike Qualtrics features, all are well-documented elsewhere.
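
The row selection itself can be sketched in a few lines, since the table that rawToTable returns is enumerable row by row. The column name 'SubjectID' and the helper name are hypothetical; use whatever embedded data field holds your subject identifier.

```ruby
require 'csv'

# Find the first row whose ID column matches the subject ID passed in the
# redirect. CSV values are strings, so the ID is compared as a string.
def row_for_subject(table, subject_id, column: 'SubjectID')
  table.find { |row| row[column] == subject_id.to_s }
end

# Usage:
#   row = row_for_subject(csv_table, params[:subject_id])
#   halt 404 if row.nil?
```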

Approach 2: Avoid the API, pass the values

The API approach has a number of problems. For one, the Qualtrics API is a paid feature. Worse, API calls lag -- at least once, the call and processing took over 30 seconds and caused a request timeout. While I could re-write the interface so that the API call and processing are done by a background process that the front-end checks on periodically, it's a pain that might not be worth it.

The obvious alternative: instead of a subject identifier, pass the responses that the survey has readily available via URL. I write about this in part 1.

There are limits. Because Qualtrics uses GET for everything, you might have to keep your URI under 2000 characters. Basically, don't try to transmit essay responses. (I was worried that Qualtrics itself might throw a fit if I told it to store a 56k-character URI, because piped text is obviously longer than the response it denotes. I shouldn't have worried. Qualtrics managed even a 100k-character URI without a hiccup -- and that's way past the 2,000 characters that your browser and your server can handle. In other words, Qualtrics isn't going to be your constraint.)

As usual, the trade-off for speed is maintainability. You refer to many piped text variables instead of just one or two, so you will likely have to develop a pipeline to generate the URI. You might have named your questions for clearer data manipulation, but for the purposes of piped text, you'll have to replace them with the internal question IDs (QID#). And while you can maintain the order of values in one place, you have to explicitly plan for that.
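
A pipeline for generating the URI does not have to be elaborate. Below is a rough sketch that keeps the order of values in a single array and builds the redirect template from it. The piped-text placeholder format, the q1, q2, ... parameter names, the base URL, and both helper names are assumptions for illustration; check the placeholders against what the Qualtrics piped-text menu generates for your questions.

```ruby
# Build a redirect URL template whose query string consists of piped-text
# placeholders, one per question, in the order given. The placeholders are
# left unencoded on the assumption that Qualtrics substitutes them itself.
def redirect_template(base_url, question_ids)
  pairs = question_ids.each_with_index.map do |qid, index|
    "q#{index + 1}=${q://#{qid}/ChoiceGroup/SelectedChoices}"
  end
  "#{base_url}?#{pairs.join('&')}"
end

# Check a filled-in URI against the rough 2000-character ceiling noted above.
def within_limit?(uri, limit = 2000)
  uri.length <= limit
end

# Usage:
#   puts redirect_template('https://example.com/evaluate', %w[QID1 QID2])
```

Note that the length check only makes sense on the URI after the placeholders have been filled in, because the template is longer than the values it stands for.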

Bonus Approach: No API is best API

Finally, I should note that custom web services and APIs are an extra overhead. For simpler problems, there are at least two steps to attempt first.

1. Abusing Survey Flow

Basic Survey Flow building blocks are quite powerful, making many problems tractable with stock Qualtrics. To pick randomly from a bag of option sets, you can use Randomization to pick exactly one of n embedded data blocks underneath it. Branches, of course, offer basic if conditionals (although not else -- you'll have to take care to make their triggering conditions mutually exclusive).

2. JavaScript

You can do some things with the Qualtrics JavaScript. (For instance, if you can get arbitrary piped text, that could make things easier.) You will need to weigh how much crucial logic you want to embed in JavaScript -- if you don't control the survey-taking environment, you cannot guarantee that the client has JS enabled, and you might have to take extra steps to either degrade functionality gracefully or detect the absence.

Other approaches?

It is very possible that other approaches exist; they were not necessary for my purposes. In one of my next articles, I hope to talk about what they were.
