Data curation: Tidy data & metadata

Simon Podhajsky & David Condon

July 30, 2017

Outline

The steps needed to get a shareable dataset!

  1. Getting your data set tidied up
  2. Creating metadata documentation

Part 1: Tidy data

Slogan

Write code for humans. Write data for computers.

– Vince Buffalo, Bioinformatics Data Skills

First: What challenges do you face?

Note on file organization

 data/raw/
 data/clean/
 processing/
 analysis/
 README

We will be sharing two files at minimum
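As a minimal sketch, the layout above can be created from R itself (the layout names come from the slide; the example is rooted in a temporary directory so it has no side effects):

```r
# Sketch: create the suggested project layout.
# Rooted in a temp dir here so the example is side-effect free.
root <- file.path(tempdir(), "project")
for (d in c("data/raw", "data/clean", "processing", "analysis")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(root, "README"))
```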

Note on file formats

Preferred:

  • plain text
  • open (non-proprietary) format
    • I don’t need to buy software to view this
    • Ideally, I don’t need to install software to view this

Data is tidy if…

  1. each observation occupies a single row
  2. each variable occupies a single column
  3. each type of observational unit resides in its own table

(source: Chapter 12 of R for Data Science)
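A toy illustration (made-up data, not from the SIPS survey): a wide table whose column headers are values becomes tidy with a single `tidyr::gather` call:

```r
library(tidyr)

# Made-up example: the year columns hold values, not variable names
messy <- data.frame(country = c("A", "B"),
                    `1999` = c(100, 200),
                    `2000` = c(150, 250),
                    check.names = FALSE)

# Tidy form: one row per country-year observation
tidy <- gather(messy, key = "year", value = "cases", -country)
```

(In current tidyverse code you would reach for `pivot_longer()`, but `gather()` matches the functions used in these slides.)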

Easy?

Yeah, right

  • Is an observation…
    • a survey submission?
    • each response within that submission?
  • What separates kinds of observations?
  • What separates observations from variables? (work_phone vs. home_phone)
  • The answer may depend on context and judgment, but the broad intuitions carry over

Negative take: tidy ~ not messy

Messy data is usually messy in the following ways:

  • Column headers are values, not variable names
  • Multiple variables are stored in one column
  • Variables are stored in both rows and columns
  • Multiple types of experimental unit stored in the same table
  • One type of experimental unit stored in multiple tables

We’ll go over each in the example.

Origin of “tidy data”

Hadley Wickham’s (2014) Tidy Data paper. Do read it if you can.

(much of this presentation is modeled on his original slides)

Messy dataset example: Your plans for SIPS

Dataset excerpt (1/2)

knitr::kable(x[106, 1:4], row.names = FALSE)
Timestamp: 7/17/2017 10:45:31
Name: Fred Satterfield
Email address: px@kiwabmx.bzq
Which events do you plan to attend on Sunday, July 30?: [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day)

Dataset excerpt (2/2)

Which events do you plan to attend on Monday, July 31? (not including lightning talks, unconference sessions, and hackathons that will continue or start on 2nd day): Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??)
Which events do you plan to attend on Tuesday, August 1st?: Workshop: Fundamentals of R (9:30a-12:30p)

What are the problems here?

  • Unwieldy column names
  • Ambiguous commas - some separate plans, others are used in a sentence
    • “Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p)”
  • Identifying personal information
  • Data mistakes?
    • Brian’s social event is listed on the wrong day
    • What timezone is the datetime? (More in the Metadata section)
    • Double submissions?
  • And, of course, not tidy
# Read the file in with better headers
library(readr)
x <- read_csv('data/raw/plans_raw.csv',
              col_names = c("time_submitted", "name", "email", 
                            "day1", "day2", "day3"),
              col_types = list(
                col_datetime(format = '%m/%d/%Y %H:%M:%S'), 
                col_character(), col_character(), 
                col_character(), col_character(),
                col_character()),
              skip = 1) # skip row with headers
head(x)
## # A tibble: 6 x 6
##        time_submitted             name                    email
##                <dttm>            <chr>                    <chr>
## 1 2017-07-14 11:54:46  Bernadine Ratke              ad@xosz.tdi
## 2 2017-07-14 10:13:10    Juliane Kling     jbacqmnoug@qcsrv.hut
## 3 2017-07-14 10:14:02  Argelia DuBuque        sbinxjw@bprhc.ybo
## 4 2017-07-14 10:16:39 Williams Mueller xrugcmajh@qbaetkvwmj.kwe
## 5 2017-07-14 11:17:02    Davida Ledner              iokv@vl.dyt
## 6 2017-07-16 03:07:14     Ethan Klocko               kl@jbn.vmb
## # ... with 3 more variables: day1 <chr>, day2 <chr>, day3 <chr>
knitr::kable(x[106, 4:6], row.names = FALSE)
day1: [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day)
day2: Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??)
day3: Workshop: Fundamentals of R (9:30a-12:30p)

Now, let’s talk about what makes this messy.

Mess #1: mixing different data sets

  • Our dataset centers on plans for specific events. Why should people’s emails live here?
  • If we need to do any processing on a per-person level, things become unwieldy fast.

Tidy solution: divide the data units into tables you can merge as needed

library(dplyr)

people <- select(x, time_submitted, name, email) %>% unique()
people$data_origin <- 'plans_raw.csv' # if many possible sources

# If non-existent, create an ID to join the tables by.
# (Usually, you might have an ID for each participant at
# this point, which you'd now deidentify.)
people$participant_ID <- sample(1:nrow(people), nrow(people))

# remove the people-specific data from `x`
x <- merge(x, people) %>% select(-time_submitted, -name,
                                 -email, -data_origin)

  • Question: Should we separate the table of events and plans, too?

As a bonus, sensitive info is relegated to a single spot, and is easier to deal with - for example, extract non-identifying features of interest:

# dplyr::transmute drops all columns that are not explicitly named
people <- transmute(people,
                    time_submitted, participant_ID,
                    email_server = gsub('.+@([^@]+)$', '\\1', email),
                    number_names = lengths(strsplit(name, ' ')))
time_submitted participant_ID email_server number_names
2017-07-14 11:54:46 88 xosz.tdi 2

Of course, this might still be identifying information - consult your IRB or HIPAA compliance office.
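To show the “merge as needed” step concretely, here is a minimal sketch with made-up stand-ins for the people and plans tables (`dplyr::left_join` assumed):

```r
library(dplyr)

# Made-up stand-ins for the two tables split above
people_demo <- data.frame(participant_ID = c(1, 2),
                          email_server  = c("a.edu", "b.org"))
plans_demo  <- data.frame(participant_ID = c(1, 1, 2),
                          event = c("Workshop", "Hack", "Workshop"))

# Re-attach person-level columns only when an analysis needs them
joined <- left_join(plans_demo, people_demo, by = "participant_ID")
```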

Mess #2: Data stored in column names

  • Each column stores the date on which the event is held.
  • Even though we renamed it from What are you planning to do on August 1 to day1, that is still plan-level info.
day1: [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day)
day2: Workshop: R Markdown (1:30-3:30p), Re-hack: Diversity and inclusion (morning), All Conference Social Event (Dinner+drinks; 7:30p-??)
day3: NA
participant_ID: 60

Tidy solution: translate the data from wide to long

x <- tidyr::gather(x, event_day, event, -participant_ID)
x$event_day <- as.numeric(gsub('day', '', x$event_day)) # clean up
participant_ID event_day event
60 1 [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day)
41 1 Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p)
  • Side effect to note for later: some people are now planning for NA. Why?

Mess #3: Multiple records in a single row

  • As many plans as there were checkboxes
  • Doing any analysis of event attendance is hard

Tidy solution: One row per one intention to attend an event

# not shown: functions that escape and unescape non-separator commas
x$event <- escape_multiple_values(x$event)
x <- tidyr::separate_rows(x, event, sep = ', ')
# unescape the event titles
x$event <- unescape_multiple_values(x$event)
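The escaping helpers are not shown in the original; here is one hypothetical sketch, under the assumption that every true separator comma precedes either a bracketed date, a `Type:` prefix, or “All Conference”:

```r
# Hypothetical implementations of the helpers referenced above.
# Assumption: a comma separates events only when the next chunk
# starts with "[", a "Word: " prefix, or "All Conference".
escape_multiple_values <- function(s) {
  gsub(",(?! (\\[|[A-Za-z-]+: |All Conference))", "<COMMA>", s, perl = TRUE)
}
unescape_multiple_values <- function(s) {
  gsub("<COMMA>", ",", s, fixed = TRUE)
}
```

After escaping, `separate_rows(x, event, sep = ', ')` splits only on the commas that survived, and unescaping restores the originals.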

We can now easily fix the mis-dating of Brian’s social event, which - as an artifact of survey design - would otherwise still be considered a Day 1 event:

# With just one mis-dated event, we can just re-date it
x$event_day[x$event == paste0("[Saturday, July 29th, 7p] ", 
  "Dinner/drinks party at Brian Nosek's home (bus ",
  "transportation to/from Omni hotel all evening)")] <- 0

Mess #4: Multiple variables in a single column

participant_ID event_day event
103 1 Workshop: Fundamentals of meta-analysis (9:30a-12:30p)

What’s wrong here?

  • All mashed up in event: event time, event name, and event type
# The structure of `event` is stable (Type: name (time)), so
# a regular expression can extract its parts
#
# (Not shown: slight data manipulation to make this work
#  for all events)
x <- tidyr::extract(x, event,
                    c('event_type', 'event_name', 'event_time'),
                    regex = "(^[[:alnum:]-]+): (.+) \\((.+)\\)$",
                    remove = FALSE)
participant_ID event_day event event_type event_name event_time
103 1 Workshop: Fundamentals of meta-analysis (9:30a-12:30p) Workshop Fundamentals of meta-analysis 9:30a-12:30p

Mess #5: Multiple types of experimental unit in the same table

  • “I’m flexible” isn’t a plan for a specific event.
  • Question: Should such responses be removed or re-defined? When should curation remove responses?
    • Relevant: What does a plan for NA event mean?
participant_ID event_day event event_type event_name event_time
57 1 I am flexible; if there is a hackathon that needs help, let me know. NA NA NA
flexible_responses <- c("Everything else will be scheduled during the conference depending on what has legs and group interest", 
                        "I am flexible; if there is a hackathon that needs help, let me know.")

# Removal:
# x <- filter(x, !(event %in% flexible_responses))

# Re-definition
x$event_type[x$event %in% flexible_responses] <- "Flexible"
participant_ID event_day event event_type event_name event_time
57 1 I am flexible; if there is a hackathon that needs help, let me know. Flexible NA NA

Mess #6: One type of experimental unit stored in multiple tables

  • A bullet we dodged here - but can you imagine examples?
  • If the plans Google Form asked for participant dietary restrictions, and the Lightning Talk Google Form asked for participant age, both should be extracted into logical places

Could we make things even cleaner?

  • Add event_date
  • Convert event_time to event_starttime and event_duration
  • Remove event, as we’ve extracted everything out of it
  • In fact, separate out events.csv into its own table and join by event_id

Real artists ~~ship~~ publish

x <- select(x, -event)
write_csv(x, 'data/clean/event_plans.csv')
write_csv(people, 'data/clean/people.csv')
save(people, x, file = 'data/clean/all_data.Rdata')
participant_ID event_day event_type event_name event_time
103 1 Workshop Fundamentals of meta-analysis 9:30a-12:30p

Things are easier with a tidy data set

Tidy data are like toothpaste

  • A tidy dataset is easy to push into “messy” forms, but the reverse doesn’t work nearly as well.
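For instance, going from tidy back to a wide, report-style layout is a single `tidyr::spread` call, while the reverse direction took all the case-by-case work of Part 1:

```r
library(tidyr)

tidy <- data.frame(country = c("A", "A", "B", "B"),
                   year    = c(1999, 2000, 1999, 2000),
                   cases   = c(100, 150, 200, 250))

# Tidy -> "messy" wide: one call
wide <- spread(tidy, key = year, value = cases)
```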

Quick exploration is easy

Tidy tools

It’s not just R

…but if you’re looking for R, tidyverse is excellent

Your tidy woes?

Part 2: Metadata

  • How and why the data were collected - study- and table-level documentation
  • Follows naturally from a clean dataset
  • Makes the data usable and findable

Always be documenting

(source: Wendy Thomas and Marcel Hebing 2013 presentation on DDI)

Option 1: Standardized metadata

Standardized metadata

(source: https://xkcd.com/927/)

More on DDI

Pros

  • Very, very thoroughly standardized & documented
  • XML means machine-readability
  • Connects into other existing frameworks
  • Ideal for extensive datasets

Cons

  • Far too thoroughly standardized - steep learning curve
  • XML does not mean human readability
  • Overkill for small-ish datasets

Tools

Option 2: “Readme” metadata

“Readme” metadata

Documenting our SIPS dataset

We’ll be using the Cornell template

Basics

  1. Title of Dataset: Preliminary plans of SIPS 2017 attendees
  2. Author Information
    • Principal Investigator Contact Information
    • Associate or Co-investigator Contact Information
  3. Date of data collection (single date, range, approximate date): 20170713-20170725

Basics (cont’d)

  1. Geographic location of data collection (where was data collected?): online
  2. Information about funding sources that supported the collection of the data: COS/SIPS and its funders

Sharing / access information

  1. Licenses/restrictions placed on the data: MIT
  2. Links to publications that cite or use the data: -
  3. Links to other publicly accessible locations of the data: -
  4. Links/relationships to ancillary data sets: SIPS 2017 registrations
  5. Was data derived from another source? -
  6. Recommended citation for the data: We’ll soon have one!

Data and file overview

  1. File List
    1. Filename: people.csv
      • Short description: Properties of SIPS attendees
    2. Filename: event_plans.csv
      • Short description: Plans for events by participant, with description of events
  2. Relationship between files: people.csv provides participant data for event_plans.csv (one-to-many)
  3. Additional related data collected that was not included in the current data package: flexible plans, participant registration

Data and file overview (cont’d)

  1. Are there multiple versions of the dataset? no
    • If yes, list versions: -
    • Name of file that was updated: -
      1. Why was the file updated? -
      2. When was the file updated? -

Question: is this necessary if you use version control?

Methodological information

  1. Description of methods used for collection/generation of data (include references): Google Form sent to registered participants via e-mail
  2. Methods for processing the data: Export to CSV, then run tidying_sips_plans.Rmd
  3. Instrument- or software-specific information needed to interpret the data: none, but R 3.4+ preferred
  4. Standards and calibration information, if appropriate: -

Methodological information (cont’d)

  1. Environmental/experimental conditions: -
  2. Describe any quality-assurance procedures performed on the data: -
  3. People involved with sample collection, processing, analysis and/or submission: Katie & Brian & … (collection), Simon (processing & analysis), David (submission)

event_plans.csv

  1. Number of variables: 5
  2. Number of cases/rows: 1260
  3. Variable List
    1. Name: participant_ID
      • Description: unique participant index linked to the person data in people.csv
    2. Name: event_type
      • Description: The format of the event the participant is planning to attend
      • Possible values: Social, Hack, Re-hack, Workshop, Flexible, NA (and meanings)
    3. Name:

event_plans.csv (cont’d)

  1. Missing data codes: -
  2. Specialized formats of other abbreviations used: -

people.csv

Resources