The steps needed to get a shareable dataset!
Write code for humans. Write data for computers.
– Vince Buffalo, Bioinformatics Data Skills
data/raw/
data/clean/
processing/
analysis/
README
We will be sharing two files at minimum
Preferred:
(source: Chapter 12 of R for Data Science)
work_phone
vs. home_phone
)Messy data is usually messy in the following ways:
We’ll go over each in the example.
Hadley Wickham’s (2014) Tidy Data paper. Do read it if you can.
(much of this presentation is modeled on his original slides)
R
heavy, but many tools do the same)knitr::kable(x[106, 1:4], row.names = FALSE)
Timestamp | Name | Email address | Which events do you plan to attend on Sunday, July 30? |
---|---|---|---|
7/17/2017 10:45:31 | Fred Satterfield | px@kiwabmx.bzq | [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day) |
Which events do you plan to attend on Monday, July 31? (not including lightning talks, unconference sessions, and hackathons that will continue or start on 2nd day) | Which events do you plan to attend on Tuesday, August 1st? |
---|---|
Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??) | Workshop: Fundamentals of R (9:30a-12:30p) |
# Read the file in with better headers
library(readr)
x <- read_csv('data/raw/plans_raw.csv',
col_names = c("time_submitted", "name", "email",
"day1", "day2", "day3"),
col_types = list(
col_datetime(format = '%m/%d/%Y %H:%M:%S'),
col_character(), col_character(),
col_character(), col_character(),
col_character()),
skip = 1) # skip row with headers
head(x)
## # A tibble: 6 x 6
## time_submitted name email
## <dttm> <chr> <chr>
## 1 2017-07-14 11:54:46 Bernadine Ratke ad@xosz.tdi
## 2 2017-07-14 10:13:10 Juliane Kling jbacqmnoug@qcsrv.hut
## 3 2017-07-14 10:14:02 Argelia DuBuque sbinxjw@bprhc.ybo
## 4 2017-07-14 10:16:39 Williams Mueller xrugcmajh@qbaetkvwmj.kwe
## 5 2017-07-14 11:17:02 Davida Ledner iokv@vl.dyt
## 6 2017-07-16 03:07:14 Ethan Klocko kl@jbn.vmb
## # ... with 3 more variables: day1 <chr>, day2 <chr>, day3 <chr>
knitr::kable(x[106, 4:6], row.names = FALSE)
day1 | day2 | day3 |
---|---|---|
[Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day) | Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??) | Workshop: Fundamentals of R (9:30a-12:30p) |
Now, let’s talk about what makes this messy.
Tidy solution: divide the data units into tables you can merge as needed
people <- select(x, time_submitted, name, email) %>% unique()
people$data_origin <- 'plans_raw.csv' # if many possible sources
# if non-existent, create an ID to join the tables by.
# (Usually, you might have an ID for each participant at
# this point, which you'd now deidentify.)
people$participant_ID = sample(1:nrow(people), nrow(people))
# remove the people-specific data from `x`
x <- merge(x, people) %>% select(-time_submitted, -name,
-email, -data_origin)
As a bonus, sensitive info is relegated to a single spot, and is easier to deal with - for example, extract non-identifying features of interest:
# dplyr::transmute drops all columns that are not explicitly named
people <- transmute(people,
time_submitted, participant_ID,
email_server = gsub('.+@([^@]+)$', '\\1', email),
number_names = length(strsplit(name, ' ')[[1]])
)
time_submitted | participant_ID | email_server | number_names |
---|---|---|---|
2017-07-14 11:54:46 | 88 | xosz.tdi | 2 |
Of course, this might still be identifying information - consult your local HIPAA board.
What are you planning to do on August 1
to day1
, that is still plan-level info.day1 | day2 | day3 | participant_ID |
---|---|---|---|
[Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day) | Workshop: R Markdown (1:30-3:30p), Re-hack: Diversity and inclusion (morning), All Conference Social Event (Dinner+drinks; 7:30p-??) | NA | 60 |
Tidy solution: translate the data from wide to long
x <- tidyr::gather(x, event_day, event, -participant_ID)
x$event_day <- as.numeric(gsub('day', '', x$event_day)) # clean up
participant_ID | event_day | event |
---|---|---|
60 | 1 | [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day) |
41 | 1 | Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p) |
NA
. Why?Tidy solution: One row per one intention to attend an event
# not shown: function that escapes non-separator commas
x$event <- escape_multiple_values(x$event)
x <- separate_rows(x, event, sep = ', ')
# unescape the event titles
x$event <- unescape_multiple_values(x$event)
We can now easily fix the mis-dating of Brian’s social event, which - as an artifact of survey design - would otherwise still be considered a Day 1 event:
# With just one mis-dated event, we can just re-date it
x$event_day[x$event == paste0("[Saturday, July 29th, 7p] ",
"Dinner/drinks party at Brian Nosek's home (bus ",
"transportation to/from Omni hotel all evening)")] <- 0
participant_ID | event_day | event |
---|---|---|
103 | 1 | Workshop: Fundamentals of meta-analysis (9:30a-12:30p) |
What’s wrong here?
event
: event time, event name, and event type# Structure of `event` is stable (Type: name (time)), so
# a regular expression can extract it
#
# (Not shown: slight data manipulation to make this work
# for all events)
x <- extract(x, event,
c('event_type', 'event_name', 'event_time'),
regex = "(^[[:alnum:]-]+): (.+) \\((.+)\\)$",
remove = FALSE)
participant_ID | event_day | event | event_type | event_name | event_time |
---|---|---|---|---|---|
103 | 1 | Workshop: Fundamentals of meta-analysis (9:30a-12:30p) | Workshop | Fundamentals of meta-analysis | 9:30a-12:30p |
NA
event mean?participant_ID | event_day | event | event_type | event_name | event_time |
---|---|---|---|---|---|
57 | 1 | I am flexible; if there is a hackathon that needs help, let me know. | NA | NA | NA |
flexible_responses <- c("Everything else will be scheduled during the conference depending on what has legs and group interest",
"I am flexible; if there is a hackathon that needs help, let me know.")
# Removal:
# x <- filter(x, !(event %in% flexible_responses))
# Re-definition
x$event_type[x$event %in% flexible_responses] <- "Flexible"
participant_ID | event_day | event | event_type | event_name | event_time |
---|---|---|---|---|---|
57 | 1 | I am flexible; if there is a hackathon that needs help, let me know. | Flexible | NA | NA |
event_date
event_time
to event_starttime
and event_duration
event
, as we’ve extracted everything out of itevents.csv
into its own table and join by event_id
x <- select(x, -event)
write_csv(x, 'data/clean/event_plans.csv')
write_csv(people, 'data/clean/people.csv')
save(people, x, file = 'data/clean/all_data.Rdata')
participant_ID | event_day | event_type | event_name | event_time |
---|---|---|---|---|
103 | 1 | Workshop | Fundamentals of meta-analysis | 9:30a-12:30p |
…but if you’re looking for R, tidyverse
is excellent
(source: Wendy Thomas and Marcel Hebing 2013 presentation on DDI)
(source: https://xkcd.com/927/)
people.csv
event_plans.csv
people.csv
provides participant data for event_plans.csv
(one-to-many)Question: is this necessary if you use version control?
tidying_sips_plans.Rmd
event_plans.csv
participant_ID
people.csv
event_type
NA
(and meanings)event_plans.csv
(cont’d)people.csv
…