2018-05-04

Reshaping with tidyr

The tools from the tidyr package help you re-arrange tabular data stored in data frames, with an emphasis on helping you normalize datasets.

There are 4 main functions for re-arranging data in the tidyr package:

  1. gather(): transforms data from wide(r) to long(er)
  2. spread(): transforms data from long(er) to wide(r)
  3. separate(): splits a single column into multiple columns
  4. unite(): combines multiple columns into a single column

The temperature data

We'll demonstrate how to use each function with a small dataset, so its easy to see what changes get made. So, install tidyr, and load the temperature dataset.

install.packages("tidyr")
library(tidyr)
temperature <- read.csv("https://wjhopper.github.io/psych640-labs/data/temperature.csv")
temperature
High Temperatures (Farenheight)
city Feb_19_2017 Feb_18_2017 Feb_17_2017
Northampton 55 48 38
Boston 59 48 40

The temperature data

High Temperatures (Farenheight)
city Feb_19_2017 Feb_18_2017 Feb_17_2017
Northampton 55 48 38
Boston 59 48 40

The temperature dataset has measurements of 3 variables: City, Date, and Temperature.

But the layout of the dataset is non-normalized, because values of the Date variable (Feb_19_2017, Feb_18_2017, and Feb_17_2017) are used as unique variables themselves.

Instead, the dates should be in the cells of the table as observed values of the Date variable.

The gather() function

Let's use the gather function to move the dates in the column headers down into the table under a new variable date, and move the actual temperature measurements into a new variable called temp.

gather takes 4 required arguments:

  1. data: the name of the data frame to re-arrange
  2. key: the name of the to-be-created variable that will hold values currently in use as variables
  3. value: the name of the to-be-created variable that will hold the values from the variables to be gathered.
  4. ...: A bare list of columns to gather (or columns not to gather)

The gather() function

Let's make this more specific to our situation.

gather(data = temperature,
       key = "date",
       value = "temp",
       -city)
  • The key column will be named "date", because it will hold the dates that were previously used as individual variables
  • The value column will be named "temp" because the values in each of the Feb_19_2017, Feb_18_2017, and Feb_17_2017 variables were our actual temperature measurements.
  • We want to gather values from all the columns except the city column, so we can write -city. Alternatively, we could write Feb_19_2017, Feb_18_2017, and Feb_17_2017.

The gather() function

So this:

temperature <- gather(data = temperature, key = "date", value = "temp", -city)

turns this:

city Feb_19_2017 Feb_18_2017 Feb_17_2017
Northampton 55 48 38
Boston 59 48 40

into this:

city date temp
Northampton Feb_19_2017 55
Boston Feb_19_2017 59
Northampton Feb_18_2017 48
Boston Feb_18_2017 48
Northampton Feb_17_2017 38
Boston Feb_17_2017 40

The gather() function

Critically, the temperature observations are still matched up with the same date!

city Feb_19_2017 Feb_18_2017 Feb_17_2017
Northampton 55 48 38
Boston 59 48 40
city date temp
Northampton Feb_19_2017 55
Boston Feb_19_2017 59
Northampton Feb_18_2017 48
Boston Feb_18_2017 48
Northampton Feb_17_2017 38
Boston Feb_17_2017 40

The spread() function

The spread function is the complement to the gather function, so we can use it to undo this transformation.

Its job is to takes values from an existing variable in the dataset and "spread" them out, creating a new variable for each unique value in the old variable.

The spread function takes 3 required arguments:

  1. data: the name of the data frame to re-arrange
  2. key: the name of the existing variable whose N unique values will be "spread out" into N new variables
  3. value: the name of the existing variable whose values will be used as values in the N to-be-created variables.

The spread() function

So to undo the wider-to-longer normalization we just performed, we would use this function call:

spread(data = temperature,
       key = "date",
       value = "temp")
  • the key argument is the "date" column, because the "date" column holds the values we want to be new variables
  • the value argument is the "temp" column, because we want the temperature measurements to be observations in the 3 date variables

We don't have to specify the names of the new variable we're creating, because they are going to come from the values in the "date" variable.

The spread() function

So this:

spread(data = temperature, key = "date", value = "temp")

turns this:

city date temp
Northampton Feb_19_2017 55
Boston Feb_19_2017 59
Northampton Feb_18_2017 48
Boston Feb_18_2017 48
Northampton Feb_17_2017 38
Boston Feb_17_2017 40

into this:

city Feb_17_2017 Feb_18_2017 Feb_19_2017
Boston 40 48 59
Northampton 38 48 55

Common spread() problems

A common issue encountered with the spread function is the somewhat mysterious error "Duplicate identifiers for rows … ".

Basically, this means is that when the column given as the value argument is removed, the rows of your data frame are not unique. Thus, spread can't figure out which observations belong in each row when it reshapes that data. Rather than potentially corrupt your data, it gives up instead.

In general, the solution is to give each row a unique ID number like so:

temperature$id <- 1:nrow(temperature)

Common spread() problems

For example, I can't spread the X variable in this data frame into new A and B variables.

X Y Z
A 1 106
A 1 68
B 1 90
B 2 140
spread(d, key=X, value=Z)
## Error: Duplicate identifiers for rows (1, 2)

The problem is that when trying to form the A column, it's ambigious whether the 106 or the 68 should go in the first row and get paired up with 90, because both 106 and 68 have a Y value of 1.

Common spread() problems

If we add an observation identifier within A and B, we can reshape the data.

X Y Z id
A 1 106 1
A 1 68 2
B 1 90 1
B 2 140 2
spread(d, key=X, value=Z)
Y id A B
1 1 106 90
1 2 68 NA
2 2 NA 140

The separate() function

The separate function is used to transform one variable in your dataset into many variables in your dataset. This is useful when multiple variables have been concatenated into a single variable, and you would like to examine these variables separately.

For example, in our normalized temperature dataset, the date variable is actually a combination of a month, day of month, and year variable, concatenated together.

city date temp
Northampton Feb_19_2017 55
Boston Feb_19_2017 59
Northampton Feb_18_2017 48
Boston Feb_18_2017 48
Northampton Feb_17_2017 38
Boston Feb_17_2017 40

The separate() function

Let's separate this amalgamated "date" variable into 3 different variables, one for the month, day, and year.

The separate function has 3 required arguments, and a 4th optional argument you almost may want to specify.

  1. data: the name of the data frame to re-arrange
  2. col: the bare name of the variable to be split-up
  3. into: a character vector of names for the to-be-created variables
  4. sep: the delimiter defining how to split existing values into multiple new values.

By default, separate will split the values in the col variable on any non-alphanumeric character (e.g., _, -, ., :, etc.). You can have it split on a character string, a numeric index (e.g., every 3rd character), or using a regular expression.

The separate() function

In our case, we would use the separate function like so:

separate(data = temperature,
         col = date,
         into = c("month", "day", "year"),
         sep = "_")
  • We want to separate the column named date into 3 new variables named "month", "day", and "year"
  • We are going to find the values that should go into the month, day, and year variables by splitting up the values in the date variable at every underscore character.

The separate() function

So this:

temperature <- separate(temperature, col=date, into=c("month", "day", "year"), sep = "_")

turns this:

city date temp
Northampton Feb_19_2017 55
Boston Feb_19_2017 59
Northampton Feb_18_2017 48

into this:

city month day year temp
Northampton Feb 19 2017 55
Boston Feb 19 2017 59
Northampton Feb 18 2017 48

The unite() function

The unite function is the complement to the separate function. As you can imagine, its job is to combine multiple variables into a single variable.

The unite function has 3 required arguments, and 2 other arguments you often want to specify:

  1. data: the name of the data frame to re-arrange
  2. col: the bare name of the to-be-created amalgam variable
  3. ... : bare columns whose values are to be combined row-wise and placed inside the new variable col
  4. sep: the separator to use between combined values (defaults to "_").
  5. remove: a logical value indicating whether or not to remove columns from the data frame after combining them (defaults to TRUE).

The unite() function

So to undo our separation of the date variable, we could use this:

unite(temperature, col=date, month, day, year)

to turn this:

city month day year temp
Northampton Feb 19 2017 55
Boston Feb 19 2017 59
Northampton Feb 18 2017 48

back into this:

city date temp
Northampton Feb_19_2017 55
Boston Feb_19_2017 59
Northampton Feb_18_2017 48

Resources & More

For more information about the theory behind the tidyr package, check out the package vignette and the tidy data paper.

You may have noticed that the API (i.e., standard syntax) for tidyr functions is not unlike that of the functions in dplyr. This is not a coincidence; both packages have the same core authors and are part of the "tidyverse" collection of R packages.

However, tidyr is not the only choice for reshaping data. The reshape2 package is also quite useful. Its focus is on reshaping more broadly (e.g., from one type of data structure to another, reshaping through aggregation, etc.). If you need a more general reshaping tool, start with the following reshape2 tutorials:

Activity

Normalize the iris dataset using tidyr functions.

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

Go!