Reshaping Datasets with tidyr

2018-05-04

Reshaping with `tidyr`

The tools from the tidyr package help you re-arrange tabular data stored in data frames, with an emphasis on helping you normalize datasets.

There are 4 main functions for re-arranging data in the tidyr package:

gather(): transforms data from wide(r) to long(er)
spread(): transforms data from long(er) to wide(r)
separate(): splits a single column into multiple columns
unite(): combines multiple columns into a single column

The temperature data

We'll demonstrate how to use each function with a small dataset, so its easy to see what changes get made. So, install tidyr, and load the temperature dataset.

install.packages("tidyr")
library(tidyr)
temperature <- read.csv("https://wjhopper.github.io/psych640-labs/data/temperature.csv")
temperature

High Temperatures (Farenheight)
city	Feb_19_2017	Feb_18_2017	Feb_17_2017
Northampton	55	48	38
Boston	59	48	40

The temperature data

High Temperatures (Farenheight)
city	Feb_19_2017	Feb_18_2017	Feb_17_2017
Northampton	55	48	38
Boston	59	48	40

The temperature dataset has measurements of 3 variables: City, Date, and Temperature.

But the layout of the dataset is non-normalized, because values of the Date variable (Feb_19_2017, Feb_18_2017, and Feb_17_2017) are used as unique variables themselves.

Instead, the dates should be in the cells of the table as observed values of the Date variable.

The `gather()` function

Let's use the gather function to move the dates in the column headers down into the table under a new variable date, and move the actual temperature measurements into a new variable called temp.

gather takes 4 required arguments:

data: the name of the data frame to re-arrange
key: the name of the to-be-created variable that will hold values currently in use as variables
value: the name of the to-be-created variable that will hold the values from the variables to be gathered.
...: A bare list of columns to gather (or columns not to gather)

The `gather()` function

Let's make this more specific to our situation.

gather(data = temperature,
       key = "date",
       value = "temp",
       -city)

The key column will be named "date", because it will hold the dates that were previously used as individual variables
The value column will be named "temp" because the values in each of the Feb_19_2017, Feb_18_2017, and Feb_17_2017 variables were our actual temperature measurements.
We want to gather values from all the columns except the city column, so we can write -city. Alternatively, we could write Feb_19_2017, Feb_18_2017, and Feb_17_2017.

The `gather()` function

So this:

temperature <- gather(data = temperature, key = "date", value = "temp", -city)

turns this:

city	Feb_19_2017	Feb_18_2017	Feb_17_2017
Northampton	55	48	38
Boston	59	48	40

into this:

city	date	temp
Northampton	Feb_19_2017	55
Boston	Feb_19_2017	59
Northampton	Feb_18_2017	48
Boston	Feb_18_2017	48
Northampton	Feb_17_2017	38
Boston	Feb_17_2017	40

The `gather()` function

Critically, the temperature observations are still matched up with the same date!

city	Feb_19_2017	Feb_18_2017	Feb_17_2017
Northampton	55	48	38
Boston	59	48	40

city	date	temp
Northampton	Feb_19_2017	55
Boston	Feb_19_2017	59
Northampton	Feb_18_2017	48
Boston	Feb_18_2017	48
Northampton	Feb_17_2017	38
Boston	Feb_17_2017	40

The `spread()` function

The spread function is the complement to the gather function, so we can use it to undo this transformation.

Its job is to takes values from an existing variable in the dataset and "spread" them out, creating a new variable for each unique value in the old variable.

The spread function takes 3 required arguments:

data: the name of the data frame to re-arrange
key: the name of the existing variable whose N unique values will be "spread out" into N new variables
value: the name of the existing variable whose values will be used as values in the N to-be-created variables.

The `spread()` function

So to undo the wider-to-longer normalization we just performed, we would use this function call:

spread(data = temperature,
       key = "date",
       value = "temp")

the key argument is the "date" column, because the "date" column holds the values we want to be new variables
the value argument is the "temp" column, because we want the temperature measurements to be observations in the 3 date variables

We don't have to specify the names of the new variable we're creating, because they are going to come from the values in the "date" variable.

The `spread()` function

So this:

spread(data = temperature, key = "date", value = "temp")

turns this:

city	date	temp
Northampton	Feb_19_2017	55
Boston	Feb_19_2017	59
Northampton	Feb_18_2017	48
Boston	Feb_18_2017	48
Northampton	Feb_17_2017	38
Boston	Feb_17_2017	40

into this:

city	Feb_17_2017	Feb_18_2017	Feb_19_2017
Boston	40	48	59
Northampton	38	48	55

Common `spread()` problems

A common issue encountered with the spread function is the somewhat mysterious error "Duplicate identifiers for rows … ".

Basically, this means is that when the column given as the value argument is removed, the rows of your data frame are not unique. Thus, spread can't figure out which observations belong in each row when it reshapes that data. Rather than potentially corrupt your data, it gives up instead.

In general, the solution is to give each row a unique ID number like so:

temperature$id <- 1:nrow(temperature)

Common `spread()` problems

For example, I can't spread the X variable in this data frame into new A and B variables.

X	Y	Z
A	1	106
A	1	68
B	1	90
B	2	140

spread(d, key=X, value=Z)

## Error: Duplicate identifiers for rows (1, 2)

The problem is that when trying to form the A column, it's ambigious whether the 106 or the 68 should go in the first row and get paired up with 90, because both 106 and 68 have a Y value of 1.

Common `spread()` problems

If we add an observation identifier within A and B, we can reshape the data.

X	Y	Z	id
A	1	106	1
A	1	68	2
B	1	90	1
B	2	140	2

spread(d, key=X, value=Z)

Y	id	A	B
1	1	106	90
1	2	68	NA
2	2	NA	140

The `separate()` function

The separate function is used to transform one variable in your dataset into many variables in your dataset. This is useful when multiple variables have been concatenated into a single variable, and you would like to examine these variables separately.

For example, in our normalized temperature dataset, the date variable is actually a combination of a month, day of month, and year variable, concatenated together.

city	date	temp
Northampton	Feb_19_2017	55
Boston	Feb_19_2017	59
Northampton	Feb_18_2017	48
Boston	Feb_18_2017	48
Northampton	Feb_17_2017	38
Boston	Feb_17_2017	40

The `separate()` function

Let's separate this amalgamated "date" variable into 3 different variables, one for the month, day, and year.

The separate function has 3 required arguments, and a 4^th optional argument you almost may want to specify.

data: the name of the data frame to re-arrange
col: the bare name of the variable to be split-up
into: a character vector of names for the to-be-created variables
sep: the delimiter defining how to split existing values into multiple new values.

By default, separate will split the values in the col variable on any non-alphanumeric character (e.g., _, -, ., :, etc.). You can have it split on a character string, a numeric index (e.g., every 3^rd character), or using a regular expression.

The `separate()` function

In our case, we would use the separate function like so:

separate(data = temperature,
         col = date,
         into = c("month", "day", "year"),
         sep = "_")

We want to separate the column named date into 3 new variables named "month", "day", and "year"
We are going to find the values that should go into the month, day, and year variables by splitting up the values in the date variable at every underscore character.

The `separate()` function

So this:

temperature <- separate(temperature, col=date, into=c("month", "day", "year"), sep = "_")

turns this:

city	date	temp
Northampton	Feb_19_2017	55
Boston	Feb_19_2017	59
Northampton	Feb_18_2017	48
…	…	…

into this:

city	month	day	year	temp
Northampton	Feb	19	2017	55
Boston	Feb	19	2017	59
Northampton	Feb	18	2017	48
…	…	…	…	…

The `unite()` function

The unite function is the complement to the separate function. As you can imagine, its job is to combine multiple variables into a single variable.

The unite function has 3 required arguments, and 2 other arguments you often want to specify:

data: the name of the data frame to re-arrange
col: the bare name of the to-be-created amalgam variable
... : bare columns whose values are to be combined row-wise and placed inside the new variable col
sep: the separator to use between combined values (defaults to "_").
remove: a logical value indicating whether or not to remove columns from the data frame after combining them (defaults to TRUE).

The `unite()` function

So to undo our separation of the date variable, we could use this:

unite(temperature, col=date, month, day, year)

to turn this:

city	month	day	year	temp
Northampton	Feb	19	2017	55
Boston	Feb	19	2017	59
Northampton	Feb	18	2017	48
…	…	…	…	…

back into this:

city	date	temp
Northampton	Feb_19_2017	55
Boston	Feb_19_2017	59
Northampton	Feb_18_2017	48
…	…	…

Resources & More

For more information about the theory behind the tidyr package, check out the package vignette and the tidy data paper.

You may have noticed that the API (i.e., standard syntax) for tidyr functions is not unlike that of the functions in dplyr. This is not a coincidence; both packages have the same core authors and are part of the "tidyverse" collection of R packages.

However, tidyr is not the only choice for reshaping data. The reshape2 package is also quite useful. Its focus is on reshaping more broadly (e.g., from one type of data structure to another, reshaping through aggregation, etc.). If you need a more general reshaping tool, start with the following reshape2 tutorials:

Activity

Normalize the iris dataset using tidyr functions.

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
…	…	…	…	…

Go!

Reshaping with tidyr

The temperature data

The temperature data

The gather() function

The gather() function

The gather() function

The gather() function

The spread() function

The spread() function

The spread() function

Common spread() problems

Common spread() problems

Common spread() problems

The separate() function

The separate() function

The separate() function

The separate() function

The unite() function

The unite() function

Resources & More

Activity

Reshaping with `tidyr`

The `gather()` function

The `gather()` function

The `gather()` function

The `gather()` function

The `spread()` function

The `spread()` function

The `spread()` function

Common `spread()` problems

Common `spread()` problems

Common `spread()` problems

The `separate()` function

The `separate()` function

The `separate()` function

The `separate()` function

The `unite()` function

The `unite()` function