Reshaping Datasets with reshape2

2018-05-04

Reshaping your data

Let get into reshaping some data with using the reshape2 package. Install that package now with install.packages("reshape2") and load it with library(reshape2).

A word of caution: don't confuse the reshape2 package with the reshape function in base R - they are unrelated.

Next, download and import the two datasets we will be practicing with: the hare-lynx trapping data set and the familiar loci/image/rhyme memory experiment data.

hares <- read.csv("http://wjhopper.github.io/psych640-labs/data/hare_lynx.csv")
memory <- read.csv("http://wjhopper.github.io/psych640-labs/data/memorydata.csv")

If you have not read the "Data Semantics" slides, I strongly recommend you go back and read those before continuing on.

Melting and casting

Changing a dataset's layout with tools from reshape2 revolves around two operations, melting and casting.

Melting is the process of transforming data into long format by moving variables from column headers into cells of the table. This is also called stacking, or gathering, and reshape2 provides the melt function to perform this task.

Casting is the process of transforming data into into wide(er) format, also called spreading. reshape2 provides the dcast and acast function to create reshaped data frames and arrays/matrices, respectively.

Reshaping often begins by melting data into long format, so we will begin by using the melt function with the hares-lynx data.

The hare-lynx dataset

This dataset records the number of hares and lynx trapped in North Canada from 1900 to 1920.

Year	Hares	Lynx
1900	30	4
1901	47.2	6.1
1902	70.2	9.8
1903	77.4	35.2
1904	36.3	59.4
1905	20.6	41.7
…	…	…

There are 3 variables here: Year, Animal Type, and Number Trapped.

This is a wide dataset, because it has a column for each variable. But it is non-normalized, because hare and lynx are used as variables, when they should be used in the cells of the table as the values the variable Animal can take.

Stacking Data with `melt`

The melt function takes values that are stored in column headers and moves them into rows in the data set, while preserving their relationships to other values. It has 4 key arguments:

id.vars: A vector defining which columns of the data should be used as ID variables. This can be a numeric vector of column numbers, or character vector of column names.
measure.vars: A vector defining which columns of the data hold measured variables. This can also be a numeric vector of column numbers, or character vector of column names.
variable.name: A character vector used to name the new column in the reshaped data that holds the values shifted from the column headers down into the rows.
value.name: A character vector used to name the new column in the reshaped data that holds the measured values.

Stacking Data with `melt`

To normalize our data, we need to shift Hare and Lynx from measured variables into values, so we set measure.vars = c("Hares","Lynx"). We also need to preserve the Year column as an ID variable, so we set id.vars = "Year".

Since "Hare" and "Lynx" are values the Animal Type variable takes on, we'll call the new column holding these values "Animal" by setting variable.name = "Animal".

The numeric values from the Hare and Lynx columns measure the "Number Trapped" variable, so we'll name the new column holding them "Trapped" by setting value.name = "Trapped".

hareMelted <- melt(hares, id.vars="Year",
                 measure.vars=c("Hares","Lynx"),
                 variable.name="Animal", value.name="Trapped")

Stacking Data with `melt`

Lets compare the structure of the data before and after melting.

hareMelted <- melt(hares, id.vars="Year",
                 measure.vars=c("Hares","Lynx"),
                 variable.name="Animal", value.name="Trapped")

Original data
Year	Hares	Lynx
1900	30	4
1901	47.2	6.1
1902	70.2	9.8
1903	77.4	35.2
1904	36.3	59.4
1905	20.6	41.7
…	…	…

Reshaped normalized data
Year	Animal	Trapped
1900	Hares	30
1900	Lynx	4
1901	Hares	47.2
1901	Lynx	6.1
1902	Hares	70.2
1902	Lynx	9.8
…	…	…

Stacking Data with `melt`

If you omit either the id.vars or the measure.vars arguments, melt assumes all columns that were not specified as ID variables or measure variable belong to the other set.

Here, we only specify the value of id.vars, and melt assumes all other columns (Hares and Lynx) hold measure variables.

omitMeasure <- melt(hares, id.vars = "Year", variable.name="Animal",
                    value.name="Trapped")

Year	Animal	Trapped
1900	Hares	30
1901	Hares	47.2
1902	Hares	70.2
1903	Hares	77.4
1904	Hares	36.3
…	…	…

Stacking Data with `melt`

Here, we only specify the value of measure.vars, and melt assumes all other columns (Year) should be used as ID variables. Thus, we get the same result as the previous slide.

omitID <- melt(hares, measure.vars = c("Hares","Lynx"))

Year	variable	value
1900	Hares	30
1901	Hares	47.2
1902	Hares	70.2
1903	Hares	77.4
1904	Hares	36.3
…	…	…

We can also see that when we omit values for the variable.name and value.name arguments, melt defaults to using the generic terms "variable" and "value" for those column names.

Stacking Data with `melt`

If you omit both id.vars and measure.vars, melt assumes all columns holding character data or factors are ID variables, and any other columns are measure variables.

omitID <- melt(hares)

No id variables; using all as measure variables

variable	value
Year	1900
Year	1901
Year	1902
Year	1903
…	…

Since all columns in hares hold numeric data which get treated as measure variables, melt gives the warning seen above about having no ID variables.

Spreading Data with `dcast`

The dcast function is the inverse of the melt function: it takes a variable's values stored in rows of a dataset and making a new column for each one.

Using dcast can be tricky, since the form of the reshaped data is controlled with a grouping formula (like those used for aov). To cast your data correctly, remember these two rules:

On the left side of the ~ operator, list out the columns in the data frame which you want to remain as ID variables, each one separated with a +.
On the right side of the ~ operator, list the column in the data frame whose values should be "spread out" to form new variables.

Spreading Data with `dcast`

Lets use dcast to reshape our new normalized data set hareMelted back into its original shape.

Since the original data set had the Year variable as an ID variable, we'll put Year on the left side of the ~.

The original also used Hare and Lynx as separate measured variables instead of values. To recover this layout, we'll put the name of the column holding the Hare and Lynx values, in this case Animal, on the right side of the formula.

We also need to specify the name of the column holding our measured variable, so we'll set value.var = "Trapped". Here's the full call we'll use:

hareOrig <- dcast(hareMelted, formula = Year ~ Animal,
                  value.var="Trapped")

Spreading Data with `dcast`

Lets look at the resulting data frame - we got our original layout back!

hareOrig <- dcast(hareMelted, formula = Year ~ Animal,
                  value.var="Trapped")

hareMelted data frame
Year	Animal	Trapped
1900	Hares	30
1900	Lynx	4
1901	Hares	47.2
1901	Lynx	6.1
1902	Hares	70.2
1902	Lynx	9.8
…	…	…

Cast to original form
Year	Hares	Lynx
1900	30	4
1901	47.2	6.1
1902	70.2	9.8
1903	77.4	35.2
1904	36.3	59.4
1905	20.6	41.7
…	…	…

Spreading Data with `dcast`

The value.var argument is optional and if omitted, dcast will use some heuristics to guess which column holds the measured variable in your data.

If there is a column named "value" or "(all)" in your data frame, use that column as the measure variable.
Otherwise, guess that the final column is the measure variable.

Usually your measure variable should be the last column, but if you have multiple measure variables in the same data frame, these heuristics will not necessarily give you the desired result.

Activities

Using dcast, reshape the data frame hareMelted to a data frame which has Animal Type as an ID variable, and the years 1900 to 1920 as measured variables.
Download this dataset, which measures the quarterly earnings per Johnson & Johnson share from 1960 to 1980.

JJ <- read.csv("http://wjhopper.github.io/psych640-labs/data/JJ.csv")

Reshape it to use fiscal quarter as an ID variable instead of using Qtr1, Qtr2, Qtr3, and Qtr4 as measure variables.

Add an observation ID to the InsectSprays data frame:

InsectSprays$obsID <- rep(1:12,nrow(InsectSprays)/12)

Then reshape the InsectSprays data frame to use the values from the spray column as individual measure variables.

NB: Be specific about which column is the measured variable.

More reshaping

Now we'll work with the loci/image/rhyme memory data set, which I've modified to include a second between-subject factor, presentation duration.

memory <- read.csv("http://wjhopper.github.io/psych640-labs/data/memorydata.csv")

Subject	Method	Duration	Score
1	Control	2	11
1	Image	2	NA
1	Loci	2	NA
1	Rhyme	2	16
2	Control	6	NA
2	Image	6	NA
…	…	…	…

This data set is normalized, and holds 128 values across 40 unique observations from 3 ID variables (Subject, Method, and Duration) and 1 measure variable (Score). The observational unit is a subject in a condition of method and condition of duration.

More reshaping

Lets say we wanted to de-normalize this dataset to only use subject and method as ID variables. This layout would use more rows to hold the data, so we need the melt function to "lengthen" or "stack" the data.

memLong <- melt(memory, id.vars=c("Subject","Method"),
                measure.vars=c("Duration","Score"))

Subject	Method	variable	value
1	Control	Score	11
1	Control	Duration	2
1	Image	Score	NA
1	Image	Duration	2
1	Loci	Score	NA
1	Loci	Duration	2
1	Rhyme	Score	16
1	Rhyme	Duration	2
…	…	…	…

Melting missing values

Subject	Method	variable	value
1	Control	Score	11
1	Control	Duration	2
1	Image	Score	NA
1	Image	Duration	2
1	Loci	Score	NA
1	Loci	Duration	2
…	…	…	…

melt also preserves any missing values in the measure variables by default. You should always keep NA's when reshaping your data if they are explicitly missing values, meaning a value should have been sampled but was not (e.g., a survey question was not answered).

If a value is implictly missing, like an NA from a male respondent in a survey that asked about pregnancy history, then set the na.rm argument to TRUE.

When and when not to melt

Subject	Method	variable	value
1	Control	Score	11
1	Control	Duration	2
1	Image	Score	NA
1	Image	Duration	2
1	Loci	Score	NA
1	Loci	Duration	2
…	…	…	…

The usefulness of different layouts is contextual, but having the dataset in this form is not useful here because it mixes values from different variables in the same column (i.e., mixing accuracy scores and presentation durations in the values column).

Instead, the values of the presentation duration variable better aid analysis and interpretation when used as values in an ID variable.

Aggregation

Lets use dcast to reshape our long form data into a form that uses each of the four practice methods as measure variables. Using the grouping formula Subject ~ Method to reshape the data gives an unexpected result:

memWide <- dcast(memLong, Subject ~ Method, value.var="value")

Subject	Control	Image	Loci	Rhyme
1	2	2	2	2
2	2	2	2	2
3	2	2	2	2
4	2	2	2	2
…	…	…	…	…

We see the number 2 instead of scores and durations because our formula defined groups of more than one observation. dcast handles this situation by applying a function to summarize the observations with a single value (the default aggregation function is length).

Aggregation

To better understand why this happens, lets take a closer look at our formula. Subject ~ Method defines Subject as the only ID variable, and Method as the only measured variable.

Thus, we can read the formula as saying "For each subject, make a group of values from the value column for each unique value in Method column".

Outlining the groups our forumla defines
Subject	Method	variable	value
1	Control	Score	11
1	Control	Duration	2
1	Image	Score	NA
1	Image	Duration	2
…	…	…	…

Groups of observations aggregated by length()
Subject	Control	Image	Loci	Rhyme
1	2	2	2	2
2	2	2	2	2
3	2	2	2	2
…	…	…	…	…

These groups form the values for the new measure variables, but since the groups are too large they are first summarized with length, yielding our result of all twos.

Aggregation

If you really did want to layout your this way, but aren't interested in reporting the number of observation in each group, you have two options:

Supply the aggregation function you wish to use to summarize the groups of values by setting the fun.aggregate argument (e.g. fun.aggregate = mean).
- Only do this when it makes sense to combine values across variables. We should not do this, as combining duration and accuracy values doesn't make sense.
Subset the data to remove values from variables you're not interested in
- e.g.. memLong[memLong$variable != "Duration",]

Aggregating Wisely

It would be useful to know the average number of items recalled in each condition. We can use dcast with the appropriate grouping formula and aggregation function to find this out!

dcast(memory, Method ~ Duration, fun.aggregate = mean) #mean not in quotes!

Method	2	6
Control	6.5	NA
Image	NA	NA
Loci	NA	10
Rhyme	10.5	10.5

Our formula says "Take the scores at each unique value of Method and group them by their Duration". This makes 8 groups of 4 observations, and we set fun.aggregate = mean to summarize these four values with a single value, the mean.

Aggregating Wisely

If we would like to see the mean of each condition, ignoring the fact that some observations were not made when they should have been, we can include na.rm = TRUE.

dcast(memory, Method ~ Duration, fun.aggregate = mean, na.rm =TRUE)

Method	2	6
Control	6.5	5
Image	8.333	9
Loci	14.67	10
Rhyme	10.5	10.5

Formulas with > 2 variables

Alternatively, you may not have meant to aggregate the data at all and instead misspecified the grouping formula.

To avoid aggregating, you must include either all the columns, or all but 1 of the columns, in the grouping formula (making 3 the minimum number to include for this dataset).
Exactly where variables are used in the formula will affect the resulting layout
- All variables before the ~ will simply be used as ID variables.
- When multiple variables are used after the ~, the values from each variable are concatenated with each value from all other variables to form the column headers.

Formulas with > 2 variables

Subject & Duration as ID vars

dcast(memLong, Subject + variable ~ Method)

Subject	variable	Control	Image	Loci	Rhyme
1	Duration	2	2	2	2
1	Score	11	NA	NA	16
2	Duration	6	6	6	6
2	Score	NA	NA	9	14
3	Duration	2	2	2	2
3	Score	4	16	18	9
4	Duration	6	6	6	6
4	Score	2	13	8	16
…	…	…	…	…	…

Formulas with > 2 variables

Method and Duration values get combined in the column headers

dcast(memLong, Subject ~ Method + variable)

Table continues below
Subject	Control_Duration	Control_Score	Image_Duration	Image_Score	Loci_Duration
1	2	11	2	NA	2
2	6	NA	6	NA	6
3	2	4	2	16	2
4	6	2	6	13	6
…	…	…	…	…	…

Loci_Score	Rhyme_Duration	Rhyme_Score
NA	2	16
9	6	14
18	2	9
8	6	16
…	…	…

Special forumula terms

There are two special terms you can include in your grouping formulas in place of column names: ... and .

The ... term is shorthand for "all other variables", so using it makes your formula mean "group by everything else"
The . term means "no variables", so using it makes your formula mean "group by nothing"

Using …

Casting the memLong data frame using Subject + variable ~ Method and ... ~ Method give the same results because Subject and variable are the only remaining columns after specifying value.var as value.

dcast(memLong, ... ~ Method, value.var = 'value')

Subject	variable	Control	Image	Loci	Rhyme
1	Duration	2	2	2	2
1	Score	11	NA	NA	16
2	Duration	6	6	6	6
…	…	…	…	…	…

dcast(memLong, Subject + variable ~ Method )

Subject	variable	Control	Image	Loci	Rhyme
1	Duration	2	2	2	2
1	Score	11	NA	NA	16
2	Duration	6	6	6	6
…	…	…	…	…	…

Using .

Here we use the special term . to write a formula telling dcast to group all the the measured values for method of practice by nothing else!

dcast(memLong, . ~ Method )

.	Control	Image	Loci	Rhyme
.	16	16	16	16

Since there are multiple measurements for each methods, dcast applies the defaults aggregation function length to boil the observations down to a single value. The value 16 means 16 measurements were taken for each practice method.

Activities

Reshape the airquality dataset to only use month and day as ID variables. Should missing values be dropped or preserved?
Using the results for #1, layout the airquality dataset using the Months as measure variables, and use all other variables as ID variables.
Use dcast on the built in ToothGrowth data frame, find the standard deviation of the recorded tooth lengths at every combination of supplement type and dose level.
Using melt and then dcast on the built in DNase dataset, find the average protein concentration and optical density on each run of the DNA assay. The final data frame should have Run as an ID variable, and conc and density as measure variables.

More Resources

A brief intro but good intro to reshape2
The reshape2 package authors intro to reshape2
A general, modern framework for tidy data
- Short intro
- Full Paper

Reshaping your data

Melting and casting

The hare-lynx dataset

Stacking Data with melt

Stacking Data with melt

Stacking Data with melt

Stacking Data with melt

Stacking Data with melt

Stacking Data with melt

Spreading Data with dcast

Spreading Data with dcast

Spreading Data with dcast

Spreading Data with dcast

Activities

More reshaping

More reshaping

Melting missing values

When and when not to melt

Aggregation

Aggregation

Aggregation

Aggregating Wisely

Aggregating Wisely

Formulas with > 2 variables

Formulas with > 2 variables

Formulas with > 2 variables

Special forumula terms

Using …

Using .

Activities

More Resources

Stacking Data with `melt`

Stacking Data with `melt`

Stacking Data with `melt`

Stacking Data with `melt`

Stacking Data with `melt`

Stacking Data with `melt`

Spreading Data with `dcast`

Spreading Data with `dcast`

Spreading Data with `dcast`

Spreading Data with `dcast`