2018-05-04

Reshaping your data

Let get into reshaping some data with using the reshape2 package. Install that package now with install.packages("reshape2") and load it with library(reshape2).

A word of caution: don't confuse the reshape2 package with the reshape function in base R - they are unrelated.

Next, download and import the two datasets we will be practicing with: the hare-lynx trapping data set and the familiar loci/image/rhyme memory experiment data.

hares <- read.csv("http://wjhopper.github.io/psych640-labs/data/hare_lynx.csv")
memory <- read.csv("http://wjhopper.github.io/psych640-labs/data/memorydata.csv")

If you have not read the "Data Semantics" slides, I strongly recommend you go back and read those before continuing on.

Melting and casting

Changing a dataset's layout with tools from reshape2 revolves around two operations, melting and casting.

Melting is the process of transforming data into long format by moving variables from column headers into cells of the table. This is also called stacking, or gathering, and reshape2 provides the melt function to perform this task.

Casting is the process of transforming data into into wide(er) format, also called spreading. reshape2 provides the dcast and acast function to create reshaped data frames and arrays/matrices, respectively.

Reshaping often begins by melting data into long format, so we will begin by using the melt function with the hares-lynx data.

The hare-lynx dataset

This dataset records the number of hares and lynx trapped in North Canada from 1900 to 1920.

Year Hares Lynx
1900 30 4
1901 47.2 6.1
1902 70.2 9.8
1903 77.4 35.2
1904 36.3 59.4
1905 20.6 41.7

There are 3 variables here: Year, Animal Type, and Number Trapped.

This is a wide dataset, because it has a column for each variable. But it is non-normalized, because hare and lynx are used as variables, when they should be used in the cells of the table as the values the variable Animal can take.

Stacking Data with melt

The melt function takes values that are stored in column headers and moves them into rows in the data set, while preserving their relationships to other values. It has 4 key arguments:

  1. id.vars: A vector defining which columns of the data should be used as ID variables. This can be a numeric vector of column numbers, or character vector of column names.
  2. measure.vars: A vector defining which columns of the data hold measured variables. This can also be a numeric vector of column numbers, or character vector of column names.
  3. variable.name: A character vector used to name the new column in the reshaped data that holds the values shifted from the column headers down into the rows.
  4. value.name: A character vector used to name the new column in the reshaped data that holds the measured values.

Stacking Data with melt

To normalize our data, we need to shift Hare and Lynx from measured variables into values, so we set measure.vars = c("Hares","Lynx"). We also need to preserve the Year column as an ID variable, so we set id.vars = "Year".

Since "Hare" and "Lynx" are values the Animal Type variable takes on, we'll call the new column holding these values "Animal" by setting variable.name = "Animal".

The numeric values from the Hare and Lynx columns measure the "Number Trapped" variable, so we'll name the new column holding them "Trapped" by setting value.name = "Trapped".

hareMelted <- melt(hares, id.vars="Year",
                 measure.vars=c("Hares","Lynx"),
                 variable.name="Animal", value.name="Trapped")

Stacking Data with melt

Lets compare the structure of the data before and after melting.

hareMelted <- melt(hares, id.vars="Year",
                 measure.vars=c("Hares","Lynx"),
                 variable.name="Animal", value.name="Trapped")
Original data
Year Hares Lynx
1900 30 4
1901 47.2 6.1
1902 70.2 9.8
1903 77.4 35.2
1904 36.3 59.4
1905 20.6 41.7
Reshaped normalized data
Year Animal Trapped
1900 Hares 30
1900 Lynx 4
1901 Hares 47.2
1901 Lynx 6.1
1902 Hares 70.2
1902 Lynx 9.8

Stacking Data with melt

If you omit either the id.vars or the measure.vars arguments, melt assumes all columns that were not specified as ID variables or measure variable belong to the other set.

Here, we only specify the value of id.vars, and melt assumes all other columns (Hares and Lynx) hold measure variables.

omitMeasure <- melt(hares, id.vars = "Year", variable.name="Animal",
                    value.name="Trapped")
Year Animal Trapped
1900 Hares 30
1901 Hares 47.2
1902 Hares 70.2
1903 Hares 77.4
1904 Hares 36.3

Stacking Data with melt

Here, we only specify the value of measure.vars, and melt assumes all other columns (Year) should be used as ID variables. Thus, we get the same result as the previous slide.

omitID <- melt(hares, measure.vars = c("Hares","Lynx"))
Year variable value
1900 Hares 30
1901 Hares 47.2
1902 Hares 70.2
1903 Hares 77.4
1904 Hares 36.3

We can also see that when we omit values for the variable.name and value.name arguments, melt defaults to using the generic terms "variable" and "value" for those column names.

Stacking Data with melt

If you omit both id.vars and measure.vars, melt assumes all columns holding character data or factors are ID variables, and any other columns are measure variables.

omitID <- melt(hares)

No id variables; using all as measure variables

variable value
Year 1900
Year 1901
Year 1902
Year 1903

Since all columns in hares hold numeric data which get treated as measure variables, melt gives the warning seen above about having no ID variables.

Spreading Data with dcast

The dcast function is the inverse of the melt function: it takes a variable's values stored in rows of a dataset and making a new column for each one.

Using dcast can be tricky, since the form of the reshaped data is controlled with a grouping formula (like those used for aov). To cast your data correctly, remember these two rules:

  • On the left side of the ~ operator, list out the columns in the data frame which you want to remain as ID variables, each one separated with a +.

  • On the right side of the ~ operator, list the column in the data frame whose values should be "spread out" to form new variables.

Spreading Data with dcast

Lets use dcast to reshape our new normalized data set hareMelted back into its original shape.

Since the original data set had the Year variable as an ID variable, we'll put Year on the left side of the ~.

The original also used Hare and Lynx as separate measured variables instead of values. To recover this layout, we'll put the name of the column holding the Hare and Lynx values, in this case Animal, on the right side of the formula.

We also need to specify the name of the column holding our measured variable, so we'll set value.var = "Trapped". Here's the full call we'll use:

hareOrig <- dcast(hareMelted, formula = Year ~ Animal,
                  value.var="Trapped")

Spreading Data with dcast

Lets look at the resulting data frame - we got our original layout back!

hareOrig <- dcast(hareMelted, formula = Year ~ Animal,
                  value.var="Trapped")
hareMelted data frame
Year Animal Trapped
1900 Hares 30
1900 Lynx 4
1901 Hares 47.2
1901 Lynx 6.1
1902 Hares 70.2
1902 Lynx 9.8
Cast to original form
Year Hares Lynx
1900 30 4
1901 47.2 6.1
1902 70.2 9.8
1903 77.4 35.2
1904 36.3 59.4
1905 20.6 41.7

Spreading Data with dcast

The value.var argument is optional and if omitted, dcast will use some heuristics to guess which column holds the measured variable in your data.

  1. If there is a column named "value" or "(all)" in your data frame, use that column as the measure variable.
  2. Otherwise, guess that the final column is the measure variable.

Usually your measure variable should be the last column, but if you have multiple measure variables in the same data frame, these heuristics will not necessarily give you the desired result.

Activities

  1. Using dcast, reshape the data frame hareMelted to a data frame which has Animal Type as an ID variable, and the years 1900 to 1920 as measured variables.
  2. Download this dataset, which measures the quarterly earnings per Johnson & Johnson share from 1960 to 1980.
JJ <- read.csv("http://wjhopper.github.io/psych640-labs/data/JJ.csv")

Reshape it to use fiscal quarter as an ID variable instead of using Qtr1, Qtr2, Qtr3, and Qtr4 as measure variables.

  1. Add an observation ID to the InsectSprays data frame:
InsectSprays$obsID <- rep(1:12,nrow(InsectSprays)/12)

Then reshape the InsectSprays data frame to use the values from the spray column as individual measure variables.

NB: Be specific about which column is the measured variable.

More reshaping

Now we'll work with the loci/image/rhyme memory data set, which I've modified to include a second between-subject factor, presentation duration.

memory <- read.csv("http://wjhopper.github.io/psych640-labs/data/memorydata.csv")
Subject Method Duration Score
1 Control 2 11
1 Image 2 NA
1 Loci 2 NA
1 Rhyme 2 16
2 Control 6 NA
2 Image 6 NA

This data set is normalized, and holds 128 values across 40 unique observations from 3 ID variables (Subject, Method, and Duration) and 1 measure variable (Score). The observational unit is a subject in a condition of method and condition of duration.

More reshaping

Lets say we wanted to de-normalize this dataset to only use subject and method as ID variables. This layout would use more rows to hold the data, so we need the melt function to "lengthen" or "stack" the data.

memLong <- melt(memory, id.vars=c("Subject","Method"),
                measure.vars=c("Duration","Score"))
Subject Method variable value
1 Control Score 11
1 Control Duration 2
1 Image Score NA
1 Image Duration 2
1 Loci Score NA
1 Loci Duration 2
1 Rhyme Score 16
1 Rhyme Duration 2

Melting missing values

Subject Method variable value
1 Control Score 11
1 Control Duration 2
1 Image Score NA
1 Image Duration 2
1 Loci Score NA
1 Loci Duration 2

melt also preserves any missing values in the measure variables by default. You should always keep NA's when reshaping your data if they are explicitly missing values, meaning a value should have been sampled but was not (e.g., a survey question was not answered).

If a value is implictly missing, like an NA from a male respondent in a survey that asked about pregnancy history, then set the na.rm argument to TRUE.

When and when not to melt

Subject Method variable value
1 Control Score 11
1 Control Duration 2
1 Image Score NA
1 Image Duration 2
1 Loci Score NA
1 Loci Duration 2

The usefulness of different layouts is contextual, but having the dataset in this form is not useful here because it mixes values from different variables in the same column (i.e., mixing accuracy scores and presentation durations in the values column).

Instead, the values of the presentation duration variable better aid analysis and interpretation when used as values in an ID variable.

Aggregation

Lets use dcast to reshape our long form data into a form that uses each of the four practice methods as measure variables. Using the grouping formula Subject ~ Method to reshape the data gives an unexpected result:

memWide <- dcast(memLong, Subject ~ Method, value.var="value")
Subject Control Image Loci Rhyme
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2

We see the number 2 instead of scores and durations because our formula defined groups of more than one observation. dcast handles this situation by applying a function to summarize the observations with a single value (the default aggregation function is length).

Aggregation

To better understand why this happens, lets take a closer look at our formula. Subject ~ Method defines Subject as the only ID variable, and Method as the only measured variable.

Thus, we can read the formula as saying "For each subject, make a group of values from the value column for each unique value in Method column".

Outlining the groups our forumla defines
Subject Method variable value
1 Control Score 11
1 Control Duration 2
1 Image Score NA
1 Image Duration 2
Groups of observations aggregated by length()
Subject Control Image Loci Rhyme
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2

These groups form the values for the new measure variables, but since the groups are too large they are first summarized with length, yielding our result of all twos.

Aggregation

If you really did want to layout your this way, but aren't interested in reporting the number of observation in each group, you have two options:

  1. Supply the aggregation function you wish to use to summarize the groups of values by setting the fun.aggregate argument (e.g. fun.aggregate = mean).
    • Only do this when it makes sense to combine values across variables. We should not do this, as combining duration and accuracy values doesn't make sense.
  2. Subset the data to remove values from variables you're not interested in
    • e.g.. memLong[memLong$variable != "Duration",]

Aggregating Wisely

It would be useful to know the average number of items recalled in each condition. We can use dcast with the appropriate grouping formula and aggregation function to find this out!

dcast(memory, Method ~ Duration, fun.aggregate = mean) #mean not in quotes!
Method 2 6
Control 6.5 NA
Image NA NA
Loci NA 10
Rhyme 10.5 10.5

Our formula says "Take the scores at each unique value of Method and group them by their Duration". This makes 8 groups of 4 observations, and we set fun.aggregate = mean to summarize these four values with a single value, the mean.

Aggregating Wisely

If we would like to see the mean of each condition, ignoring the fact that some observations were not made when they should have been, we can include na.rm = TRUE.

dcast(memory, Method ~ Duration, fun.aggregate = mean, na.rm =TRUE)
Method 2 6
Control 6.5 5
Image 8.333 9
Loci 14.67 10
Rhyme 10.5 10.5

Formulas with > 2 variables

Alternatively, you may not have meant to aggregate the data at all and instead misspecified the grouping formula.

  • To avoid aggregating, you must include either all the columns, or all but 1 of the columns, in the grouping formula (making 3 the minimum number to include for this dataset).

  • Exactly where variables are used in the formula will affect the resulting layout
    • All variables before the ~ will simply be used as ID variables.
    • When multiple variables are used after the ~, the values from each variable are concatenated with each value from all other variables to form the column headers.

Formulas with > 2 variables

Subject & Duration as ID vars

dcast(memLong, Subject + variable ~ Method)
Subject variable Control Image Loci Rhyme
1 Duration 2 2 2 2
1 Score 11 NA NA 16
2 Duration 6 6 6 6
2 Score NA NA 9 14
3 Duration 2 2 2 2
3 Score 4 16 18 9
4 Duration 6 6 6 6
4 Score 2 13 8 16

Formulas with > 2 variables

Method and Duration values get combined in the column headers

dcast(memLong, Subject ~ Method + variable)
Table continues below
Subject Control_Duration Control_Score Image_Duration Image_Score Loci_Duration
1 2 11 2 NA 2
2 6 NA 6 NA 6
3 2 4 2 16 2
4 6 2 6 13 6
Loci_Score Rhyme_Duration Rhyme_Score
NA 2 16
9 6 14
18 2 9
8 6 16

Special forumula terms

There are two special terms you can include in your grouping formulas in place of column names: ... and .

  1. The ... term is shorthand for "all other variables", so using it makes your formula mean "group by everything else"
  2. The . term means "no variables", so using it makes your formula mean "group by nothing"

Using …

Casting the memLong data frame using Subject + variable ~ Method and ... ~ Method give the same results because Subject and variable are the only remaining columns after specifying value.var as value.

dcast(memLong, ... ~ Method, value.var = 'value')
Subject variable Control Image Loci Rhyme
1 Duration 2 2 2 2
1 Score 11 NA NA 16
2 Duration 6 6 6 6
dcast(memLong, Subject + variable ~ Method )
Subject variable Control Image Loci Rhyme
1 Duration 2 2 2 2
1 Score 11 NA NA 16
2 Duration 6 6 6 6

Using .

Here we use the special term . to write a formula telling dcast to group all the the measured values for method of practice by nothing else!

dcast(memLong, . ~ Method )
. Control Image Loci Rhyme
. 16 16 16 16

Since there are multiple measurements for each methods, dcast applies the defaults aggregation function length to boil the observations down to a single value. The value 16 means 16 measurements were taken for each practice method.

Activities

  1. Reshape the airquality dataset to only use month and day as ID variables. Should missing values be dropped or preserved?
  2. Using the results for #1, layout the airquality dataset using the Months as measure variables, and use all other variables as ID variables.
  3. Use dcast on the built in ToothGrowth data frame, find the standard deviation of the recorded tooth lengths at every combination of supplement type and dose level.
  4. Using melt and then dcast on the built in DNase dataset, find the average protein concentration and optical density on each run of the DNA assay. The final data frame should have Run as an ID variable, and conc and density as measure variables.

More Resources