2018-05-04

Why shape matters

  • Reshaping refers to the act of changing the physical layout of a data set

  • Reshaping is useful and important for two main reasons:
    1. A proper physical layout eases the readability and interpretation of the data
    2. Different analysis tools expect data to be laid out in different physical formats
  • You could put all your data on a single 1 million character line, but that would impede working with it and understanding it

Data Semantics

To prepare for learning a reshaping tool set, it is useful to define a vocabulary we can use to describe the content of our datasets, independent of their physical form.

  • A data set is a collection of values, which can be grouped in two different ways: according to the variable a value belongs to, and according the the observation a value belongs to.

  • A variable contains all the values that measure the same underlying attribute
    • A variable could be all the values measuring reaction time, or all the values recording levels of an experimental factor, etc.
  • An observation contains all values measured from the same observational unit
    • An observational unit may be a person, an experimental trial, a day, etc.

Data Semantics

Lets make these terms more concrete with an example dataset.

The fictional dataset below shows the results of a study comparing two treatments for migraine headaches. It holds 18 total values, from 6 observations of 3 variables.

person treatment headaches
John Smith a NA
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 2
Mary Johnson b 1

Data Semantics: Values

person treatment headaches
John Smith a NA
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 2
Mary Johnson b 1

The values are everything that is not a row or column label. They are represented physically by each cell of the table.

Data Semantics: Variables

person treatment headaches
John Smith a NA
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 2
Mary Johnson b 1

The variables are represented physically by the columns in the data set. Here we have 3 variables:

  1. person, with three possible values (John Smith, Mary Johnson, and Jane Doe).
  2. treatment, with two possible values (a and b).
  3. headache, with five values (or six,depending on how you think of the missing value (NA, 16, 3, 2, 11, 1).

Data Semantics: Variables

person treatment headaches
John Smith a NA
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 2
Mary Johnson b 1

Variables can be naturally divided into two classes:

  1. Identifier (ID) variables define the unit of observation that measurements take place on.
    • ID variables are like the subscripts that identify what condition an observation is from (e.g., the \(_{ij}\) in \(Y_{ij}\)).
  2. Measured variables represent what is measured from that observational unit.
    • Measure variables are like the \(Y\) in \(Y_{ij}\).
  • For example, row 2 could be written \(Y_{person = John Smith, treatment = b} = 2\)

Data Semantics: Observations

person treatment headaches
John Smith a NA
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 2
Mary Johnson b 1

Each observation made is represented physically by a row in the data frame.

This makes it easy to see that the single unit of observation in this dataset is a person within a treatment condition.

Data Semantics: Normal Form

person treatment headaches
John Smith a NA
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 2
Mary Johnson b 1

When each column holds values of a single variable, each row holds all values from a single observation, and the table holds measurements from a single observational unit, the dataset is said to be normalized. This is the optimal layout for your data in most cases.

But normalized is far from the only format datasets may come in, and we will now examine two broader classes of data forms, "long" and "wide".

Data Semantics: Long Data

variable value
person John Smith
person John Smith
person Jane Doe
person Jane Doe
person Mary Johnson
person Mary Johnson
treatment a
treatment b
treatment a
treatment b
treatment a
treatment b
headaches NA
headaches 2
headaches 16
headaches 11
headaches 2
headaches 1

Long datasets have a column whose values identify which variable the values in other columns are associated with. This is sometimes called an "Entity-Attribute-Value" (EAV) data model.

This format differs from normalized data in that it does not preserve the "variables-as-columns" and "observations-as-rows relationship".

All the variables are stuffed into one column and single observations are now spread over multiple rows.

Data Semantics: Wide Data

Wide datasets split the values from one variable into multiple columns, according to the value of another variable it was observed with. Many of the example datasets in your book use this layout, as it is a very compact representation of data.

The "wide" version of the migraine dataset below splits the headache counts into two different columns, depending on which treatment condition the count was observed in.

person a b
Jane Doe 16 11
John Smith NA 2
Mary Johnson 2 1

This layout trades compactness for information. The headaches label is gone, and single observations are now combined into one row. Additionally, a and b are now used as distinct variables, instead of as different values of the "condition" variable.

Trouble defining 'Wide' and 'Long'

While wide data and long data do have concrete definitions, these terms are often used imprecisely.

They are often used in a relative sense, meaning to make a set of data longer or wider than it originally was, which may or may not correspond to making the data fit the actual definition of wide or long.

These labels can also be confusing because when you change data to a "wide" layout, it may or may not get literally wider. Additionally, a dataset may have many different layouts that each can be called wide or long form.

So, read carefully whenever you see these terms, and take care when using them yourself.