2018-05-04

What is dplyr?

The dplyr package is quickly becoming the de-facto standard for common tasks that involve manipulating data frames. Some of its biggest advantages are:

  • A streamlined and unified interface for tasks commonly performed together.
  • More syntactically and semantically intuitive functions than the tools in base R
  • Often much faster!
  • The ability to manipulate data stored in databases (e.g. MySQL, SQLite, PostgreSQL) with the same language you use for 'ordinary' data in R (i.e., data frames).

This is not to say that it replaces or makes base R syntax obsolete, but it is a great tool to have at your disposal.

Functions == Verbs in dplyr

dplyr provides some basic "verbs", i.e., functions that correspond to the most common data manipulation tasks.

This model helps build a bridge between your thoughts (i.e., what you want to do to the data, like "I want to select only observations from the first time point") into code. For example:

  • "I want to sort the rows" –> arrange()
  • "I want to select only some rows" –> filter()
  • "I want to select only some columns" –> select()
  • "I want to find the unique rows" –> distinct()
  • "I want to create a new column" –> mutate()
  • "I want to aggregate the data" –> summarise()

Getting Started

We'll start our tour by installing the dplyr package and loading a dataset to use in our examples.

install.packages('dplyr')
library(dplyr)
cropYield <- read.csv("http://wjhopper.github.io/psych640-labs/data/crops.csv")
head(cropYield)
##   Field   Till    Fert Yield
## 1     1 Chisel   Broad   119
## 2     1 Chisel    Deep   130
## 3     1 Chisel Surface   123
## 4     2 Chisel   Broad   135
## 5     2 Chisel    Deep   148
## 6     2 Chisel Surface   134

The cropYield data frame has 99 observations of 4 variables: Field, Till, Fert, and Yield.

The arrange function

The arrange function sorts the rows of a data frame based on the values of one or more variables. Lets use it to sort the rows of the cropYield data frame by fertilizer type.

arrange expects the first argument to be the data frame to sort, and uses any remaining arguments as the names of the variables to sort by.

arrange(cropYield, Fert) # Column names are unquoted!!
##    Field    Till    Fert Yield
## 1      1  Chisel   Broad   119
## 2      2  Chisel   Broad   135
## 3      3  Chisel   Broad   140
## 4      4  Chisel   Broad   126
## 5      5  Chisel   Broad   128
##  [ reached getOption("max.print") -- omitted 94 rows ]

The arrange function

If you ask arrange to sort by multiple variables, the rows are sorted hierarchically according to the order of the variables listed in the function call.

arrange(cropYield, Fert, Yield)
##    Field    Till    Fert Yield
## 1      7  Chisel   Broad   104
## 2      8  Chisel   Broad   108
## 3     26   Ridge   Broad   110
## 4     12  Chisel   Broad   114
## 5     23   Ridge   Broad   115
##  [ reached getOption("max.print") -- omitted 94 rows ]

The arrange function

You can also sort by reverse alphabetical or numerically descending order by using the desc function (short for descending) inside the call to arrange.

arrange(cropYield, Fert, desc(Yield)) # desc only works inside arrange
##    Field    Till    Fert Yield
## 1     16 Moldbrd   Broad   160
## 2     24   Ridge   Broad   156
## 3     33   Ridge   Broad   153
## 4     20 Moldbrd   Broad   147
## 5     11  Chisel   Broad   146
##  [ reached getOption("max.print") -- omitted 94 rows ]

The filter function

The filter function selects a subset of rows from the data frame based on logical tests, similar to the [ operator.

filter also expects the first argument to be a data frame. Any other arguments must be expressions that return logical vectors, and filter returns only the rows where the logical conditions are met.

# Find the rows where the tilling method is ridge
filter(cropYield, Till == "Ridge")
##    Field  Till    Fert Yield
## 1     23 Ridge   Broad   115
## 2     23 Ridge    Deep   142
## 3     23 Ridge Surface   148
## 4     24 Ridge   Broad   156
##  [ reached getOption("max.print") -- omitted 29 rows ]

The filter function

If you list multiple conditions, they are combined with the & operator, meaning that a row must meet both conditions to be included in the output.

Here, a row must have it's Till value equal to Ridge, and it's Fert value equal to Deep to be part of the output.

filter(cropYield, Till == "Ridge", Fert == "Deep")
##    Field  Till Fert Yield
## 1     23 Ridge Deep   142
## 2     24 Ridge Deep   174
## 3     25 Ridge Deep   144
## 4     26 Ridge Deep   128
## 5     27 Ridge Deep   153
## 6     28 Ridge Deep   165
##  [ reached getOption("max.print") -- omitted 5 rows ]

The filter function

If you want to include a row if it meets any given condition (a logical OR statement), you must combine the conditions yourself with the | operator.

filter(cropYield, Till == "Ridge" | Fert == "Deep") 
##    Field    Till    Fert Yield
## 1      1  Chisel    Deep   130
## 2      2  Chisel    Deep   148
## 3      3  Chisel    Deep   146
## 4      4  Chisel    Deep   132
## 5      5  Chisel    Deep   141
## 6      6  Chisel    Deep   130
##  [ reached getOption("max.print") -- omitted 49 rows ]

The filter function

If you want to include rows where the value of a variable is a member of a larger subset (e.g., rows where x is 1 or 2, instead of just 1), you can use the %in% operator.

Here we filter out the rows where the Till variable has a value in the set defined by the character vector c("Ridge","Chisel").

filter(cropYield, Till %in% c("Ridge", "Chisel"))
##    Field   Till    Fert Yield
## 1      1 Chisel   Broad   119
## 2      1 Chisel    Deep   130
## 3      1 Chisel Surface   123
## 4      2 Chisel   Broad   135
## 5      2 Chisel    Deep   148
##  [ reached getOption("max.print") -- omitted 64 rows ]

A note on using %in%

Sometimes it makes sense to use the %in% operator to with a set of just one value. A good example is when you're filtering on a variable which has NA values. If you try to use the == operator to filter the values, your results will have the rows which meet this criterion, AND all the NA values!

This occurs because R treats NA as meaning literally missing, thus there is no second value to compare to, and thus the result of the logical test is itself missing.

But if you use the %in% operator, the NA values will be judged "not in the set", and not included in the results.

c(NA == 1, NA %in% 1) # Logical Weirdness, Exhibit A!
## [1]    NA FALSE

The select function

The select function allows you to pare down your data set by only keeping the columns that you specify, and also allows you to rename and reorder the selected columns on the fly.

colnames(cropYield)
## [1] "Field" "Till"  "Fert"  "Yield"
# Existing column name on the right of the =, new name on the left
# Just like assigning a new variable!
select(cropYield, fertilizer = Fert, bushels = Yield, Till)
##    fertilizer bushels    Till
## 1       Broad     119  Chisel
## 2        Deep     130  Chisel
## 3     Surface     123  Chisel
## 4       Broad     135  Chisel
##  [ reached getOption("max.print") -- omitted 95 rows ]

The select function

The select function even allows you to select a continuous range of columns using the start and end column names, separated with the : operator, just like a numeric vector!!!

# So awesome! But, you can't rename at the same time with this syntax
select(cropYield, Till:Yield)
##       Till    Fert Yield
## 1   Chisel   Broad   119
## 2   Chisel    Deep   130
## 3   Chisel Surface   123
## 4   Chisel   Broad   135
## 5   Chisel    Deep   148
## 6   Chisel Surface   134
##  [ reached getOption("max.print") -- omitted 93 rows ]

The select function

You can even specify just the columns you don't want to keep, which is useful for very wide data frames where you want to discard just one or two columns

# The - says to drop this column
select(cropYield, -Field)
##       Till    Fert Yield
## 1   Chisel   Broad   119
## 2   Chisel    Deep   130
## 3   Chisel Surface   123
## 4   Chisel   Broad   135
## 5   Chisel    Deep   148
## 6   Chisel Surface   134
##  [ reached getOption("max.print") -- omitted 93 rows ]

The rename function

You can use the rename function if you want to change any column names without removing any columns.

# Existing column name on the right of the =, new name on the left 
rename(cropYield, fieldID = Field)
##    fieldID    Till    Fert Yield
## 1        1  Chisel   Broad   119
## 2        1  Chisel    Deep   130
## 3        1  Chisel Surface   123
## 4        2  Chisel   Broad   135
## 5        2  Chisel    Deep   148
## 6        2  Chisel Surface   134
##  [ reached getOption("max.print") -- omitted 93 rows ]

The distinct function

The distinct function is similar to the unique function, but instead of returning the unique values in a vector, distinct returns only the unique rows of a data frame.

Lets use it to discover all the unique combinations of Till and Fert contained in this dataset.

distinct(select(cropYield, Till, Fert))
##      Till    Fert
## 1  Chisel   Broad
## 2  Chisel    Deep
## 3  Chisel Surface
## 4 Moldbrd   Broad
## 5 Moldbrd    Deep
## 6 Moldbrd Surface
## 7   Ridge   Broad
## 8   Ridge    Deep
## 9   Ridge Surface

The mutate function

You can use the mutate function to create new variables in your data frame.

Its especially useful for creating new variables based on existing ones, and changing/adding multiple columns at the same time.

mutate(cropYield, Field = factor(Field), yeildSq = Yield^2)
##    Field    Till    Fert Yield yeildSq
## 1      1  Chisel   Broad   119   14161
## 2      1  Chisel    Deep   130   16900
## 3      1  Chisel Surface   123   15129
## 4      2  Chisel   Broad   135   18225
##  [ reached getOption("max.print") -- omitted 95 rows ]

The mutate function

A cool feature of the mutate function is that you can use new variables immediately after you create them, within the same call to mutate!!

# We create  yieldRoot, and then use it in the same function call
mutate(cropYield, yeildRoot = Yield^1/2, bigYield = yeildRoot + Yield)
##    Field    Till    Fert Yield yeildRoot bigYield
## 1      1  Chisel   Broad   119      59.5    178.5
## 2      1  Chisel    Deep   130      65.0    195.0
## 3      1  Chisel Surface   123      61.5    184.5
## 4      2  Chisel   Broad   135      67.5    202.5
##  [ reached getOption("max.print") -- omitted 95 rows ]

The summarise function

The summarise function is useful for applying functions that reduce your variables down to a single value.

summarise(cropYield, avgYield = mean(Yield))
##   avgYield
## 1 139.9394

This may not seem incredibly useful on its own, but allows us to segue into what is perhaps the biggest reason to use dplyr.

The Killer Feature of dplyr

Unless you are struggling to work with a remote MySQL server, or waiting hours for your code to run on data frames with millions of observations, it might not seem worth it to "relearn" how to do everything the "dplyr way".

But dplyr has one feature that makes it worth every minute of learning:

The ability to define persistent groups in your data frame, and apply functions to each individual group, instead of the entire data frame.

Group-wise operations, the hard way

Consider this method of finding the average crop yield for each of the unique tilling and fertilizer groups in our dataset.

ChiselBroad <- mean(cropYield$Yield[cropYield$Till == 'Chisel'
                              & cropYield$Fert == 'Broad'])
ChiselDeep <- mean(cropYield$Yield[cropYield$Till == 'Chisel'
                             & cropYield$Fert == 'Deep'])
ChiselSurface <- mean(cropYield$Yield[cropYield$Till == 'Chisel'
                                & cropYield$Fert == 'Surface'])
MoldbrdBroad <- mean(cropYield$Yield[cropYield$Till == 'Moldbrd'
                               & cropYield$Fert == 'Broad'])
# And so on, and so forth into infinity...

This is verbose, tedious, prone to typos, and doesn't scale to large datasets with many groups. A for loop would help, but there's an even better way…

The group_by function

dplyr provides a function called group_by you can use to specify variables in your data frame whose values (or combinations of values) define groups of observations.

For example, we can use group_by to group the larger cropYield data frame into 9 smaller groups, one for each of the tilling and fertilizer method combinations.

cropsGrouped <- group_by(cropYield, Till, Fert) # create the grouped dataset

The group_by function

The group_by function doesn't change the external appearance of the data frame; the only noticeable change is the addition of the line Groups: Till, Fert [9] when the data frame is printed.

cropsGrouped # Inspect the grouped dataset
## # A tibble: 99 x 4
## # Groups:   Till, Fert [9]
##    Field Till   Fert    Yield
##    <int> <fct>  <fct>   <int>
##  1     1 Chisel Broad     119
##  2     1 Chisel Deep      130
##  3     1 Chisel Surface   123
##  4     2 Chisel Broad     135
##  5     2 Chisel Deep      148
##  6     2 Chisel Surface   134
##  7     3 Chisel Broad     140
##  8     3 Chisel Deep      146
##  9     3 Chisel Surface   142
## 10     4 Chisel Broad     126
## # ... with 89 more rows

This line tells you which columns the groups are based on, and how many individual groups they form.

Group-wise operations, the easy way

When dplyr functions are applied to a grouped data frame, the operations you perform are no longer applied to the entire data frame.

Instead, they are applied to each group of observations individually, without any change to the function call.

This technique makes finding the mean of each individual tilling/fertilizer group incredibly simple instead of complicated and tedious.

cropsGrouped <- group_by(cropYield, Till, Fert) # easy peasy
grpMeans <- summarise(cropsGrouped, avgYeild = mean(Yield)) # lemon squeezy

Group-wise operations, the easy way

Viola!

grpMeans
## # A tibble: 9 x 3
## # Groups:   Till [?]
##   Till    Fert    avgYeild
##   <fct>   <fct>      <dbl>
## 1 Chisel  Broad       124.
## 2 Chisel  Deep        134.
## 3 Chisel  Surface     131.
## 4 Moldbrd Broad       141.
## 5 Moldbrd Deep        151 
## 6 Moldbrd Surface     146 
## 7 Ridge   Broad       128 
## 8 Ridge   Deep        155.
## 9 Ridge   Surface     153.

Group-wise operations

Lets take a closer look at meta-data about our summary data frame.

## # A tibble: 9 x 3
## # Groups:   Till [?]
##   Till    Fert    avgYeild
##   <fct>   <fct>      <dbl>
## 1 Chisel  Broad       124.
## 2 Chisel  Deep        134.
## 3 Chisel  Surface     131.
## 4 Moldbrd Broad       141.
## 5 Moldbrd Deep        151 
## 6 Moldbrd Surface     146 
## 7 Ridge   Broad       128 
## 8 Ridge   Deep        155.
## 9 Ridge   Surface     153.

The Groups line at the top has changed: now our data frame is only grouped by Till, and the group size is now [?].

This is because summarize always "peels off" the final grouping variable, because it doesn't make sense to continue grouping by the same columns. If you did, each group size would be one, because you just summarized each group down to one value.

Group-wise operations

You can use the n function, with no arguments, inside calls to summarise and mutate to count the number of observations in a group.

summarise(cropsGrouped, grpSize = n())
## # A tibble: 9 x 3
## # Groups:   Till [?]
##   Till    Fert    grpSize
##   <fct>   <fct>     <int>
## 1 Chisel  Broad        12
## 2 Chisel  Deep         12
## 3 Chisel  Surface      12
## 4 Moldbrd Broad        10
## 5 Moldbrd Deep         10
## 6 Moldbrd Surface      10
## 7 Ridge   Broad        11
## 8 Ridge   Deep         11
## 9 Ridge   Surface      11

Ungrouping

If you want to remove the grouping you've applied to a data frame, just use the ungroup function to reset it back to a "normal" data frame. sd

ungroup(cropsGrouped) # no more "Groups" line
## # A tibble: 99 x 4
##   Field Till   Fert    Yield
##   <int> <fct>  <fct>   <int>
## 1     1 Chisel Broad     119
## 2     1 Chisel Deep      130
## 3     1 Chisel Surface   123
## 4     2 Chisel Broad     135
## 5     2 Chisel Deep      148
## 6     2 Chisel Surface   134
## 7     3 Chisel Broad     140
## 8     3 Chisel Deep      146
## # ... with 91 more rows

You can check a data frames grouped/ungrouped status with the is.grouped function.

Activity

Use the dplyr functions to:

  • Find the mean and standard error of the crop yields recorded for each field
  • Create a new variable in the data frame that gives the overall rank of each field in terms of average crop yield (see the ?rank function).