Functional programming in R: The apply functions

2018-05-04

What is functional programming?

Functional programming means (at least) 2 things:

Expressions do not have side effects (i.e., don't create or change any variables without you explicitly asking for them to)
Functions are "first-class"
- Can be manipulated like any other R object
- Can be passed as arguments to other functions
- Can be returned as a result of a function

Don't be scared though: You've been doing functional programming all along without knowing it, because at it's core, R is a functional programming language!

We're going to focus on the second tenant today, as "first class" functions are the underpinning of the *apply family of functions in R.

First-Class Functions

Being "First Class" simply means that functions can be treated as "just another object". They can be created, removed, passed to other environments and functions, and even have their names changed at will.

To illustrate how flexible functions are, let's write a new function which has two inputs: a numeric vector, and a function that will operate on that numeric vector. We'll assign this function to the name f.

f <- function(x, fun) {
  fun(x) 
}

Yes, this is a function that takes a function as input!

First-Class Functions

Lets use f to take the mean of some random binomial draws, and compare our results with "standard" syntax.

Note that we supply only the name of the mean function only, without any parenthesis following it.

f <- function(x, fun) {
  fun(x)
}
a <- rbinom(10, 50, .35) # How many "heads" occur in 50 "flips", 10 times.
x <- f(x = a, fun = mean) # mean, not mean()!
y <- mean(a)
identical(x, y)

## [1] TRUE

Your first apply function!

Congratulations, you just wrote an apply function! In other words, your function takes a data structure and a function as inputs, applies that function to the data structure, and returns the result. Functions which pattern after this basic archetype are called apply functions.

If you're thinking "Why would anyone go through this contortionist act to take the mean?", you wouldn't be off base.

The mean function, like the majority of the functions in base R, is vector-oriented, meaning it is designed to operate on a vector of values. So if we want the mean of the values in our vector, we can call mean(a) and that's it!

But consider for a moment, how often do you have real data just lying around just as a vector?

Escape from Flatland

More often, you'll have a list or data frame composed of many vectors. But, you'll still find yourself needing to use base R functions which take a vector as input.

And often, you'll find yourself wanting to use the same function on each of the elements in your data structure, whether it be a vector, list or data frame.

It's in these cases that the *apply family of functions is useful.

The *apply functions

We'll be looking the 4 most frequently used *apply functions, listed below:

lapply: The "Apply and List" function
- lapply(X, FUN, ...)
sapply: The "Apply and Simplify" function
- sapply(X, FUN, ..., simplify = TRUE)
vapply: The "Apply and Match to Template" function
- vapply(X, FUN, FUN.VALUE, ..., )
apply: The "Apply to Array Dimension" function
- apply(X, MARGIN, FUN, ...)

NB: The help pages for these functions are obtuse, even for R help pages. Hopefully these slides suffice as an intuitive guide, until you reach a high enough plane of enlightenment to read the documentation.

The X Files FUNS

The formal argument lists for these functions have an unmistakable pattern: Each one has arguments named X and FUN.

FUN, as you might imagine, is the function you wish to apply. And X is the data structure whose individual elements will have that function applied to them.

X can be any object that has elements (which is pretty much every R object!). So, the keys to understanding the behavior of *apply functions are:

Understanding what things are considered to be "the elements" for any given data structure X
Knowing the size of input that FUN consumes (e.g., scalar value, vector, data frame, etc.)

Elemental Heirarchy of Data Structures

This table describes what the elements of the 4 most common data structures are, in order of flexibility.

Data Structure	Elements	Notes
Atomic Vector	Identically typed "Primitives" (e.g, individual integers or characters )	Called "Atomic" because it's elements are the smallest possible units in the R language
Matrix/Array	Identically typed "Primitives"	Matrices/Arrays are just Vectors with a "dim" attribute (e.g, `dim = c(2,4,2)` means 2 rows, 3 columns, 2 pages)
Data Frame	Atomic vectors	Data Frames are special cases of lists, where each element has the same length
List	Any R object	Lists are still vectors, but their elements are allowed to be anything!

lapply: The "Apply and List" Function

The lapply function applies the given function to each element of the given data structure, and always returns its results as a list.

The length of the resulting list (i.e., the number of elements it holds) is equal to the number of elements in the input data structure X.

Here, we supply a two-element numeric vector as the input data structure X, and apply the seq function to each of its elements.

lapply(c(8, 6), seq)

## [[1]]
## [1] 1 2 3 4 5 6 7 8
## 
## [[2]]
## [1] 1 2 3 4 5 6

lapply: The "Apply and List" Function

Under the hood, lapply splits the vector into individual values, and provides each of these values as the first argument to seq.

In other words, it calls seq(8) and seq(6) and assembles the results of each function call as a list.

x <- lapply(c(8, 6), seq)
y <- list(seq(8), seq(6)) # doing it "manually"
identical(x,y)

## [1] TRUE

The advantage of the lapply approach is that you don't have to manually call the function n times when you have a vector with n elements.

lapply: The for-loop alternative

You may be thinking to yourself "Hmm, lapply acts a lot like a for-loop…".

Well, you're right! In both situations, you're iterating over a series of specified values, and those values serve to control the input to other functions.

So, which should be preferred? An apply function, or a for loop?

In reality, the question has no definitively right or wrong answer, and a careful consideration of the task at hand should determine which you should use.

For instance, apply functions don't require lots of boilerplate around them, and automatically create the output structure for you. But, loops work better when you need to replace values in an existing data structure.

lapply and flexibility

One reason users often prefer loops over using apply statements is that the function call in apply statements feels "rigid".

By default, each element of the data structure is used as the first argument for the specified function (not unlike pipes). On the other hand, you as the user are always in complete control of any function calls you explicitly write out inside a for loop.

However, this is a false dichotomy, and there are several simple techniques that give you complete control over the arguments to the function call inside the *apply functions.

Using the … ellipsis

Each *apply functions includes the ellipsis (the ...) in their formal argument lists.

The ellipsis is used to pass any named arguments not matching the *apply function's native named arguments (i.e., any arguments not named X or FUN) to the function you've specified.

To demonstrate, we'll include an argument named from in our call to lapply(c(8,6), seq). The value of the from argument will be passed to seq and change the starting value of the sequence.

lapply(c(8,6), seq, from = 3) # from 3 to 8, and from 3 to 6

## [[1]]
## [1] 3 4 5 6 7 8
## 
## [[2]]
## [1] 3 4 5 6

Using an anonymous function

Another useful way to control the arguments to your function are to supply anonymous functions as the FUN argument.

Anonymous functions are normal function declarations that are not assigned a name in your environment. They are declared "in-line" (i.e., inside another function call), and used as "one-off" tools when you need to supply a function as an argument, but have no other use for that specific function.

Here we'll use an anonymous function to get seqeuncecs starting at 15, and going to 8 and 6, respectively.

lapply(c(8,6), function(x) seq(from = 15, to = x)) # 6 and 8 become "x"

## [[1]]
## [1] 15 14 13 12 11 10  9  8
## 
## [[2]]
##  [1] 15 14 13 12 11 10  9  8  7  6

Practical lapply

Needing to create factor variables instead of numeric variables before using aov to perform an ANOVA often leads to the following style of code:

data$Var1 <- as.factor(data$Var1)
data$Var2 <- as.factor(data$Var2)
data$Var3 <- as.factor(data$Var3)
data$Var4 <- as.factor(data$Var4)

Code like this is "brittle", in that it is repetitive and error prone (4 chances to make a mistake).

Practical lapply

Because a data frame is actually a list, we can add or overwrite columns with elements from a list, provided that list has equal-length vectors as its elements. Now we can replace that brittle, repetetive code with a single expression!

To see how, lets start by reading in the marathon dataset.

marathon <- read.csv("http://wjhopper.github.io/psych640-labs/data/marathon.csv")

Practical lapply

As you may recall, the marathon dataset has the race results and demographic information about the runners in the 2009 Credit Union Cherry Blossom Ten Mile Run in Washington D.C. If we were analyzing these data, it might be useful to have our demographic variables as factors.

We can convert the 8 demographic variables by l-applying factor to each column (because the columns are the "elements" of a data frame), and overwriting those columns in the marathon data frame with the resulting list (which works because a data frame is secretly a list!).

marathon[,4:11] <- lapply(marathon[,4:11], factor) # "factor" columns 4:11
class(marathon[,6]) # Lets check on one

## [1] "factor"

sapply: The "Apply and Simplify" function

The syntax and behavior of the sapply function is identical to the lapply function, except in one regard.

By default, sapply will try to "simplify" its output, meaning that if possible, it will return its results as a vector, matrix, or array. lapply on the other hand, always returns a list.

You can control how sapply tries to perform its simplification with the SIMPLIFY argument. The default is TRUE, to perform simplification, and it can be set to FALSE to disable it, or "array" if you want to force it the output to not be a vector.

sapply: The "Apply and Simplify" function

To see this simplification in action, contrast the following two examples.

We can find out what the unique values in the "city", "state", and "country" variables are, by using sapply to call the unique function on each of those columns.

x <- sapply(marathon[,c("city","state","country")], unique)
str(x)

## List of 3
##  $ city   : Factor w/ 1259 levels "Aberdeen","Abingdon",..: 1027 743 48 375 1172 705 1073 437 1071 30 ...
##  $ state  : Factor w/ 51 levels "AE","AK","AL",..: 21 46 9 34 20 35 38 27 30 1 ...
##  $ country: Factor w/ 13 levels "AUS","CAN","ETH",..: 13 2 7 3 12 4 9 5 10 6 ...

These results have to be represented in a list because each column has a different number of unique values (1259 unique cities, 51 unique states, and 13 unique countries).

sapply: The "Apply and Simplify" function

If all we were interesting in was finding out how many unique values there were in each variable, we could do the following:

x <- sapply(X = marathon[,c("city","state","country")],
            FUN = function(x) length(unique(x)))
x

##    city   state country 
##    1259      51      13

This time, our output is a vector.

This is because calling length(unique(x)) on each column x produces the same type of result each time: a single numeric value. sapply then assembles the results using the simplest data structure possible for combining lots of numeric values: an atomic vector!

vapply: The "Apply and Match to Template" function

The vapply function is a "safe" version of sapply that is preferable in situations when your results MUST be in a specific data structure.

The FUN.VALUE argument to vapply allows you to define a "template" result, and each individual result must conform to the template or vapply throws an error.

Think of this as a safety feature: You can specify a-priori what form the results of applying this function should be, and if they don't match, something has gone terribly wrong.

vapply: The "Apply and Match to Template" function

If we wanted to enforce that the results of calling length(unique(x)) on each element of the input must return a single numeric value, you would do the following:

The numeric(1) expression is our template, and vapply checks to make sure each result is a numeric of length 1.

x <- vapply(X = marathon[,c("city","state","country")],
            FUN = function(x) length(unique(x)),
            FUN.VALUE = numeric(1))
print(x)

##    city   state country 
##    1259      51      13

Remember, our template specifies the size and type of result that is expected, not its value.

When template matching fails

If the result of calling FUN on an element of our data structure doesn't pass validation (i.e., it is not the same size and type as the template), you get an error:

Here, our data structure is a list of vectors with 2 elements each, and our template specifies that the given function must return a numeric of length 1. But because our given function just returns its input (in this case, vectors with 2 elements), this call will fail.

x <- vapply(X = list(c(10, 25), c(3, 11)),
            FUN = function(x) return(x), # just returns its input!
            FUN.VALUE = numeric(1))

## Error in vapply(X = list(c(10, 25), c(3, 11)), FUN = function(x) return(x), : values must be length 1,
##  but FUN(X[[1]]) result is length 2

apply: "Apply to Array Dimension" function

As was mentioned in the "notes" columns in the table about data structure elements, matrices and arrays are just vectors with a special dim attribute that says hows the elements should be arranged into rows, columns, pages, etc.

So, we can use lapply/sapply/vapply to operate on each individual "cell" in matrices and arrays, just as though it were a vector without dimensionality.

But in practice, we never use a matrix or array in place of a vector, but instead give the different rows and columns semantic significance (e.g., each row might represent a different subject, and the columns might be different variables measured from each subject).

apply: "Apply to Array Dimension" function

The apply function (no prefix!) allows you to treat one of the dimensions in a matrix/array as defining the "elements" of the matrix or array, and apply the specified function to each unit along that dimension.

You control which dimension is treated as an "element" with the MARGIN argument. MARGIN = 1 means treat the rows as distinct elements, MARGIN = 2 means treat the columns as distinct elements, and so on through more dimensions.

In other words, it allows you say "Apply this specific function to each of the rows" or "Apply this specific function to each of the columns".

apply: The "Apply to Array Dimension" function

For example, we could use apply to sort each column of a matrix from smallest to largest value.

x <- matrix(rnorm(12), nrow = 4, dimnames = list(NULL, c("A","B","C")))
x

##                A          B          C
## [1,]  0.01874617  0.2945451 -1.6266727
## [2,] -0.18425254  0.3897943 -0.2564784
## [3,] -1.37133055 -1.2080762  1.1017795
## [4,] -0.59916772 -0.3636760  0.7557815

apply(x, MARGIN = 2, FUN = sort)

##                A          B          C
## [1,] -1.37133055 -1.2080762 -1.6266727
## [2,] -0.59916772 -0.3636760 -0.2564784
## [3,] -0.18425254  0.2945451  0.7557815
## [4,]  0.01874617  0.3897943  1.1017795

apply: The "Apply to Array Dimension" function

However, when you apply a function to the rows of a matrix, apply transposes the resulting matrix.

sorted_x <- apply(x, MARGIN = 1, FUN = sort)
sorted_x

##             [,1]       [,2]      [,3]       [,4]
## [1,] -1.62667268 -0.2564784 -1.371331 -0.5991677
## [2,]  0.01874617 -0.1842525 -1.208076 -0.3636760
## [3,]  0.29454513  0.3897943  1.101780  0.7557815

Though we started out with a 4 by 3 matrix, we end up with a 3 by 4 matrix after applying sort to each row!

apply: The "Apply to Array Dimension" function

Since this is probably not what you want, you'll have to transpose the output of apply when you treat the rows as the "elements".

t(sorted_x)

##            [,1]        [,2]      [,3]
## [1,] -1.6266727  0.01874617 0.2945451
## [2,] -0.2564784 -0.18425254 0.3897943
## [3,] -1.3713305 -1.20807618 1.1017795
## [4,] -0.5991677 -0.36367602 0.7557815

You'll have to do this whenever you apply a function to the rows of a matrix, and the final result is a matrix.

Summary

lapply applies the given function to each element in a data structure, and returns a list.

data <- list(A = c(5,1,10), B = c(11,8,3))
lapply(X = data, FUN = mean)

## $A
## [1] 5.333333
## 
## $B
## [1] 7.333333

sapply applies the given function to each element in a data structure, and returns the simplest possible data structure.

sapply(X = data, FUN = mean) # same input, but vector output

##        A        B 
## 5.333333 7.333333

Summary

vapply applies the given function to each element in a data structure, and returns the simplest possible data structure. Additionally, each individual result must match a given output template.

data <- list(A = c(5,1,10), B = c(11,8,3))
vapply(X = data, FUN = mean, FUN.VALUE = numeric(1))

##        A        B 
## 5.333333 7.333333

apply applies the given function along a specific dimension (e.g., rows or columns) of a matrix, array, or data frame, and returns the simplest possible data structure.

data <- matrix(c(5,1,10,11,8,3), ncol = 2)
apply(data, MARGIN = 2, FUN = mean)

## [1] 5.333333 7.333333

Activity

Find the median value of each variable in the mtcars dataset. What data structure do you think is most appropriate for holding the results?

Advanced Extras

Evaluating Expressions

What does it mean to say our programs evaluating expressions that do not have side effects or change global state?

Well, lets run the following code and describe what happens:

## [1] 1

x <- 1
x + 1

## [1] 2

x + 1

## [1] 2

No Side Effects

## [1] 1

When we run the expression 1, we the number 1 gets printed to the console. And nothing else.

Nothing about our global environment changes; we have to explicitly ask for an variable assignment with <-.

No Side Effects

x <- 1
x + 1

## [1] 2

x + 1

## [1] 2

When we assign x to be 1, and then add 1 to x, we get the same thing both times we do the addition; the value of x is unchanged by adding 1.

In other words, executing the expression x + 1 has no effect on our environment. We have to explicitly ask to make a new variable, or change the value of x.