2018-05-04

Indexing

All throughout your data analysis pipeline, you will face the need to take smaller chunks out of a larger data structure. Sometimes you will need to change the data that is stored in your structure, or use it as input to another function, or perhaps you need to plot it.

The task of slicing a smaller chunk out of a larger data structure is called indexing in R (sometimes called subsetting depending on context), and is performed using the square bracket characters [].

Indexing 'Ingredients'

Indexing a data structure in R requires 4 ingredients.

  1. An R object that supports indexing
    • e.g. A data frame, matrix, vector or list
  2. An opening square bracket [
  3. 1 (or more) vectors which indicate which values from the larger data structure should be pulled out.
  4. a closing square bracket ]

Arrange the ingredients in an R expression like so:

DataStructure[IndexVector]

We will focus on learning what can go inside the square brackets for different types of R data structures.

Indexing Vectors

Broadly, there are 2 types of indexing vectors that are useful inside the square brackets.

  1. Numeric Vectors
    • e.g. c(1,4,5,6,10)
  2. Logical Vectors
    • e.g. c(TRUE,FALSE,FALSE,TRUE)

We'll start with numeric indexing vectors to get a feel for the general procedure, and move up to logical indexing.

Numeric Index Vectors

We'll start by slicing smaller chunks out of a larger vector. Here, the numeric vectors inside the brackets tell R the positions of the elements we wish to extract.

alphabet <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o",
              "p","q","r","s","t","u","v","w","x","y","z") 
alphabet[c(1,26)] # Extract First and 26th element
## [1] "a" "z"
alphabet[10:20] # Extract tenth through 20th
##  [1] "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
alphabet[seq(from=1,to=26,by=2)] # Extract every other element
##  [1] "a" "c" "e" "g" "i" "k" "m" "o" "q" "s" "u" "w" "y"

Common Errors with Numeric Indices

The most common mistake is including a value in your indexing vector that is greater than the length of the vector you are subsetting.

alphabet[100] # there are not 100 letters in the alphabet
## [1] NA

The NA means the value is missing. Many other languages would raise an "index out of bounds" error here, but R does not: it silently returns NA, so this mistake is easy to overlook.
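One way to guard against this, sketched below, is to ask R how long the vector is before indexing it. The example assumes the built-in constant letters as a stand-in for the alphabet vector defined earlier.

```r
# Assumption: `letters` is R's built-in vector of the 26 lowercase letters,
# used here as a stand-in for the alphabet vector defined earlier.
alphabet <- letters

length(alphabet)             # how many elements can we safely ask for?
## [1] 26
alphabet[c(1, 26, 100)]      # position 100 does not exist, so R returns NA
## [1] "a" "z" NA
alphabet[length(alphabet)]   # last element, without hard-coding 26
## [1] "z"
```

Using length() instead of a literal 26 keeps the code correct if the vector later grows or shrinks.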

Another common mistake is forgetting to concatenate the values you want to use for the indexing vector (i.e. forgetting the c() function).

alphabet[1,5,10]
## Error in alphabet[1, 5, 10]: incorrect number of dimensions
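The fix is to wrap the positions in c() so that R sees a single index vector. The short sketch below shows the corrected call next to the failing one; it assumes the built-in letters constant as a stand-in for alphabet, and uses try() only so the example keeps running past the error.

```r
alphabet <- letters   # built-in stand-in for the alphabet vector above

alphabet[c(1, 5, 10)]   # one index vector, built with c()
## [1] "a" "e" "j"

# Without c(), the commas are read as separating dimensions, which a
# plain vector does not have. try() lets the script continue past the error.
result <- try(alphabet[1, 5, 10], silent = TRUE)
inherits(result, "try-error")
## [1] TRUE
```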

Indexing Tricks

Instead of creating a vector of values you do want to pick out, it may be easier to come up with a vector of ones you don't want. We can use negative numbers to specify which vector elements we don't want.

alphabet[c(-1,-26)] # Same as alphabet[2:25]
##  [1] "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r"
## [18] "s" "t" "u" "v" "w" "x" "y"
alphabet[-1:-10]
##  [1] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

Indexing with positive vectors is usually preferred because the intent of the code is clearer, but negative indices are sometimes simpler and easier to read (e.g. when dropping the first or last value).
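As a small sketch of the "drop the first or last value" case, negative indices can be combined with length() so that the code does not depend on knowing the vector's size. The variable n and the use of the built-in letters constant are illustrative choices, not part of the original example.

```r
alphabet <- letters       # built-in stand-in for the alphabet vector above
n <- length(alphabet)     # 26 here, but the code works for any length

head(alphabet[-1], 3)     # drop the first element, show the first three left
## [1] "b" "c" "d"
tail(alphabet[-n], 3)     # drop the last element, show the last three left
## [1] "w" "x" "y"

# Dropping both ends gives the same result as the positive range 2:(n - 1)
identical(alphabet[-c(1, n)], alphabet[2:(n - 1)])
## [1] TRUE
```

Note that R will not let you mix positive and negative positions in the same index vector; pick one style per set of brackets.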

Activity

Execute the following code and look at the values in months:

months <- c("January", "February", "March", "April", "May", "June", "July",
            "August", "September", "October", "November", "December")
months
##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"

Now, do the following:

  1. Index months to pull out "February" and "March"
  2. Index months to pull out every third month

Logical Indexing

When performing logical indexing, you supply a vector specifying whether to extract a specific element (with a TRUE) or to not extract a specific element (with a FALSE).

Let's revisit the example of selecting the first and last elements of the alphabet vector. We make a vector of logicals, one per element, and put it in the square brackets after the vector.

alphabet[c(1,26)]
## [1] "a" "z"
alphabet[c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
           FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
           FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)]
## [1] "a" "z"

But this specific example is not a good use case for logical vectors. Why?

  1. Longer Code: the logical vector should be as long as the object it's subsetting (a shorter one is silently recycled, which is rarely what you intend).
  2. Duplicating work: If you already know the position of the elements you want, just put them into a vector and you're done!

The logical vector's utility comes into play when you don't know the numeric positions of the elements you are interested in.
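A brief aside on vector length: if the logical vector is shorter than the vector being indexed, R repeats (recycles) it until it is long enough. The sketch below, which assumes the built-in letters constant as a stand-in for alphabet, shows this producing "every other element".

```r
alphabet <- letters   # built-in stand-in for the alphabet vector above

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, ... across all 26
alphabet[c(TRUE, FALSE)]
##  [1] "a" "c" "e" "g" "i" "k" "m" "o" "q" "s" "u" "w" "y"
```

Recycling can be handy, but it also means a logical vector of the wrong length will not raise an error, so it is worth checking lengths when a result looks odd.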

Logical Tests

"But wait," you say, "if we don't know where they are already, how are we going to find them?" This brings us to logical testing and relational operators.

Relational operators are R expressions that test whether some value meets a condition or not.

  • If the value meets your test's condition(s), the test returns TRUE
  • If the value does not meet your test's condition(s), the test returns FALSE.

You already know lots of relational operators. The equal to, greater than and less than expressions from 3rd grade math are all relational operators!

Relational Operators Table

Comparison                Expression  Examples
Less Than                 <           5 < 10, 5 < 1
Less Than or Equal To     <=          5 <= 5, 5 <= 1
Greater Than              >           10 > 5, 5 > 5
Greater Than or Equal To  >=          10 >= 10, 10 >= 12
Equal To                  ==          5 == 6, 5 == 5
Not Equal To              !=          5 != 6, 5 != 5
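To see what each test returns, the examples from the table can be typed straight into R. The sketch below groups each pair with c() purely to keep the output compact.

```r
c(5 < 10, 5 < 1)        # less than
## [1]  TRUE FALSE
c(5 <= 5, 5 <= 1)       # less than or equal to
## [1]  TRUE FALSE
c(10 > 5, 5 > 5)        # greater than
## [1]  TRUE FALSE
c(10 >= 10, 10 >= 12)   # greater than or equal to
## [1]  TRUE FALSE
c(5 == 6, 5 == 5)       # equal to (two equals signs, not one)
## [1] FALSE  TRUE
c(5 != 6, 5 != 5)       # not equal to
## [1]  TRUE FALSE
```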

Relational Operators & Indexing

What makes relational operators useful is that they can be applied to all the elements of a data structure simultaneously.

x <- 2:11
print(x)
##  [1]  2  3  4  5  6  7  8  9 10 11
x <= 5 # Apply the less than or equals test
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

As you can see, values that meet the criteria (<= 5) return as TRUE.

x[x <= 5] # Index vector x with the results of the test. 
## [1] 2 3 4 5

When this logical vector is used to index the vector x, only the elements where the logical vector has value TRUE are returned.
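The same logical index also works on the left-hand side of an assignment, which covers the "change the data stored in your structure" case mentioned at the start. A minimal sketch, reusing x from above; the replacement value 0L is an arbitrary choice for illustration:

```r
x <- 2:11
x[x <= 5] <- 0L   # overwrite only the elements that pass the test
x
##  [1]  0  0  0  0  6  7  8  9 10 11
```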

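We usually index character vectors using the == and != operators. The greater/less than operators do work on characters in R, but they compare alphabetical sort order (which depends on your locale) rather than quantity, so they are rarely what you want.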

months == "June" # The sixth element is TRUE
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE
months[months == "June"]
## [1] "June"
months[months != "July"]
##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "August"    "September" "October"   "November" 
## [11] "December"
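For completeness, here is a hedged sketch of what the greater/less than operators do on character vectors: R accepts them and compares strings by sort order. It assumes the built-in month.name constant as a stand-in for the months vector, and the exact ordering of strings can vary with your locale.

```r
# Assumption: month.name is R's built-in vector of English month names,
# standing in for the months vector created in the activity above.
months <- month.name

"apple" < "banana"       # compares sort order, not quantity
## [1] TRUE
months[months < "F"]     # names that sort before the letter "F"
## [1] "April"    "August"   "December"
```

Because the result depends on sorting rules rather than any numeric meaning, == and != remain the safer tools for character data.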

Other Useful Tests: is.na()

Unfortunately, we often have to deal with missing observations in real world data sets. R codes missing data as NA; undefined numeric results show up as NaN ("not a number"), which is.na() also treats as missing. We can use the is.na() function to find any missing values in a vector.

missingno <- c(10,NA,1,4,2,NA,NA,99,NaN, NA)
is.na(missingno)
##  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
missingno[!is.na(missingno)] # Select things that are the opposite of missing
## [1] 10  1  4  2 99
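A common stumbling block is trying to find missing values with ==. The sketch below reuses missingno to show why is.na() is needed, and how is.nan() picks out only the NaN entries.

```r
missingno <- c(10, NA, 1, 4, 2, NA, NA, 99, NaN, NA)

missingno == NA          # comparing anything to NA gives NA, never TRUE
##  [1] NA NA NA NA NA NA NA NA NA NA
is.nan(missingno)        # TRUE only for NaN, not for plain NA
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
sum(is.na(missingno))    # TRUE counts as 1, so this counts missing values
## [1] 5
mean(missingno[!is.na(missingno)])   # summary of the non-missing values
## [1] 23.2
```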

Other Useful Tests: modulus division

Another important arithmetic operator is the modulus operator %%, which gives us the remainder of division. For example: 5 %% 3 is 2, because 3 goes into 5 once, with 2 left over.

A common use case for the %% operator is to search for multiples of a number. We can do this by exploiting the fact that if one number is a multiple of another, the remainder of division will be 0.

lisa <- c(34, 509, 63, 187, 998, 78, 3330)
lisa %% 17
## [1]  0 16 12  0 12 10 15
lisa %% 17 == 0
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
lisa[lisa %% 17 == 0]
## [1]  34 187
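The same pattern works for any divisor. As a small additional sketch, here is the classic even/odd test on the numbers 1 through 10 (the vector name x is just for illustration):

```r
5 %% 3               # the example from the text
## [1] 2

x <- 1:10
x %% 2               # remainder after dividing by 2: 0 means even
##  [1] 1 0 1 0 1 0 1 0 1 0
x[x %% 2 == 0]       # keep only the even numbers
## [1]  2  4  6  8 10
```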

Tips and Tricks

A useful function to know is which(). When used on a logical vector, it returns the position indices of the vector's TRUE elements. It is useful when you want to know where in the vector your matches occur.

lisa %% 17 == 0
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
which(lisa %% 17 == 0)
## [1] 1 4
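The positions returned by which() are ordinary numeric indices, so they can be stored and reused. The sketch below (the name hits is illustrative) also shows two behaviours worth knowing: which() skips NA values, and it returns an empty result when nothing matches.

```r
lisa <- c(34, 509, 63, 187, 998, 78, 3330)

hits <- which(lisa %% 17 == 0)   # store the matching positions
lisa[hits]                       # reuse them as a numeric index
## [1]  34 187
length(hits)                     # how many matches were found
## [1] 2

which(c(TRUE, NA, FALSE, TRUE))  # NA is neither TRUE nor FALSE, so it is skipped
## [1] 1 4
which(lisa > 5000)               # no matches: an empty integer vector
## integer(0)
```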

Activity

Write an expression using logical testing and indexing that, when applied to the vector dummy, returns the same output as the numeric indexing example shown below.

dummy <- 17:23
dummy[5:7]
## [1] 21 22 23

Testing Multiple Conditions

Sometimes you need to select elements based on multiple conditions. For example, you might want to select only those values that are less than 4 standard deviations above or below the mean.

In R, we can select elements based on multiple conditions by combining individual logical tests together using logical operators. The logical operations we have at our disposal are:

  • AND
    • Each element must meet all conditions to return TRUE
  • OR
    • Each element must meet at least one condition to return TRUE
  • Negation
    • Reverse the current logical (e.g., TRUE becomes FALSE)

Logical Operators

Logic            Expression  Example                   Result
Elementwise AND  &           c(1,3) > 0 & c(1,3) <= 2  TRUE, FALSE
Scalar AND       &&          3 > 0 && 3 <= 2           FALSE
Elementwise OR   |           c(1,3) > 0 | c(1,3) <= 2  TRUE, TRUE
Scalar OR        ||          3 > 0 || 3 <= 2           TRUE
Negate           !           !(3 > 0)                  FALSE

The elementwise operators test all their arguments (i.e., they test all the pairs of elements of the logical vectors supplied) and return a vector the same length as the input.

The scalar operators are meant for comparing a single value to another single value, and return a single TRUE or FALSE. In older versions of R they silently tested only the first pair of elements when given longer vectors; since R 4.3.0, passing them a vector longer than one is an error.
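The difference is easiest to see side by side. In the sketch below, the elementwise operators are used for indexing-style tests on a vector, while the scalar form is kept for a single value inside an if statement. The variable names x and value are illustrative.

```r
x <- c(1, 3)
x > 0 & x <= 2       # elementwise AND: one result per element
## [1]  TRUE FALSE
x > 0 | x <= 2       # elementwise OR
## [1] TRUE TRUE
!(x > 0)             # negation flips every element
## [1] FALSE FALSE

# Scalar AND: both sides are single values, as if() expects
value <- 3
if (value > 0 && value <= 2) "in range" else "out of range"
## [1] "out of range"
```

For indexing, the elementwise operators & and | are the ones you want, because the index must have one TRUE or FALSE per element.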

Indexing with Multiple Conditions

Let's say we wanted to select the elements in lisa which were less than 500 or greater than 1000.

lisa
## [1]   34  509   63  187  998   78 3330

To do this, we need to:

  1. Test each element of lisa to see if it is less than 500
    • lisa < 500
  2. Test each element of lisa to see if it is greater than 1000
    • lisa > 1000
  3. Combine the results of both tests together into a single test
    • ???

The goal is to have our test return a TRUE for each element that passes the "less than 500" test OR passes the "greater than 1000" test.

Expression                Element 1  Element 2  Element 3  Element 4  Element 5  Element 6  Element 7
lisa                      34         509        63         187        998        78         3330
lisa < 500                TRUE       FALSE      TRUE       TRUE       FALSE      TRUE       FALSE
lisa > 1000               FALSE      FALSE      FALSE      FALSE      FALSE      FALSE      TRUE
Pass Either Test?

We can do that by combining the two tests with the | logical operator

Expression                Element 1  Element 2  Element 3  Element 4  Element 5  Element 6  Element 7
lisa                      34         509        63         187        998        78         3330
lisa < 500                TRUE       FALSE      TRUE       TRUE       FALSE      TRUE       FALSE
lisa > 1000               FALSE      FALSE      FALSE      FALSE      FALSE      FALSE      TRUE
lisa < 500 | lisa > 1000  TRUE       FALSE      TRUE       TRUE       FALSE      TRUE       TRUE

lisa[lisa < 500 | lisa > 1000]
## [1]   34   63  187   78 3330
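It can help to store the combined test in a variable before indexing, so that it can be inspected, counted, or negated. A minimal sketch, with keep as an illustrative name:

```r
lisa <- c(34, 509, 63, 187, 998, 78, 3330)

keep <- lisa < 500 | lisa > 1000   # combined test, one logical per element
keep
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
sum(keep)                          # TRUE counts as 1: how many elements pass
## [1] 5
lisa[!keep]                        # negate the test to see what was left out
## [1] 509 998
```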

Now let's say we wanted to select the elements in lisa which were less than 500 and greater than or equal to 50.

To do this, we need to:

  • Test each element of lisa to see if it is less than 500
    • lisa < 500
  • Test each element of lisa to see if it is greater than or equal to 50
    • lisa >= 50
  • Combine the results of both tests together into a single test
    • &

The goal is to have our test return a TRUE for each element that passes the "less than 500" test AND passes the "greater than or equal to 50" test.

Expression                Element 1  Element 2  Element 3  Element 4  Element 5  Element 6  Element 7
lisa                      34         509        63         187        998        78         3330
lisa < 500                TRUE       FALSE      TRUE       TRUE       FALSE      TRUE       FALSE
lisa >= 50                FALSE      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
lisa < 500 & lisa >= 50   FALSE      FALSE      TRUE       TRUE       FALSE      TRUE       FALSE


lisa[lisa < 500 & lisa >= 50]
## [1]  63 187  78
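The same AND pattern covers the outlier-screening idea mentioned earlier: keep only values within some number of standard deviations of the mean. The sketch below uses made-up data (readings) and a cutoff of 2 standard deviations rather than 4, simply because a seven-value sample cannot contain a point 4 standard deviations from its own mean. All names here are illustrative.

```r
# Hypothetical data: six similar readings and one obvious outlier
readings <- c(10, 12, 11, 13, 12, 11, 100)

k <- 2                                     # cutoff, in standard deviations
lower <- mean(readings) - k * sd(readings) # lowest value we will keep
upper <- mean(readings) + k * sd(readings) # highest value we will keep

# Keep elements that are above the lower bound AND below the upper bound
readings[readings > lower & readings < upper]
## [1] 10 12 11 13 12 11
```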

lisa[lisa < 500 & lisa >= 998]
## numeric(0)

When you index a numeric vector in R with a vector of all FALSEs, you get back numeric(0), a numeric vector of length zero, which means "nothing to see here!" (a character vector would give character(0), and so on).

So why does this test and subset return nothing? Because no number can be both less than 500 and greater than or equal to 998, so every test comes back FALSE, and all elements are ignored in the subset.
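An empty result is still a valid vector, so it can be checked in code before you rely on it. The sketch below shows two simple checks, length() and any(), plus the character equivalent of numeric(0). The variable name none is illustrative.

```r
lisa <- c(34, 509, 63, 187, 998, 78, 3330)

none <- lisa[lisa < 500 & lisa >= 998]
length(none)                       # an empty result has length zero
## [1] 0
any(lisa < 500 & lisa >= 998)      # did any element pass the test at all?
## [1] FALSE

letters[letters == "1"]            # same idea on a character vector
## character(0)
```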

Exercise

  1. Use R to find all the numbers between 1 and 10,000 that are multiples of 2 or multiples of 3. How many are there?

  2. Use R to find all the numbers between 1 and 10,000 that are multiples of both 2 and 3. How many are there?