Manipulating Data, Part 3: The Data Returns

2018-05-04

Replacing and Removing Values

You can use an indexing operation on the left-hand side of an assignment to remove or replace values in your data structure. The basic recipe looks like:

DataObject[IndexVector] <- NewValues

A note of caution: this is an irreversible operation, so make a backup copy of your data structure if you're uncertain what will happen:

backup_object <- DataObject

DataObject[LogicalCriteria] <- NewValues
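
As a concrete instance of the recipe, here is a minimal sketch. The vector `temperatures` and the placeholder value -99 are hypothetical and used only for illustration.

```r
# Hypothetical readings; -99 stands in for a "missing" code
temperatures <- c(21.5, -99, 19.0, -99, 23.2)
backup_temperatures <- temperatures   # keep a copy before overwriting

# The logical index picks out the elements to overwrite
temperatures[temperatures == -99] <- NA
print(temperatures)

## [1] 21.5   NA 19.0   NA 23.2
```

If the result is not what you expected, `backup_temperatures` still holds the original values.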

Replacing Values

Let's change some of the pesticide names in the spray column of the InsectSprays data frame to be more informative than just "A", "B", "C", etc.

First, coerce the spray variable from a factor vector into a character vector, for reasons…

InsectSprays$spray <- as.character(InsectSprays$spray)
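
One likely reason for the coercion (our reading, not something the text spells out): a factor only accepts values that are already among its levels. The sketch below works on a fresh copy, here called `sprays` (a hypothetical name), to show what happens without the coercion.

```r
# Work on a fresh copy so the InsectSprays object used above is untouched
sprays <- datasets::InsectSprays
levels(sprays$spray)

## [1] "A" "B" "C" "D" "E" "F"

# "fairy_dust" is not an existing level, so R warns and stores NA instead
sprays[sprays$spray == "B", "spray"] <- "fairy_dust"
sum(is.na(sprays$spray))

## [1] 12
```

All 12 observations for spray "B" become NA, which is why the column is converted to character first.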

Then, subset the combination of rows and columns you wish to overwrite, and assign a replacement value to them.

InsectSprays[InsectSprays$spray == "A", "spray"] <- "SPRAY_OF_DOOOOM"
InsectSprays[InsectSprays$spray == "B", "spray"] <- "fairy_dust"
InsectSprays[c(1, 21), ]

##    count           spray
## 1     10 SPRAY_OF_DOOOOM
## 21    19      fairy_dust

Removing Columns or List Elements

You can remove a single column from a data frame or a single element from a list by setting its value to the NULL object.

backup_iris <- iris
ncol(iris)

## [1] 5

iris$Sepal.Length <- NULL
str(iris)

## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
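
The same assignment works for list elements. A minimal sketch, using a hypothetical list named `settings`:

```r
# Hypothetical list used only for illustration
settings <- list(alpha = 0.05, method = "holm", notes = "draft")

settings$notes <- NULL   # drop one element by name
names(settings)

## [1] "alpha"  "method"
```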

Removing Multiple Rows and Columns

Unfortunately, this method of assigning values to be NULL isn't a general solution for all data structures.

Instead, we'll have to subset our data structure and overwrite the existing variable with the subset.

In these situations, it's sometimes useful to reframe the problem: instead of thinking of your task as "delete these things", try thinking of it as "keep everything else".
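
The reframing can be sketched as follows. The data frame `trial` and its columns are hypothetical.

```r
# Hypothetical data frame used only for illustration
trial <- data.frame(id = 1:4,
                    dose = c(10, 20, 10, 40),
                    flagged = c(FALSE, TRUE, FALSE, FALSE))

# "Delete the flagged rows" becomes "keep the rows that are not flagged"
trial <- trial[!trial$flagged, ]

# "Delete the flagged column" becomes "keep every column except flagged"
trial <- trial[, setdiff(names(trial), "flagged")]
dim(trial)

## [1] 3 2
```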

For instance, if you want to remove the first 5 rows of a matrix or data frame, you can use negative integers in your index vector.

iris <- backup_iris  # restore the Sepal.Length column removed earlier
iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
##  [ reached getOption("max.print") -- omitted 148 rows ]

iris <- iris[-(1:5),]
iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
##  [ reached getOption("max.print") -- omitted 143 rows ]

But, if you wanted to remove all the Petal-related columns from the iris dataset, you could accomplish that by subsetting just the Sepal.Length, Sepal.Width, and Species columns; you delete the Petal-related columns by omitting them.

iris <- backup_iris  # start again from the full 150-row, 5-column data frame
sepal_only_data <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]
str(sepal_only_data)

## 'data.frame':    150 obs. of  3 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Activity

Use the airquality data frame and do the following:

Remove the Wind column.
Use the is.na() function to find and remove rows that are missing observations in the Ozone column.
Replace the entries in the Day column that have value 1 with the character string 'Sunday'.

Combining Data Structures

While subsetting and replacing values are probably more common operations, you'll often want to add data to an existing data structure.

Some functions commonly used to grow and combine data structures are:

c(): Concatenate
$: Extract/Replace
cbind(): Combine horizontally (side to side)
rbind(): Combine vertically (top to bottom)

Concatenation

As mentioned previously, the c() function is used to concatenate multiple vectors into a single vector.

Because individual numeric and character values are interpreted as vectors of length 1, c() can be used to make "new" vectors.

odd <- c(1, 3, 5)
even <- c(2, 4, 6)
c(odd, even)

## [1] 1 3 5 2 4 6

Because lists are also considered vectors, the c() function can be used to combine multiple lists into one.

x <- list(odd = c(1, 3, 5))
y <- list(even = c(2, 4, 6))
z <- c(x, y)
print(z)

## $odd
## [1] 1 3 5
## 
## $even
## [1] 2 4 6

Adding new data to lists

The $ operator can be used to add new named variables to lists:

z$random <- matrix(rnorm(6), ncol = 3)
print(z)

## $odd
## [1] 1 3 5
## 
## $even
## [1] 2 4 6
## 
## $random
##            [,1]        [,2]      [,3]
## [1,] -0.1017610 -1.85374045 0.9685663
## [2,] -0.2537805 -0.07794607 0.1849260

Adding new data to data frames

The $ operator can also add new named variables to data frames. Here, there is an additional constraint: the new variable must have as many elements as the data frame has rows, or be recyclable to that length.

options(stringsAsFactors = FALSE)
words <- data.frame(word = c("good", "night"))
words$n_letters <- c(4, 5)
print(words)

##    word n_letters
## 1  good         4
## 2 night         5
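
To illustrate the recycling constraint, here is a minimal sketch. The data frame `inventory` and its columns are hypothetical.

```r
# Hypothetical data frame with four rows
inventory <- data.frame(item = c("nut", "bolt", "screw", "washer"))

inventory$in_stock <- TRUE              # length 1, recycled to all 4 rows
inventory$shelf <- c("left", "right")   # length 2, recycled twice
inventory$shelf

## [1] "left"  "right" "left"  "right"

# A length-3 vector does not divide evenly into 4 rows, so this fails;
# try() captures the error so the script keeps running
outcome <- try(inventory$bad <- c(1, 2, 3), silent = TRUE)
inherits(outcome, "try-error")

## [1] TRUE
```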

Column-binding

The cbind() function (short for "column bind") can also add new variables to data frames, but it is more general than the $ operator.

For example, multiple columns can be added simultaneously:

words <- cbind(words,
               freq = c("high", "low"),
               opposite = c("evil", "day"))
print(words)

##    word n_letters freq opposite
## 1  good         4 high     evil
## 2 night         5  low      day

You can also combine two data frames with the same number of rows into one:

df_one <- data.frame(letters)
df_two <- data.frame(rnorm(length(letters)))
cbind(df_one, df_two)

##    letters rnorm.length.letters..
## 1        a            -1.37994358
## 2        b            -1.43551436
## 3        c             0.36208723
##  [ reached getOption("max.print") -- omitted 23 rows ]

The cbind() function can also combine matrices, and combine multiple vectors into a matrix:

x <- cbind(odd, even) # odd = c(1,3,5), even = c(2,4,6)
y <- cbind(x, matrix(rnorm(n = 6), nrow = 3))
print(y)

##      odd even                     
## [1,]   1    2 -1.2375945 0.3401156
##  [ reached getOption("max.print") -- omitted 2 rows ]

Row-binding

The rbind() function provides operations equivalent to those of the cbind() function, but combines data structures vertically instead of horizontally.

x <- rbind(odd, even)
y <- rbind(x, matrix(rnorm(n = 6), ncol = 3))
print(y)

##            [,1]       [,2]       [,3]
## odd   1.0000000  3.0000000  5.0000000
## even  2.0000000  4.0000000  6.0000000
##  [ reached getOption("max.print") -- omitted 2 rows ]

The rbind() function can be used with data frames as well, but there are some caveats.

If you want to add a vector to a data frame as a new row, the vector must:

Have the same number of elements as there are columns in the data frame
Have the same type as (or be coercible to) the type in the corresponding column of the data frame.
- This one can be tricky, because you can't mix types in a vector
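
The type caveat can be sketched like this. The names `words_demo`, `mixed`, and `typed` are hypothetical, and passing a one-row data frame is just one possible workaround.

```r
words_demo <- data.frame(word = c("good", "night"),
                         n_letters = c(4, 5),
                         stringsAsFactors = FALSE)

# c("moon", 4) is coerced to the character vector c("moon", "4"),
# so n_letters becomes a character column after rbind()
mixed <- rbind(words_demo, c("moon", 4))
class(mixed$n_letters)

## [1] "character"

# A one-row data frame keeps each value's own type
typed <- rbind(words_demo,
               data.frame(word = "moon", n_letters = 4,
                          stringsAsFactors = FALSE))
class(typed$n_letters)

## [1] "numeric"
```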

If you want to combine two data frames together vertically (i.e., top to bottom), then they must have:

The same number of columns
The same column names

So, you can't rbind() two data frames with different variables.
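
A minimal sketch of both cases, using hypothetical one-row data frames:

```r
first  <- data.frame(word = "good", n_letters = 4,
                     stringsAsFactors = FALSE)
# Same column names as `first`, but in a different order
second <- data.frame(n_letters = 5, word = "night",
                     stringsAsFactors = FALSE)

# Columns are matched by name, so the differing order is handled
combined <- rbind(first, second)
combined$word

## [1] "good"  "night"

# Different column names fail; try() captures the error
mismatch <- data.frame(word = "moon", length = 4,
                       stringsAsFactors = FALSE)
result <- try(rbind(first, mismatch), silent = TRUE)
inherits(result, "try-error")

## [1] TRUE
```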

Merging Datasets

Sometimes you'll want to combine two datasets based on common values in a shared variable.

We'll cover these types of merging operations, called "joins", later in the course. But if you are certain you need to understand how to do this before we cover them in class, come see me =)