2018-05-04

What is a data type?

  • The data type of piece of information is defined by
    • how the information is represented in hardware (e.g., CPU, memory)
    • the range of values it can take on
    • the types of operations that can be performed with it in your software
  • R uses many different data types, but the most important to learn are integer, double, character, and logical
    • These are the atomic data types in R, because they are the smallest possible building blocks

Integers & Doubles

  • Just as in mathematics, the integer data type can only represent whole numbers
    • e.g. 6
  • The double data type is used to represent all real numbers (e.g.,numbers with a fractional component).
    • e.g. 6.66666667
  • However, the double data type is not perfectly precise, and numbers with a long decimal component must be approximated by the computer.
  • By default, R uses the double data type to represent the numbers you feed into it
  • Doubles and integers together are referred to as the "numeric" data types.

Character

  • The character data type (sometimes referred to as the "string" data type) is used for representing textual data
  • To encode a string of text as character data in R, it must be wrapped in quotes (" " or ' ' are both acceptable)
  • Without the quotes, R will interpret the text as the name of an R object, and attempt to find that object and return its value. Missing quotes is a common source of "object not found" errors
a <- "foobar"
a
## [1] "foobar"
a <- foobar 
## Error in eval(expr, envir, enclos): object 'foobar' not found

Numbers as Numerics vs. Numbers as Characters

  • 4.2 can be represented in R as both a character and a double. But, only in one case can mathematical operations be performed on them.
"4.2" + 1
## Error in "4.2" + 1: non-numeric argument to binary operator
4.2 + 1
## [1] 5.2

This illustrates that practical differences between data types we care about are that some types do not support some operations (e.g. characters do not support math)

Logical

  • Logical data can only take on 2 possible values: 1 (TRUE) or 0 (FALSE)
  • This type of datum is used to represent whether some state exists (is true) or does not exist (is false)
  • TRUE and FALSE must be uppercased
    • can be abbreviated as T and F, but it is not recommended
    • T and F can be over-written (e.g. T <- 2)
    • But TRUE and FALSE are special and can never be overwritten
  • This type of data will be more useful when we learn about data manipulation and program control flow.

Checking Data Types

You can use the typeof function to see what data type an object holds.

typeof(2)
## [1] "double"
typeof(2L) # L at the end makes an integer
## [1] "integer"
typeof("herp")
## [1] "character"
typeof(FALSE)
## [1] "logical"

Coercion

  • Values stored in one data type can often be changed to another data type. This transformation is called coercion
as.double("4.2") + 1
## [1] 5.2
  • Sometimes when an operation requires a specific data type, R will coerce things to the proper types for us.
TRUE + 1 # coerce TRUE to integer and add
## [1] 2
  • But things can't always be magically perfect
"TRUE" + 1
## Error in "TRUE" + 1: non-numeric argument to binary operator

Data Structures

Of course, it is helpful in a language focused on analyzing data to have the ability to group multiple values together.

Think of data structures in R as big containers used for grouping together many values. After storing your data in these containers, you can reuse it multiple places (e.g. create an R object to store it) or access different subsets of it by position or name.

In this course, we will primarily be working with 4 different types of data structures:
  1. Vectors
  2. Matrices
  3. Data Frames
  4. Lists

Vectors

Vectors are 1 dimensional data structures which can hold numeric, logical, and character data.

Data types may not be mixed in a vector (e.g. you cannot have some elements be characters and other be integers)

The individual values held in the vector are referred to as elements, and vectors have a length equal to the number of elements it contains.

Vectors are the most basic data structures in R. In fact, all the scalar values we have worked with so far are represented internally by R as vectors of length 1!!

Creating a vector

Vectors, no matter what type of data they hold, can be created by using the c() function in R, short for concatenate.

Just place each value you want to be included in the vector inside the parenthesis, separated with a comma.

new_vector <- c(1,10,45,-1)
char_vector <- c("foo","bar","herp","derp")

c() can combine existing vectors as well, not just create new ones.

c(new_vector,c(1,2,3,4,5))
## [1]  1 10 45 -1  1  2  3  4  5

Names

After you create a vector, you can give the elements names, using the names() function, and a character vector.

new_vector
## [1]  1 10 45 -1
names(new_vector) <- c("A","B","C","D")
new_vector
##  A  B  C  D 
##  1 10 45 -1

This can be useful later when you want to pick out a few values from the larger vector (we'll talk more about this sort of thing in the next lab).

Vector Tricks

You can create a sequential numeric vector using the colon operator :, instead of typing out hundreds or thousands of values by hand.

55:100
##  [1]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
## [18]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88
## [35]  89  90  91  92  93  94  95  96  97  98  99 100

Sequences of other step sizes can be made with the seq() function

seq(from = 5, to = 22, by = 3)
## [1]  5  8 11 14 17 20

Element-wise Math operations

You can apply the math operators we used in the first class to vectors as well. When you add, divide, multiply, or subtract a set of vectors, R matches the vectors up by position and applies the operation to each pair of elements.

vec1 <- c(5,-1,100,75)
vec2 <- c(10,3,4,-4)
vec2 + vec1
## [1]  15   2 104  71
vec2 * vec1
## [1]   50   -3  400 -300

So, the first element of vec1 gets added to the first element of vec2, etc…

If you want to find the grand sum or product of all the elements in a numeric vector, use the sum() and prod() functions, respectively.

Activity

  1. Try to create a vector that uses all the 4 basic data types. What type of data is stored in the resulting vector?
  2. Take vec1 and vec2 you created earlier and concatenate them together, and find the grand sum.

Matrices

Matrices in R are the same as matrices from linear algebra, with the notable difference of being able to hold non-numeric types of data.

They are a rectangular 2-D data structure, meaning that the number of elements in the matrix must be equal to the product of the number of rows and the number of columns.

  • rows = dimension 1
  • columns = dimension 2

Like vectors, data types may not be mixed in a matrix (e.g. you cannot have some elements be characters and other be numeric, etc.)

Creating Matrices

A matrix is created by feeding a vector into the matrix() function, and specifying either the number of rows or number of columns, or both.

wide <- matrix(c(1:3,99:101),ncol=3)
wide
##      [,1] [,2] [,3]
## [1,]    1    3  100
## [2,]    2   99  101
long <- matrix(c(1:3,99:101),nrow=3)
long
##      [,1] [,2]
## [1,]    1   99
## [2,]    2  100
## [3,]    3  101

Creating Matrices

If you want a large matrix but with only a few unique values, take advantage of R's ability to recycle input.

matrix(c(4),nrow=3, ncol=10)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    4    4    4    4    4    4    4    4    4     4
## [2,]    4    4    4    4    4    4    4    4    4     4
## [3,]    4    4    4    4    4    4    4    4    4     4

Note the matrix is filled up by column (i.e. first element goes to row 1 column 1, second element goes to row 2, column 1, etc.)

matrix(c(4,0), nrow=2, ncol=10)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    4    4    4    4    4    4    4    4    4     4
## [2,]    0    0    0    0    0    0    0    0    0     0

Matrix Operations

You can inspect the dimensionality of a matrix (or any R object) with the dim function.

dim(long) # 3 by 2
## [1] 3 2
dim(wide) # 2 by 3
## [1] 2 3

You can get the total number of elements in the matrix by multiplying these numbers

prod(dim(wide))
## [1] 6

Matrix Math

When doing math on matrices with the +, *, / or - operators, R will by default applies the operations element-wise (i.e., add the first element of one matrix to the first element of the other matrix, etc.).

But you can perform matrix multiplication with the %*% operator.

long %*% wide
##      [,1]  [,2]  [,3]
## [1,]  199  9804 10099
## [2,]  202  9906 10300
## [3,]  205 10008 10501

Activity

Try creating a matrix with 900 elements and 9 columns that contains only logical TRUEs. Confirm the dimensionality using R.

Data Frames

Like matrices, Data Frames are are 2D, rectangular data structures, but are more flexible because they allow for different data types to be stored in each column.

A Data Frame is usually the best way to store and work with data that mixes qualitative and quantitative variables.

Creating Data Frames

Data frames can be created by passing name = value pairs to the data.frame() function. The values should be vectors (of any type) and the names should be unquoted strings of text, which will be used to label each column. Importantly, all the vectors stored in the data frame must be of identical length.

Essentially, a data frame is a container that imposes a relational structure on a set of vectors.

df <- data.frame(x=c(1,4,4,2), y=c(3,3,1,4),
                 month=c("Sep","Oct","Nov","Jan"),
                 stringsAsFactors = FALSE)
df
##   x y month
## 1 1 3   Sep
## 2 4 3   Oct
## 3 4 1   Nov
## 4 2 4   Jan

stringsAsFactors?

This example also shows an extra argument, stringsAsFactors = FALSE, that was not a variable to store in the data frame. This argument controls how R interprets character vectors (a.k.a. strings) when forming the data frame.

If stringsAsFactors = TRUE, the default setting, then R converts your character vector to a factor variable in the data frame.

A factor variable is a class of vector where all the possible values are known and already included in the vector. A good example of a factor variable would be the gender responses to a survey where there was a box for male, and a box for female; there are only 2 possible values the data can take on, and you know them to be valid a-priori.

If stringsAsFactors = FALSE, then R leaves the input data alone when forming the data frame.

stringsAsFactors?

What should you use?

99% of the time, you'll want your data frame to have a plain old character vector for a variable. They're inherently more flexible, and if you decide later you do want a factor variable, you can always convert it into one with the factor() function.

So, get into the habit of specifying stringsAsFactors = FALSE when you use functions that create data frames (e.g., data.frame(), read.csv(), read.table(), etc.)

Or, you can change the setting globally for the length of your R session by using options(stringsAsFactors = FALSE).

If you want to know more about the reasons for all this nonsense, stringsAsFactors: An unauthorized biography and stringsAsFactors = are two great reads.

Data Frame Tricks (and Traps)

The data.frame() function can also recycle input, so you can save some keystrokes if you have a column that has only a few unique values that repeat.

R compares the length of the first vector to the length of each vector after it. If the lengths don't match, R will check if the length of one vector is a multiple of the length of the other vector.

If so, R will repeat the shorter vector over and over so the lengths match up. In this example, the x variable has length 4, and the month vector has length 1; since 4 is a multiple of 1 (1 x 4 = 4), R will repeat the month vector 4 times so the lengths match.

data.frame(x=c(1,4,4,2),y=c(3,3,1,4),month="Sep")
##   x y month
## 1 1 3   Sep
## 2 4 3   Sep
## 3 4 1   Sep
## 4 2 4   Sep

Data Frame Tricks (and Traps)

The data.frame() function will throw an error if the length of the longer vector is not a multiple of the shorter vector's lengths.

Here, the x variable still has length 4, but the month vector has length 3; since 4 is not multiple of 3, an error occurs.

data.frame(x=c(1,4,4,2),y=c(3,3,1,4),
           month=c("Sep","Oct","Nov")) 
## Error in data.frame(x = c(1, 4, 4, 2), y = c(3, 3, 1, 4), month = c("Sep", : arguments imply differing number of rows: 4, 3

Recycling enables concise code, but can make for hard to spot bugs when you didn't intended for recycling to happen. Use this feature with care, and favor being explicit about your vector lengths when programming.

Lists

Lists are the most abstract and flexible data structure in R. Lists can hold any type of R object, but doesn't impose any relationship between them.

You can have a list holding matrices, data frames, vectors, and even lists holding other lists!

Lists

Think of lists like a folder on your hard drive.

You can stuff any kind of file you like in there and give it a name, but there is no relationship between them inside that folder, other than the order they are sorted in.

Use a list when you need to group data structures of different sizes and types together. But carefully consider if there is another way, because the lack of structured relationships between the data in different list elements can make them tricky to work with

Creating Lists

As you might have guessed, you can create lists of your own with the list() function.

Like the data.frame() function, you pass in name=value pairs. But now, the values can be any R object, of any size, not just vectors with the same lengths.

biglist <- list(first = -10:-15, second = data.frame(x=c("A","B"), y = 1:2))
biglist
## $first
## [1] -10 -11 -12 -13 -14 -15
## 
## $second
##   x y
## 1 A 1
## 2 B 2

List Tips

  • Avoid nesting lists if you can. One level of nesting is usually all right, but more than 1 is pain.
  • Use the str() function to examine the structure of the list if you're having problems working with it. This function will tell you useful information about any R object's internal structure, such as the class of the data structure and data types it hold, all the unique values in it, etc.
str(biglist)
## List of 2
##  $ first : int [1:6] -10 -11 -12 -13 -14 -15
##  $ second:'data.frame':  2 obs. of  2 variables:
##   ..$ x: Factor w/ 2 levels "A","B": 1 2
##   ..$ y: int [1:2] 1 2

Activity

Create a data frame that holds you and your neighbor(s) first name, last name, and age. Think about the appropriate variables your data frame should have to represent this information in a well-organized manner

Then, put this data frame in a list with a matrix. Report the details of this lists structure and the structure of the objects it holds, using an R function.