Text Processing Basics

2018-05-04

Text Processing Basics

R is primarily a numeric computing environment, but has a wide array of tools to process textual data as well.

By textual data, I mean data stored in character vectors, e.g.,

c("I am textual data!")

The individual elements of character vectors are formally called character constants but are colloquially referred to to as strings.

Today, we'll cover some basic functions for manipulating strings and character vectors, including how to combine and split apart pieces of strings, search for strings that match a pattern, and replace portions of strings with other characters.

Elements vs. Characters

There is an important distinction between the number of elements in a character vector, and the number of characters in a single string.

For example, the character vector c("I", "Love", "You") has 3 elements ("I","Love", and "You"), but its individual strings have 1, 4, and 3 characters, respectively.

Use the length() function to determine the number of elements in a vector, and use the nchar() function to determine the number of characters in each string of the vector.

length(c("I", "Love", "You"))

## [1] 3

nchar(c("I", "Love", "You"))

## [1] 1 4 3

Changing Cases

If you want to change all the characters in a vector to lower or uppercase characters, we can use the tolower() and toupper() functions.

tolower("Good MORNING!")

## [1] "good morning!"

toupper("Good MORNING!")

## [1] "GOOD MORNING!"

This is useful when you want to compare 2 vectors for equality, without the case of the letters getting in the way of the comparison.

tolower("SEND HELP!") == "send help!"

## [1] TRUE

Combining Character Vectors

The paste function is used to combine two or more character vectors into one vector.

paste("You're", "a", "wizard", "Harry", "!")# 5 Individual strings

## [1] "You're a wizard Harry !"

This function is quite useful when you have a string represented as a variable, and you would like to place together with other text (e.g., to make a sentence).

paste("The current date and time is: ", date())

## [1] "The current date and time is:  Fri May 04 12:23:09 2018"

Combining Character Vectors

If the vectors have more than one element, and each vector has the same length, their elements will be pasted together pairwise.

paste(c("Good","Light","Right"),
      c("Bad", "Dark", "Wrong"))

## [1] "Good Bad"    "Light Dark"  "Right Wrong"

You can also choose a string that will separate the concatenated elements by using the sep argument (the default is a space).

paste(c("Good","Light","Right"),
      c("Bad", "Dark", "Wrong"),
      sep= " vs. ")

## [1] "Good vs. Bad"    "Light vs. Dark"  "Right vs. Wrong"

Combining Character Vectors

If the vectors have more than one element, but each vector does not have the same length, R will do some weird recycling of each vectors elements…

paste(c("A","B"),
      c("C","D","E"), 
      c("V","W","X","Y","Z"))

## [1] "A C V" "B D W" "A E X" "B C Y" "A D Z"

c("A","B") gets extended to c("A","B","A","B","A") and c("C","D","E") gets extended out to be c("C","D","E","C","D"). This is probably not what you want, so be careful.

Combining Character Vectors

paste() can also combine the many elements of a character vector into a single element, by supplying a value for the collapse argument.

paste(c("1","2","3"), collapse = " + ")

## [1] "1 + 2 + 3"

You can also use this together with multiple character vectors, but be careful, because the vectors elements are concatenated before the final product is collapsed!

paste(c("1","2","3"), c("4","5","6"),
      collapse = " + ")

## [1] "1 4 + 2 5 + 3 6"

"1" and "4", "2" and "5", and "3" and "6" are concatenated with a space, then "1 4", "2 5" and "3 6" are collapsed with a " + ".

Splitting vectors apart

If we want to extract some of the characters within a string, and we know the specific positions of the characters within a string we want to extract (e.g., the first through the fifth characters), we can use the substr() function.

substr() has 3 arguments:

x: the vector whose elements should have some characters extracted
start: the position of the first character to be extracted in each element
stop: the position of the last character to be extracted in each element

substr("abcdefg", start=2, stop=4)

## [1] "bcd"

Splitting vectors apart

substr is a vectorized function, so we can extract sub-strings from each string in character vectors with more than 1 elements. You can also make start and stop the same value to extract just one character.

substr(c("Hit","Eye","Low","Lamb","Oval"), start=1, stop=1)

## [1] "H" "E" "L" "L" "O"

We can also extract different substrings from each element in the character vector by supplying longer vectors to start and stop.

substr(c("Pill","xzzy","haa!"), start=c(1,2,3), stop=c(2,3,4))

## [1] "Pi" "zz" "a!"

Here, we extract the first and second characters from "Pill", the second and third characters from "xzzy", and the third and fourth characters from "haa!".

Patern Matching

What if you want to extract part of a string, but don't know for sure where the part you want will be located? Or what if you want to count the number of times a specific word occurs in some text, but the text isn't broken up by individual words already?

If you know how to describe the pattern of text you're looking for, then you can use regular expressions to search for strings and substrings that match your pattern of interest.

Regular expressions (or regexes, for short) are a double edged sword: they are extremely powerful and useful, but can easily summon Codethulu. We'll just touch on some basics today, and leave the dangerous stuff for another class.

grep

R's grep() function (an acronym for " global regular expression print") is commonly used to search for matches to a regular expression pattern within the elements of a character vector.

grep()has 2 required arguments:

pattern: A character string containing a regular expression
x: A character vector where matches to pattern are searched for

By default, grep() returns the locations of the elements in the character vector that match the regex.

But, we can have it return the actual character strings that were matched by setting the value argument to TRUE.

Literal Character Patterns

The simplest regular expressions include only literal characters, e.g., a, b, c, 1, 2, 3, etc.

For example, the regex "cat" matches the strings, "category", "catnip", and "concatenate", because the characters "c", then "a", then "t", literally appear in each one.

But, it wouldn't match "cart", because the "r" interrupts the pattern.

grep("cat", c("category", "catnip", "concatenate", "cart"), value=TRUE)

## [1] "category"    "catnip"      "concatenate"

grepl is a cousin of grep's, which returns a logical vector indicating match or mismatch.

grepl("cat", c("category", "catnip", "concatenate", "cart"))

## [1]  TRUE  TRUE  TRUE FALSE

Character Classes

We can also specify a set of possible characters to match. If any of the characters in the set occur in the string, it is considered a match.

These sets are called "character classes", and are denoted by surrounding the characters you want to match with square brackets.

So, if we wanted to find a match to a "c" or an "a", or a "t" occurring anywhere in a string, we could use the regexes [cat], or [act], or [tac], etc., order doesn't matter in a character class.

grep("[tac]", c("cat", "dog", "cart", "bar"), value=TRUE)

## [1] "cat"  "cart" "bar"

Ranges of Characters

If you want to specify a large range of characters for your set, the - symbol provides a convenient shorthand you can use to define ranges of characters.

For example, if I wanted to match any of the characters "a", "b", "c", "d", "e", "f", "g", "h", "1", "2", "3", "4", "5", or "6", I could type them all out, or use the following character ranges:

grep("[abcdefgh12345]",c("funky","town","is","#1"),value=TRUE)

## [1] "funky" "#1"

grep("[a-h1-6]",c("funky","town","is","#1"),value=TRUE)

## [1] "funky" "#1"

Replacing Characters

We can use regular expressions to describe which characters in a string we would like to replace, and substitute in new characters using the gsub() function.

gsub() takes 3 required arguments

pattern: A regular expression
replacement: The replacement string for any matches to the regular expression
x: A character vector where matches to pattern are searched for

If you want to remove the characters matching your regular expression, use an empty string (written "") for the replacement argument.

Replacing Characters

Lets imagine that some of the elements in a character vector have variable amounts trailing or leading whitespace, and we want to remove it.

text <- c("cat", "   cat", "cat      ") # 3 leading, 6 trailing spaces
text

## [1] "cat"       "   cat"    "cat      "

We could remove all space characters from the elements of the vector text with the following code:

gsub(" ", "", text)

## [1] "cat" "cat" "cat"

Replacing Characters

Now lets try the same task of stripping whitespace off the ends, but with a little more complex string:

text <- c("   The quick brown fox jumped over the lazy dog    ")
gsub(" ", "", text)

## [1] "Thequickbrownfoxjumpedoverthelazydog"

This is probably not what you wanted. The issue is that gsub doesn't discriminate between whitespace at the beginning and ends, or in the middle of the string - it replaces everything!

But, we can get around this by using anchors.

Anchoring

There are two anchors in regular expressions: the ^ anchor, which anchors a pattern to the beginning of the string, and the $ anchor, which anchors a pattern to the end of the string.

Anchoring means that if the pattern doesn't begin at the start (or conclude at the end) of the string, it won't be matched. We can combine them with the |, which means match either one.

gsub("^ ", "", text) # only replaces space at the beggining

## [1] "  The quick brown fox jumped over the lazy dog    "

gsub(" $", "", text) # only replaces space at end beggining

## [1] "   The quick brown fox jumped over the lazy dog   "

gsub("^ | $", "", text)

## [1] "  The quick brown fox jumped over the lazy dog   "

Controlling Match Size

But this time, gsub only removed 1 single space at the beginning and end. Huh?

This is because only the very first space (or very last space) matched the pattern. Remember, we asked gsub to match spaces at the beginning (or end) of the string, and it did: The second and third spaces from the beginning, by definition, don't occur at the start, and so they weren't replaced!

We can make the matching process "greedier" with repetition operators. There are 3 repetition operators in regexes:

+ : Means "match preceding character one or more times"
* : Means "match preceding character zero or more times"
? : Means "match preceding character zero or one times"

Controlling Match Size

In our case, we want to replace one or more whitespace characters, so we'll use the +.

gsub("^ +| +$", "", text)

## [1] "The quick brown fox jumped over the lazy dog"

Splitting Strings

Finally, we'll look at how to split single strings into several pieces (while still retaining both pieces in our output).

The R function for splitting strings into smaller pieces is strsplit(), which takes 2 required arguments:

x: A character vector, each element of which is to be split into multiple pieces
split: A regular expression which defines which patterns will divide the elements of x into smaller units.

Splitting Strings

The most common use case for strsplit is splitting a string based on a literal character (otherwise known as splitting on a delimiter).

strsplit(c('1,2','a,b,c,d'), ",")

## [[1]]
## [1] "1" "2"
## 
## [[2]]
## [1] "a" "b" "c" "d"

Since strsplit has no way of knowing how many pieces a given string might be split up into, it always returns a list with as many elements as there are in x. The elements of the list it returns hold the split pieces.

Splitting Strings

If empty matches are found to the pattern in split, (usually when split is ""), x is split into single characters.

strsplit('abcdefg', "")

## [[1]]
## [1] "a" "b" "c" "d" "e" "f" "g"

Splitting on Special Strings

What happens if the delimiter you want to split on is a character that has special meaning in a regex, like + or . (which is a special character class the matches any character other than line breaks)?

To make R's regular expression engine interpret those special characters as literal characters, you must prefix them with a double backslash (i.e., "\\").

strsplit("1+2", "+") # empty match

## [[1]]
## [1] "1" "+" "2"

strsplit("1+2", "\\+") # Probably what you wanted =)

## [[1]]
## [1] "1" "2"

Activity

Select the elements from the following character vectors have two or more "S" characters. How many "S" characters are there in each of those elements?

x <- c("STSTSSST","STTTTT","SSSTSTT","S","SSSTTSSTT","TTTSTT")