In this problem set, we will practice some of the key data manipulation tasks using the tools from “base” R and the dplyr
package.
library(ggplot2)
library(dplyr)
For all exercises, make sure your R code prints out your final results!!
Try to create a vector that holds all four basic data types. Does this work? What type or types of data are stored in the resulting vector?
Find the grand sum of all the elements in the following two vectors (using R, of course).
x <- c(149, 486, 174, 435, 188, 397, 497, 256, 346, 494)
y <- c(1839, 2709, 2422, 1547, 1686, 2159, 2929, 651, 1358, 756)
A local grocery store sells several varieties of fruit: grapes for $4, kiwis for $2.00, mangoes for $3.00, and apples for 99 cents. Each fruit also has a “sell by” date: the grapes must be sold by February 19th, the kiwis must be sold by February 20th, the mangoes must be sold by February 18th, and the apples must be sold by February 28th.
In R, create a “tidy” data frame that represents each fruit for sale, its price, and its expiration date. Make sure each column in your data frame has an appropriate name, and carefully consider what data type you represent each piece of information with.
From your fruit data frame, select only rows where the price is $2.00 or more.
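As a hint for the two steps above, here is a minimal sketch on a made-up data set. The snacks data frame and its column names are hypothetical and are not part of the assignment. It shows one way to store a date as a Date rather than as text, and one way to keep rows with base R logical indexing; the dplyr filter() function would work just as well.

```r
# Hypothetical toy data, unrelated to the fruit exercise
snacks <- data.frame(
  item = c("chips", "pretzels", "popcorn"),
  price = c(1.50, 2.25, 0.75),  # numeric, in dollars
  best_by = as.Date(c("2024-03-01", "2024-03-15", "2024-02-20"))  # Date, not character
)

# Check the data type of each column
str(snacks)

# Keep only the rows whose price is at least 1 dollar
snacks[snacks$price >= 1, ]
```

Storing prices as numbers and dates as Date values means comparisons such as >= behave as expected, which would not be the case if those columns were stored as character strings.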
txhousing data
For the remainder of the problems, you will be working with the txhousing data set, which is included in the ggplot2 package and will be loaded when you use library(ggplot2). Once you’ve loaded the package, take a peek at the data set below, taking note of the data type of each variable:
glimpse(txhousing)
## Rows: 8,602
## Columns: 9
## $ city <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abilene"…
## $ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000…
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9…
## $ sales <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, 112,…
## $ volume <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 12635000…
## $ median <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 64500, 5…
## $ listings <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, 779,…
## $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, 6.8,…
## $ date <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, 2000…
This data set has monthly observations of the housing market in 46 regions of Texas from the years 2000 through 2015. The city, year, month, and date variables identify the city, year, month, and exact date of each observation. The measured variables for each observation are:
sales: Number of sales
volume: Total value of sales
median: Median sale price
listings: Total number of homes listed for sale
inventory: “Months inventory”, i.e., the amount of time it would take to sell all current listings at the current pace of sales

For each problem below, always print out the final data frame that gives you the answer to the question.
Remove any rows of the txhousing data set with missing values in the sales variable, and overwrite the original data frame. Then use the anyNA() function on the sales variable, e.g., anyNA(txhousing$sales). If you were successful, this should print out FALSE.
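The general pattern can be sketched on a small made-up data frame. The toy object below is hypothetical and only stands in for a real data set; the sketch assumes the dplyr filter() function combined with is.na(), which is one of several ways to drop rows with missing values.

```r
library(dplyr)

# Hypothetical toy data with one missing value in sales
toy <- tibble(
  city = c("A", "A", "B"),
  sales = c(10, NA, 7)
)

# Keep rows where sales is not missing, and overwrite the original object
toy <- toy %>% filter(!is.na(sales))

print(toy)

# Confirm that no missing values remain in the column
anyNA(toy$sales)
```

Assigning the filtered result back to the same name is what overwrites the original data frame; without the assignment, the filtered rows are only printed.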
Make a data set called dallas that includes data only from the city of Dallas in the years 2000 through 2010. If you did this correctly, running the command distinct(dallas, city, year) should print a data frame with 11 rows.
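A minimal sketch of filtering on several conditions at once, again using hypothetical toy data rather than txhousing. It assumes that separating conditions with commas inside filter() keeps only rows that satisfy all of them, and it uses distinct() as a quick check of which city and year combinations remain.

```r
library(dplyr)

# Hypothetical toy data: two cities, a few years and months
toy <- tibble(
  city = c("A", "A", "A", "B"),
  year = c(2001L, 2001L, 2003L, 2001L),
  month = c(1L, 2L, 1L, 1L)
)

# Keep only city "A" in the years 2001 through 2002
city_a <- toy %>% filter(city == "A", year >= 2001, year <= 2002)

# One row per remaining city/year combination
distinct(city_a, city, year)
```

The same check applied to a correctly built subset of real data gives one row per year in the chosen range, which is why counting the rows of the distinct() output is a useful sanity test.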
The sales variable holds records of the number of homes sold in a given month, year, and city. For example, the first row of the entire data set tells you that 72 homes were sold in Abilene during January 2000. Use the sales variable to find: