Problem Set: Data Wrangling

In this problem set, we will practice key data manipulation tasks using tools from “base” R and the dplyr package.

Libraries

library(ggplot2)
library(dplyr)

For all exercises, make sure your R code prints out your final results!!

Exercise

Try to create a vector that holds all four basic data types. Does this work? What type or types of data are stored in the resulting vector?
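As a hint for checking your own answer, here is a minimal sketch of how one might inspect the type of a vector after combining values. The vector name `mixed` and its contents are hypothetical, and it deliberately combines only two types rather than all four.

```r
# Hypothetical example: combine a logical value and a decimal number.
mixed <- c(TRUE, 2.5)

# typeof() reports how R stores the vector internally;
# class() gives the more user-facing label.
print(typeof(mixed))
print(class(mixed))

# Printing the vector shows how each element was stored.
print(mixed)
```

Running the same two inspection functions on your own four-type vector should tell you what R decided to do with it.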

Solution

Exercise

Find the grand sum of all the elements in the following two vectors (using R, of course).

x <- c(149, 486, 174, 435, 188, 397, 497, 256, 346, 494)
y <- c(1839, 2709, 2422, 1547, 1686, 2159, 2929, 651, 1358, 756)
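One possible approach, sketched on two small hypothetical vectors (`a` and `b` are not part of the exercise), relies on the fact that base R's sum() accepts more than one vector at a time.

```r
# Hypothetical toy vectors, much shorter than x and y above.
a <- c(1, 2, 3)
b <- c(10, 20)

# sum() adds up every element of every vector it is given.
total <- sum(a, b)
print(total)
```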

Solution

Exercise

A local grocery store sells several varieties of fruit: grapes for $4, kiwis for $2.00, mangoes for $3.00, and apples for 99 cents. Each fruit also has a “sell by” date: the grapes must be sold by February 19th, the kiwis must be sold by February 20th, the mangoes must be sold by February 18th, and the apples must be sold by February 28th.

In R, create a “tidy” data frame that represents each fruit for sale, its price, and its expiration date. Make sure each column in your data frame has an appropriate name, and carefully consider what data type you represent each piece of information with.
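To illustrate the general shape of a tidy data frame with one row per item and one column per variable, here is a minimal sketch using made-up bakery data. The item names, prices, column names, and the year 2024 are all assumptions chosen for illustration and are unrelated to the fruit above.

```r
# Hypothetical data: one row per item, one column per variable.
bakery <- data.frame(
  item    = c("bagel", "muffin", "scone"),       # text labels
  price   = c(1.50, 2.25, 3.00),                 # prices in dollars, as numbers
  sell_by = as.Date(c("2024-03-01",              # true Date values rather than
                      "2024-03-03",              # plain character strings
                      "2024-03-02")),
  stringsAsFactors = FALSE
)

# str() shows the data type chosen for each column.
str(bakery)
print(bakery)
```

Storing prices as numbers and dates as Date values, rather than as text, is what later lets you compare and sort them.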

Solution

Exercise

From your fruit data frame, select only rows where the price is $2.00 or more.
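Row selection by a condition can be sketched in two ways, with base R logical indexing or with dplyr's filter(). The `drinks` data frame and the cutoff of 3 below are hypothetical and only show the pattern.

```r
library(dplyr)

# Hypothetical data frame with a numeric price column.
drinks <- data.frame(
  drink = c("tea", "coffee", "juice"),
  price = c(1.75, 3.00, 4.50),
  stringsAsFactors = FALSE
)

# Base R: keep rows where the logical condition is TRUE.
pricey_base <- drinks[drinks$price >= 3, ]

# dplyr: filter() keeps the same rows.
pricey_dplyr <- filter(drinks, price >= 3)

print(pricey_base)
print(pricey_dplyr)
```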

Solution

The txhousing data

For the remainder of the problems, you will be working with the txhousing data set, which is included in the ggplot2 package and will be loaded when you use library(ggplot2). Once you’ve loaded the package, take a peek at the data set below, taking note of the data type of each variable:

glimpse(txhousing)

## Rows: 8,602
## Columns: 9
## $ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abilene"…
## $ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000…
## $ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9…
## $ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, 112,…
## $ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 12635000…
## $ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 64500, 5…
## $ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, 779,…
## $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, 6.8,…
## $ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, 2000…

This data set has monthly observations of the housing market in 46 regions of Texas from 2000 through 2015. The city, year, month, and date variables identify the city, year, month, and exact date of each observation. The measured variables for each observation are:

sales: Number of sales
volume: Total value of sales
median: Median sale price
listings: Total number of homes listed for sale
inventory: “Months inventory”, i.e., the amount of time (in months) it would take to sell all current listings at the current pace of sales.

For each problem below, always print out the final data frame that gives you the answer to the question.

Exercise

Remove any rows of the txhousing data set with missing values in the sales variable, and overwrite the original data frame. Then use the anyNA() function on the sales variable, e.g., anyNA(txhousing$sales). If you were successful, this should print out FALSE.
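The general pattern of dropping rows with a missing value in one column and then confirming the result can be sketched on a small hypothetical data frame. The names `toy`, `region`, and `count` are illustrative and not part of txhousing.

```r
library(dplyr)

# Hypothetical data with two missing values in the count column.
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  count  = c(10, NA, 7, NA),
  stringsAsFactors = FALSE
)

# Keep only rows where count is not missing, overwriting the original.
toy <- filter(toy, !is.na(count))

print(toy)
print(anyNA(toy$count))
```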

Solution

Exercise

Make a data set called dallas that includes data only from the city of Dallas in the years 2000 through 2010. If you did this correctly, running the command distinct(dallas, city, year) should print a data frame with 11 rows.
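Filtering on several conditions at once and then checking the result with distinct() might look like the following sketch. The `visits` data frame, its columns, and the year range are hypothetical.

```r
library(dplyr)

# Hypothetical data: repeated observations per site and year.
visits <- data.frame(
  site = c("North", "North", "North", "South", "North"),
  year = c(2001L, 2001L, 2002L, 2001L, 2005L),
  n    = c(5, 8, 3, 9, 4),
  stringsAsFactors = FALSE
)

# Conditions separated by commas must all be TRUE for a row to be kept.
north_early <- filter(visits, site == "North", year >= 2001, year <= 2002)

# distinct() returns one row per unique combination of the named columns.
combos <- distinct(north_early, site, year)

print(north_early)
print(combos)
```

Counting the rows of the distinct() result is a quick way to confirm that the filter kept exactly the combinations you intended.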

Solution

Exercise

The sales variable holds records of the number of homes sold in a given month, year, and city. For example, the first row of the entire data set tells you that 72 homes were sold in Abilene during January 2000. Use the sales variable to find:

The median number of monthly home sales across all cities and all years
The median number of monthly home sales in each city during each year
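The difference between an overall summary and a grouped summary can be sketched with dplyr's summarize() and group_by(). The `orders` data frame below is hypothetical, and the `.groups = "drop"` argument assumes dplyr version 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data: counts per shop and year, with one missing value.
orders <- data.frame(
  shop = c("X", "X", "X", "Y", "Y", "Y"),
  year = c(2020L, 2020L, 2021L, 2020L, 2021L, 2021L),
  n    = c(4, 10, 6, 1, 3, NA),
  stringsAsFactors = FALSE
)

# One median over every row; na.rm = TRUE skips missing values.
overall <- summarize(orders, median_n = median(n, na.rm = TRUE))

# One median per shop-year combination.
by_group <- orders %>%
  group_by(shop, year) %>%
  summarize(median_n = median(n, na.rm = TRUE), .groups = "drop")

print(overall)
print(by_group)
```

The first result has a single row, while the second has one row per group, which mirrors the two quantities asked for above.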
