Star (-) Watch (-)

Introduction to R

Reshape

One way to tidy data is to reshape it so that it adheres to the three rules of tidy data. While base R has several functions aimed at reshaping data, we will use the reshape2 package by Hadley Wickham, as it provides a simple and consistent set of functions to reshape data.

Basics

In the simplest terms, reshaping data is like doing a pivot table in excel, where you shuffle columns, rows and values. Let us start by tidying the pew dataset.

pew <- read.delim(
  file = "data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F
)

We can tidy this data using the melt function in the reshape2 package.

library(reshape2)
pew_tidy <- melt(
  data = pew,
  id = "religion",
  variable.name = "income",
  value.name = "frequency"
)

religion	income	frequency
Agnostic	<$10k	27
Atheist	<$10k	12
Buddhist	<$10k	27
Catholic	<$10k	418
Don’t know/refused	<$10k	15
Evangelical Prot	<$10k	575

Exercise 1

Now that you know how to melt a data frame, use the same idea to tidy the tb dataset. Think about how you can split gender and age after melting the data

tb <- read.csv(
  file = "data/tb.csv",
  header = TRUE, 
  stringsAsFactors = FALSE
)

Solution 1

library(reshape2)

# set column `new_sp` to NULL and clean up column names
tb$new_sp = NULL
names(tb) <- gsub("new_sp_", "", names(tb))

# Use na.rm = TRUE to remove missing observations
tb_tidy <- melt(
  data = tb,
  id = c("iso2", "year"),
  variable.name = "gender_age",
  value.name = "cases",
  na.rm = TRUE
)

# split gender_age into gender and age group
library(plyr)
tb_tidy <- mutate(tb_tidy,
  gender = sub("^([m|f])(.*)$", "\\1", gender_age),
  age = sub("^([m|f])(.*)$", "\\2", gender_age),
  gender_age = NULL
)
tb_tidy <- tb_tidy[c('iso2', 'year', 'gender', 'age', 'cases')]

There is one more step of cleaning required to assign the age groups more meaningful names. For example, 04 stands for 0-4.

Exercise 2

Use the same principles to clean the weather dataset.

weather <- read.delim(
 file = "data/weather.txt",
 stringsAsFactors = FALSE
)

Solution 2

We first melt the data and convert the day variable into numbers.

weather_tidy <- melt(
  data = weather,  
  id = 1:4,
  variable.name = "day",
  value.name = "temparature",
  na.rm = TRUE
)
weather_tidy <- mutate(weather_tidy,
  day = sub("^d", "", day)  
)

We still have the issue of variables TMAX and TMIN being present in rows. We can fix this using the dcast function in reshape2. Type ?dcast to read up the documentation on this function.

weather_tidy2 <- dcast(
  data = weather_tidy,
  formula = id + year + month + day ~ element, 
  value.var = "temparature"
)

Exercise 3

Tidy the dataset billboards.csv! Note that this dataset suffers from messiness of other types. So you will need to think beyond what was done in the previous exercises.

Solution 3

We start by reading the dataset using read.csv. I don't like using periods in column names, and so let us convert periods to underscores.

billboards <- read.csv(
  file = "http://stat405.had.co.nz/data/billboard.csv",
  stringsAsFactors = FALSE
)
names(billboards) <- gsub("\\.", "_", names(billboards))

billboards <- mutate(billboards,
  artist_inverted = iconv(artist_inverted, "MAC", "UTF-8"),
)  
billboards_tidy <- melt(billboards, 
  id = 1:7,
  variable.name = "week",
  value.name = "rank",
  na.rm = TRUE
)
billboards_tidy <- mutate(billboards_tidy,
  week = as.numeric(gsub("^x([[:digit:]]+).*", "\\1", week))  
)