One way to tidy data is to reshape it so that it adheres to the three rules of tidy data. While base R has several functions aimed at reshaping data, we will use the reshape2 package by Hadley Wickham, as it provides a simple and consistent set of functions to reshape data.
In the simplest terms, reshaping data is like doing a pivot table in excel, where you shuffle columns, rows and values. Let us start by tidying the pew dataset.
pew <- read.delim(
file = "data/pew.txt",
header = TRUE,
stringsAsFactors = FALSE,
check.names = F
)
We can tidy this data using the melt function in the reshape2 package.
library(reshape2)
pew_tidy <- melt(
data = pew,
id = "religion",
variable.name = "income",
value.name = "frequency"
)
| religion | income | frequency |
|---|---|---|
| Agnostic | <$10k | 27 |
| Atheist | <$10k | 12 |
| Buddhist | <$10k | 27 |
| Catholic | <$10k | 418 |
| Don’t know/refused | <$10k | 15 |
| Evangelical Prot | <$10k | 575 |
Now that you know how to melt a data frame, use the same idea to tidy the tb dataset. Think about how you can split gender and age after melting the data
tb <- read.csv(
file = "data/tb.csv",
header = TRUE,
stringsAsFactors = FALSE
)
library(reshape2)
# set column `new_sp` to NULL and clean up column names
tb$new_sp = NULL
names(tb) <- gsub("new_sp_", "", names(tb))
# Use na.rm = TRUE to remove missing observations
tb_tidy <- melt(
data = tb,
id = c("iso2", "year"),
variable.name = "gender_age",
value.name = "cases",
na.rm = TRUE
)
# split gender_age into gender and age group
library(plyr)
tb_tidy <- mutate(tb_tidy,
gender = sub("^([m|f])(.*)$", "\\1", gender_age),
age = sub("^([m|f])(.*)$", "\\2", gender_age),
gender_age = NULL
)
tb_tidy <- tb_tidy[c('iso2', 'year', 'gender', 'age', 'cases')]
There is one more step of cleaning required to assign the age groups more meaningful names. For example, 04 stands for 0-4.
Use the same principles to clean the weather dataset.
weather <- read.delim(
file = "data/weather.txt",
stringsAsFactors = FALSE
)
We first melt the data and convert the day variable into numbers.
weather_tidy <- melt(
data = weather,
id = 1:4,
variable.name = "day",
value.name = "temparature",
na.rm = TRUE
)
weather_tidy <- mutate(weather_tidy,
day = sub("^d", "", day)
)
We still have the issue of variables TMAX and TMIN being present in rows. We can fix this using the dcast function in reshape2. Type ?dcast to read up the documentation on this function.
weather_tidy2 <- dcast(
data = weather_tidy,
formula = id + year + month + day ~ element,
value.var = "temparature"
)
Tidy the dataset billboards.csv! Note that this dataset suffers from messiness of other types. So you will need to think beyond what was done in the previous exercises.
We start by reading the dataset using read.csv. I don't like using periods in column names, and so let us convert periods to underscores.
billboards <- read.csv(
file = "http://stat405.had.co.nz/data/billboard.csv",
stringsAsFactors = FALSE
)
names(billboards) <- gsub("\\.", "_", names(billboards))
billboards <- mutate(billboards,
artist_inverted = iconv(artist_inverted, "MAC", "UTF-8"),
)
billboards_tidy <- melt(billboards,
id = 1:7,
variable.name = "week",
value.name = "rank",
na.rm = TRUE
)
billboards_tidy <- mutate(billboards_tidy,
week = as.numeric(gsub("^x([[:digit:]]+).*", "\\1", week))
)