One way to tidy data is to reshape
it so that it adheres to the three rules of tidy data. While base R has several functions aimed at reshaping data, we will use the reshape2
package by Hadley Wickham, as it provides a simple and consistent set of functions to reshape data.
In the simplest terms, reshaping data is like doing a pivot table in excel, where you shuffle columns, rows and values. Let us start by tidying the pew dataset.
pew <- read.delim(
file = "data/pew.txt",
header = TRUE,
stringsAsFactors = FALSE,
check.names = F
)
We can tidy this data using the melt
function in the reshape2
package.
library(reshape2)
pew_tidy <- melt(
data = pew,
id = "religion",
variable.name = "income",
value.name = "frequency"
)
religion | income | frequency |
---|---|---|
Agnostic | <$10k | 27 |
Atheist | <$10k | 12 |
Buddhist | <$10k | 27 |
Catholic | <$10k | 418 |
Don’t know/refused | <$10k | 15 |
Evangelical Prot | <$10k | 575 |
Now that you know how to melt
a data frame, use the same idea to tidy the tb dataset. Think about how you can split gender
and age
after melting the data
tb <- read.csv(
file = "data/tb.csv",
header = TRUE,
stringsAsFactors = FALSE
)
library(reshape2)
# set column `new_sp` to NULL and clean up column names
tb$new_sp = NULL
names(tb) <- gsub("new_sp_", "", names(tb))
# Use na.rm = TRUE to remove missing observations
tb_tidy <- melt(
data = tb,
id = c("iso2", "year"),
variable.name = "gender_age",
value.name = "cases",
na.rm = TRUE
)
# split gender_age into gender and age group
library(plyr)
tb_tidy <- mutate(tb_tidy,
gender = sub("^([m|f])(.*)$", "\\1", gender_age),
age = sub("^([m|f])(.*)$", "\\2", gender_age),
gender_age = NULL
)
tb_tidy <- tb_tidy[c('iso2', 'year', 'gender', 'age', 'cases')]
There is one more step of cleaning required to assign the age
groups more meaningful names. For example, 04
stands for 0-4
.
Use the same principles to clean the weather dataset.
weather <- read.delim(
file = "data/weather.txt",
stringsAsFactors = FALSE
)
We first melt the data and convert the day
variable into numbers.
weather_tidy <- melt(
data = weather,
id = 1:4,
variable.name = "day",
value.name = "temparature",
na.rm = TRUE
)
weather_tidy <- mutate(weather_tidy,
day = sub("^d", "", day)
)
We still have the issue of variables TMAX
and TMIN
being present in rows. We can fix this using the dcast
function in reshape2
. Type ?dcast
to read up the documentation on this function.
weather_tidy2 <- dcast(
data = weather_tidy,
formula = id + year + month + day ~ element,
value.var = "temparature"
)
Tidy the dataset billboards.csv
! Note that this dataset suffers from messiness of other types. So you will need to think beyond what was done in the previous exercises.
We start by reading the dataset using read.csv
. I don't like using periods in column names, and so let us convert periods to underscores.
billboards <- read.csv(
file = "http://stat405.had.co.nz/data/billboard.csv",
stringsAsFactors = FALSE
)
names(billboards) <- gsub("\\.", "_", names(billboards))
billboards <- mutate(billboards,
artist_inverted = iconv(artist_inverted, "MAC", "UTF-8"),
)
billboards_tidy <- melt(billboards,
id = 1:7,
variable.name = "week",
value.name = "rank",
na.rm = TRUE
)
billboards_tidy <- mutate(billboards_tidy,
week = as.numeric(gsub("^x([[:digit:]]+).*", "\\1", week))
)