# Reshape

One way to tidy data is to `reshape` it so that it adheres to the three rules of tidy data. While base R has several functions aimed at reshaping data, we will use the `reshape2` package by Hadley Wickham, as it provides a simple and consistent set of functions to reshape data.

### Basics

In the simplest terms, reshaping data is like doing a pivot table in excel, where you shuffle columns, rows and values. Let us start by tidying the pew dataset.

``````pew <- read.delim(
file = "data/pew.txt",
stringsAsFactors = FALSE,
check.names = F
)
``````

We can tidy this data using the `melt` function in the `reshape2` package.

``````library(reshape2)
pew_tidy <- melt(
data = pew,
id = "religion",
variable.name = "income",
value.name = "frequency"
)
``````

religion income frequency
Agnostic <\$10k 27
Atheist <\$10k 12
Buddhist <\$10k 27
Catholic <\$10k 418
Don’t know/refused <\$10k 15
Evangelical Prot <\$10k 575

### Exercise 1

Now that you know how to `melt` a data frame, use the same idea to tidy the tb dataset. Think about how you can split `gender` and `age` after melting the data

``````tb <- read.csv(
file = "data/tb.csv",
stringsAsFactors = FALSE
)
``````

### Solution 1

``````library(reshape2)

# set column `new_sp` to NULL and clean up column names
tb\$new_sp = NULL
names(tb) <- gsub("new_sp_", "", names(tb))

# Use na.rm = TRUE to remove missing observations
tb_tidy <- melt(
data = tb,
id = c("iso2", "year"),
variable.name = "gender_age",
value.name = "cases",
na.rm = TRUE
)

# split gender_age into gender and age group
library(plyr)
tb_tidy <- mutate(tb_tidy,
gender = sub("^([m|f])(.*)\$", "\\1", gender_age),
age = sub("^([m|f])(.*)\$", "\\2", gender_age),
gender_age = NULL
)
tb_tidy <- tb_tidy[c('iso2', 'year', 'gender', 'age', 'cases')]
``````

There is one more step of cleaning required to assign the `age` groups more meaningful names. For example, `04` stands for `0-4`.

### Exercise 2

Use the same principles to clean the weather dataset.

``````weather <- read.delim(
file = "data/weather.txt",
stringsAsFactors = FALSE
)
``````

### Solution 2

We first melt the data and convert the `day` variable into numbers.

``````weather_tidy <- melt(
data = weather,
id = 1:4,
variable.name = "day",
value.name = "temparature",
na.rm = TRUE
)
weather_tidy <- mutate(weather_tidy,
day = sub("^d", "", day)
)
``````

We still have the issue of variables `TMAX` and `TMIN` being present in rows. We can fix this using the `dcast` function in `reshape2`. Type `?dcast` to read up the documentation on this function.

``````weather_tidy2 <- dcast(
data = weather_tidy,
formula = id + year + month + day ~ element,
value.var = "temparature"
)
``````

### Exercise 3

Tidy the dataset `billboards.csv`! Note that this dataset suffers from messiness of other types. So you will need to think beyond what was done in the previous exercises.

### Solution 3

We start by reading the dataset using `read.csv`. I don't like using periods in column names, and so let us convert periods to underscores.

``````billboards <- read.csv(
stringsAsFactors = FALSE
)
names(billboards) <- gsub("\\.", "_", names(billboards))
``````
``````billboards <- mutate(billboards,
artist_inverted = iconv(artist_inverted, "MAC", "UTF-8"),
)
billboards_tidy <- melt(billboards,
id = 1:7,
variable.name = "week",
value.name = "rank",
na.rm = TRUE
)
billboards_tidy <- mutate(billboards_tidy,
week = as.numeric(gsub("^x([[:digit:]]+).*", "\\1", week))
)
``````