Raw data in the real world is often untidy and poorly formatted. Furthermore, it may lack appropriate details of the study. Correcting data in place can be dangerous: the original raw data gets overwritten, leaving no way to audit the process or recover from mistakes made along the way. A better practice is to keep the original data untouched, use a script to clean it and fix mistakes, and save the cleaned dataset for further analysis. In this lesson, you will learn all about tidy data.
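The practice of keeping raw data untouched and cleaning it with a script can be sketched as follows. This is only an illustration under stated assumptions: the file names, the column names and the particular fixes are hypothetical, and a temporary directory stands in for a real project folder.

```r
# Hypothetical raw file, written to a temporary directory for illustration.
raw_path   <- file.path(tempdir(), "survey_raw.csv")
clean_path <- file.path(tempdir(), "survey_clean.csv")
writeLines(c("id,Age ,height", "1,34,170", "2,-1,165", "3,29,NA"), raw_path)

# Read the raw data; the raw file itself is never modified.
raw <- read.csv(raw_path, stringsAsFactors = FALSE, check.names = FALSE)

# Clean a copy: normalise column names and recode an impossible age as missing.
clean <- raw
names(clean) <- tolower(trimws(names(clean)))
clean$age[clean$age < 0] <- NA

# Save the cleaned data to a separate file for further analysis.
write.csv(clean, clean_path, row.names = FALSE)
```

Because every fix lives in the script, the cleaning can be audited line by line and rerun from the raw file whenever a mistake is found.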
Question: Consider the data below. How many variables does this dataset contain?
The way the table is presented, it seems like there are only two variables. However, the correct answer is 3. The variables are
A dataset is said to be tidy if it satisfies the following conditions
Tidy data makes it easy to carry out data analysis.
Let us explore some common causes of messiness by inspecting a few datasets.
Income Distribution by Religion
Our first dataset is based on a survey done by Pew Research that examines the relationship between income and religious affiliation.
Read the dataset into your R session and inspect the first few rows to assess if it is tidy.
pew <- read.delim(file = "http://stat405.had.co.nz/data/pew.txt", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
head(pew)
Except for religion, the rest of the column headers are actually values of a lurking variable, income. This dataset violates the second rule of tidy data.
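One way to repair this kind of messiness is to stack the income columns into rows, so that income becomes a variable of its own. The sketch below uses only base R and a toy table in the same shape as the survey data. The counts are made up, the names pew_toy, pew_tidy, income_cols and freq are hypothetical, and packages such as tidyr offer dedicated functions for the same reshaping.

```r
# Toy data in the same shape as the survey table; the counts are made up.
pew_toy <- data.frame(
  religion  = c("Agnostic", "Atheist"),
  `<$10k`   = c(10, 20),
  `$10-20k` = c(30, 40),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except religion holds counts for one income bracket.
income_cols <- setdiff(names(pew_toy), "religion")

# Stack the income columns into rows: one row per religion-income pair.
pew_tidy <- data.frame(
  religion = rep(pew_toy$religion, times = length(income_cols)),
  income   = rep(income_cols, each = nrow(pew_toy)),
  freq     = unlist(pew_toy[income_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(pew_tidy)
```

In the reshaped table each of religion, income and freq has its own column, and each row is a single religion-income combination.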
tb <- read.csv( file = "http://stat405.had.co.nz/data/tb.csv", header = TRUE, stringsAsFactors = FALSE )
year, the rest of the column headers are actually values of a lurking variable, in fact a combination of two lurking variables,
weather <- read.delim( file = "http://stat405.had.co.nz/data/weather.txt", stringsAsFactors = FALSE )
This dataset seems to have two problems. First, it has variables stored in rows, in the column element. Second, it has a variable, d, whose values are spread across multiple column headers.
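Both problems can be addressed in two steps: first stack the day columns into rows, then move the values of element out into their own columns. The base R sketch below works on a toy table. The station id, the element names tmax and tmin, the column names id, year and month, and all readings are assumptions made for illustration, not values taken from the real file.

```r
# Toy data in the assumed shape of the weather table; all values are made up.
weather_toy <- data.frame(
  id      = "S1",
  year    = 2010,
  month   = 1,
  element = c("tmax", "tmin"),
  d1      = c(30, 15),
  d2      = c(28, 14),
  stringsAsFactors = FALSE
)

# Step 1: stack the day columns (d1, d2, ...) into rows.
day_cols <- grep("^d[0-9]+$", names(weather_toy), value = TRUE)
key_cols <- c("id", "year", "month", "element")
long <- weather_toy[rep(seq_len(nrow(weather_toy)), times = length(day_cols)), key_cols]
long$day   <- rep(as.integer(sub("^d", "", day_cols)), each = nrow(weather_toy))
long$value <- unlist(weather_toy[day_cols], use.names = FALSE)
rownames(long) <- NULL

# Step 2: give each element its own column by splitting and merging on the keys.
keys <- c("id", "year", "month", "day")
tmax <- long[long$element == "tmax", c(keys, "value")]
tmin <- long[long$element == "tmin", c(keys, "value")]
names(tmax)[names(tmax) == "value"] <- "tmax"
names(tmin)[names(tmin) == "value"] <- "tmin"
weather_tidy <- merge(tmax, tmin, by = keys)

print(weather_tidy)
```

The result has one row per station and day, with tmax and tmin as separate columns.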
There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.