Raw data in the real-world is often untidy
and poorly formatted. Furthermore, it may lack appropriate details of the study. Correcting data in place can be a dangerous exercise since the original raw data would get overwritten and there would be no way to audit this process or recover from mistakes made during this time. A good data practice would be to maintain the original data, but use a programmatic script to clean it, fix mistakes and save that cleaned dataset for further analysis. In this lesson, you will learn all about tidy data.
Question: Consider the following data below. How many variables does this dataset contain?
males | females |
---|---|
4 | 1 |
2 | 5 |
The way the table is presented, it seems like there are only two variables. However, the correct answer is 3. The variables are injured?
, count
and gender
.
A dataset is said to be tidy if it satisfies the following conditions
Tidy data makes it easy to carry out data analysis.
Let us explore some common causes of messiness by inspecting a few datasets.
Income Distribution by Religion
Our first dataset is based on a survey done by Pew Research that examines the relationship between income and religious affiliation.
Read the dataset into your R session and inspect the first few rows to assess if it is tidy.
pew <- read.delim(
file = "http://stat405.had.co.nz/data/pew.txt",
header = TRUE,
stringsAsFactors = FALSE,
check.names = F
)
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k |
---|---|---|---|---|---|---|---|---|---|
Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 |
Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 |
Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 |
Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 |
Except for religion
, the rest of the columns headers are actually values of a lurking variable income
. This dataset violates the second rule of tidy data.
Tuberculosis Incidence
tb <- read.csv(
file = "http://stat405.had.co.nz/data/tb.csv",
header = TRUE,
stringsAsFactors = FALSE
)
iso2 | year | new_sp | new_sp_m04 | new_sp_m514 | new_sp_m014 | new_sp_m1524 |
---|---|---|---|---|---|---|
ZW | 2004 | 14581 | 187 | 833 | ||
ZW | 2005 | 13155 | 210 | 837 | ||
ZW | 2006 | 12718 | 215 | 736 | ||
ZW | 2007 | 10583 | 6 | 132 | 138 | 500 |
ZW | 2008 | 9830 | 127 | 614 |
Except for iso2
and year
, the rest of the columns headers are actually values of a lurking variable, in fact combination of two lurking variables, gender
and age
.
Weather Data
weather <- read.delim(
file = "http://stat405.had.co.nz/data/weather.txt",
stringsAsFactors = FALSE
)
id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 |
---|---|---|---|---|---|---|---|---|---|---|
MX000017004 | 2010 | 10 | TMIN | 140 | 129 | |||||
MX000017004 | 2010 | 11 | TMAX | 313 | 272 | 263 | ||||
MX000017004 | 2010 | 11 | TMIN | 163 | 120 | 79 | ||||
MX000017004 | 2010 | 12 | TMAX | 299 | 278 | |||||
MX000017004 | 2010 | 12 | TMIN | 138 | 105 |
This dataset seems to have two problems. First, it has variables in the rows in the column element
. Second, it has a variable d
in the column header spread across multiple columns.
There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.