Star (-) Watch (-)

Introduction to R

Tidy Data

Raw data in the real-world is often untidy and poorly formatted. Furthermore, it may lack appropriate details of the study. Correcting data in place can be a dangerous exercise since the original raw data would get overwritten and there would be no way to audit this process or recover from mistakes made during this time. A good data practice would be to maintain the original data, but use a programmatic script to clean it, fix mistakes and save that cleaned dataset for further analysis. In this lesson, you will learn all about tidy data.

Question: Consider the following data below. How many variables does this dataset contain?

males females
4 1
2 5

[+/-] Solution

The way the table is presented, it seems like there are only two variables. However, the correct answer is 3. The variables are injured?, count and gender.

What is Tidy Data?

A dataset is said to be tidy if it satisfies the following conditions

  1. observations are in rows
  2. variables are in columns
  3. contained in a single dataset.

Tidy data makes it easy to carry out data analysis.

Explore Messy Data

Let us explore some common causes of messiness by inspecting a few datasets.

Income Distribution by Religion

Our first dataset is based on a survey done by Pew Research that examines the relationship between income and religious affiliation.

Read the dataset into your R session and inspect the first few rows to assess if it is tidy.

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F
)

religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k
Agnostic 27 34 60 81 76 137 122 109 84
Atheist 12 27 37 52 35 70 73 59 74
Buddhist 27 21 30 34 33 58 62 39 53
Catholic 418 617 732 670 638 1116 949 792 633

[+/-] Solution

Except for religion, the rest of the columns headers are actually values of a lurking variable income. This dataset violates the second rule of tidy data.

Tuberculosis Incidence

tb <- read.csv(
  file = "http://stat405.had.co.nz/data/tb.csv",
  header = TRUE, 
  stringsAsFactors = FALSE
)

iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524
ZW 2004 14581 187 833
ZW 2005 13155 210 837
ZW 2006 12718 215 736
ZW 2007 10583 6 132 138 500
ZW 2008 9830 127 614

[+/-] Solution

Except for iso2 and year, the rest of the columns headers are actually values of a lurking variable, in fact combination of two lurking variables, gender and age.

Weather Data

weather <- read.delim(
 file = "http://stat405.had.co.nz/data/weather.txt",
 stringsAsFactors = FALSE
)

id year month element d1 d2 d3 d4 d5 d6 d7
MX000017004 2010 10 TMIN 140 129
MX000017004 2010 11 TMAX 313 272 263
MX000017004 2010 11 TMIN 163 120 79
MX000017004 2010 12 TMAX 299 278
MX000017004 2010 12 TMIN 138 105

[+/-] Solution

This dataset seems to have two problems. First, it has variables in the rows in the column element. Second, it has a variable d in the column header spread across multiple columns.

Causes of Messiness

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.

  • Column headers are values, not variable names
  • Multiple variables are stored in one column
  • Variables are stored in both rows and columns
  • Multiple types of experimental unit stored in the same table
  • One type of experimental unit stored in multiple tables