Star (-) Watch (-)

Introduction to R

Tidy Data

Raw data in the real-world is often untidy and poorly formatted. Furthermore, it may lack appropriate details of the study. Correcting data in place can be a dangerous exercise since the original raw data would get overwritten and there would be no way to audit this process or recover from mistakes made during this time. A good data practice would be to maintain the original data, but use a programmatic script to clean it, fix mistakes and save that cleaned dataset for further analysis. In this lesson, you will learn all about tidy data.

Question: Consider the following data below. How many variables does this dataset contain?

males	females
4	1
2	5

[+/-] Solution

The way the table is presented, it seems like there are only two variables. However, the correct answer is 3. The variables are injured?, count and gender.

What is Tidy Data?

A dataset is said to be tidy if it satisfies the following conditions

observations are in rows
variables are in columns
contained in a single dataset.

Tidy data makes it easy to carry out data analysis.

Explore Messy Data

Let us explore some common causes of messiness by inspecting a few datasets.

Income Distribution by Religion

Our first dataset is based on a survey done by Pew Research that examines the relationship between income and religious affiliation.

Read the dataset into your R session and inspect the first few rows to assess if it is tidy.

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F
)

religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k
Agnostic	27	34	60	81	76	137	122	109	84
Atheist	12	27	37	52	35	70	73	59	74
Buddhist	27	21	30	34	33	58	62	39	53
Catholic	418	617	732	670	638	1116	949	792	633

[+/-] Solution

Except for religion, the rest of the columns headers are actually values of a lurking variable income. This dataset violates the second rule of tidy data.

Tuberculosis Incidence

tb <- read.csv(
  file = "http://stat405.had.co.nz/data/tb.csv",
  header = TRUE, 
  stringsAsFactors = FALSE
)

iso2	year	new_sp	new_sp_m04	new_sp_m514	new_sp_m014	new_sp_m1524
ZW	2004	14581			187	833
ZW	2005	13155			210	837
ZW	2006	12718			215	736
ZW	2007	10583	6	132	138	500
ZW	2008	9830			127	614

[+/-] Solution

Except for iso2 and year, the rest of the columns headers are actually values of a lurking variable, in fact combination of two lurking variables, gender and age.

Weather Data

weather <- read.delim(
 file = "http://stat405.had.co.nz/data/weather.txt",
 stringsAsFactors = FALSE
)

id	year	month	element	d1	d2	d4	d5	d6	d7
MX000017004	2010	10	TMIN				140		129
MX000017004	2010	11	TMAX		313	272	263
MX000017004	2010	11	TMIN		163	120	79
MX000017004	2010	12	TMAX	299				278
MX000017004	2010	12	TMIN	138				105

[+/-] Solution

This dataset seems to have two problems. First, it has variables in the rows in the column element. Second, it has a variable d in the column header spread across multiple columns.

Causes of Messiness

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.

Column headers are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
Multiple types of experimental unit stored in the same table
One type of experimental unit stored in multiple tables