The split-apply-combine paradigm (split the data into groups, apply a function to each group, then combine the results) can be concisely summarized using the diagram below (thanks, Hadley!).

Base R has several functions that make this easy. Let us start by revisiting Exercise 3 from the previous lesson. This time, we will use the base function aggregate to carry out the computations. We use the formula interface provided by aggregate.
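# mutate() comes from the plyr package (dplyr provides one as well)
bnames2_b <- mutate(bnames2_b, tot = prop * births)
# The formula is passed unnamed because the name of this argument differs across R versions
result <- aggregate(tot ~ name, data = bnames2_b, FUN = sum)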
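To make the three steps explicit, here is a minimal sketch that carries them out by hand and then compares the result with a single aggregate call. It uses the built-in warpbreaks dataset rather than the baby names data, so the variable names below are illustrative only.

```r
# Split: break the vector of breaks into one piece per wool type
pieces <- split(warpbreaks$breaks, warpbreaks$wool)

# Apply: compute the total for each piece
totals <- lapply(pieces, sum)

# Combine: collect the per-group results into a single data frame
manual <- data.frame(wool = names(totals), breaks = unlist(totals),
                     row.names = NULL)

# aggregate() performs all three steps in one call
auto <- aggregate(breaks ~ wool, data = warpbreaks, FUN = sum)
```

Both manual and auto hold one row per wool type with the same totals.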
Exercise 1
What is the most popular name for each gender between 2000 and 2008? (Hint: the aggregate function accepts a subset argument!) Once again, enter your guesses on Etherpad before starting out!
Solution 1
result2 <- aggregate(tot ~ name + sex, data = bnames2_b, FUN = sum,
                     subset = (year >= 2000 & year <= 2008))
most_pop_boy <- arrange(subset(result2, sex == "boy"), desc(tot))[1, "name"]
most_pop_girl <- arrange(subset(result2, sex == "girl"), desc(tot))[1, "name"]
The most popular names between 2000 and 2008 are Jacob and Emily.
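The same pattern can be tried without the baby names data. The sketch below uses a small made-up data frame (toy, with hypothetical names and totals) to show the subset argument, and picks the top name per sex with base R's order and duplicated instead of arrange.

```r
# Hypothetical toy data standing in for bnames2_b
toy <- data.frame(
  year = c(1999, 2000, 2000, 2001, 2001, 2001),
  name = c("Ann", "Ann", "Bob", "Ann", "Bob", "Cal"),
  sex  = c("girl", "girl", "boy", "girl", "boy", "boy"),
  tot  = c(50, 10, 20, 30, 5, 40)
)

# Sum tot by name and sex, using only rows from 2000 onwards
totals <- aggregate(tot ~ name + sex, data = toy, FUN = sum,
                    subset = year >= 2000)

# Sort by descending total, then keep the first row seen for each sex
totals <- totals[order(-totals$tot), ]
top <- totals[!duplicated(totals$sex), ]
```

Here top has one row per sex, holding the name with the largest total in the selected years.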
So far, we have seen split-apply-combine applied in the context of data frames. However, you can think of similar problems for other data structures. Here are some examples.
You may want to apply a function to each element of a list, to each row or column of a matrix, or across multiple sets of arguments.
Base R has a family of functions, popularly referred to as the apply family, to carry out such operations.
apply
apply applies a function to each row or column of a matrix.
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# The second argument is the margin: 1 applies the function over rows, 2 over columns
apply(m, 1, sum)
## [1] 12 14 16 18 20 22 24 26 28 30
apply(m, 2, sum)
## [1] 55 155
apply(m, 1, mean)
## [1] 6 7 8 9 10 11 12 13 14 15
apply(m, 2, mean)
## [1] 5.5 15.5
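apply is not limited to built-in summaries. As a brief sketch, the function can be an anonymous one, and it may return more than one value per row or column.

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

# An anonymous function computes the spread (max minus min) of each row
row_spread <- apply(m, 1, function(r) max(r) - min(r))

# When the function returns several values, apply() returns a matrix
# with one column per row or column that was processed
col_range <- apply(m, 2, range)

# For plain sums, rowSums() gives the same answer as apply(m, 1, sum)
same_sums <- all(apply(m, 1, sum) == rowSums(m))
```

For simple sums and means, rowSums, colSums, rowMeans and colMeans are usually the more direct choice.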
lapply
lapply applies a function to each element of a list and always returns a list.
my_list <- list(a = 1:10, b = 2:20)
lapply(my_list, mean)
## $a
## [1] 5.5
##
## $b
## [1] 11
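Two further uses are worth a short sketch: any extra arguments given to lapply are passed on to the function, and since a data frame is a list of columns, lapply also works column by column.

```r
my_list <- list(a = 1:10, b = 2:20)

# Arguments after the function name are passed on to it for every element
quartiles <- lapply(my_list, quantile, probs = c(0.25, 0.75))

# A data frame is a list of columns, so this returns the class of each column
col_classes <- lapply(iris, class)
```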
sapply
sapply is a more user-friendly version of lapply: it simplifies the result to a vector or matrix where possible and returns a list otherwise. Let us work with the same list we just created.
my_list
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x <- sapply(my_list, mean)
x
## a b
## 5.5 11.0
class(x)
## [1] "numeric"
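The shape of the result depends on what the function returns. The following sketch shows the matrix case, the fallback to a list, and vapply, a stricter base R relative that checks every result against a template.

```r
my_list <- list(a = 1:10, b = 2:20)

# Each call returns two values, so sapply() simplifies to a 2 x 2 matrix
r <- sapply(my_list, range)

# Results of different lengths cannot be simplified, so a list comes back
evens <- sapply(my_list, function(v) v[v %% 2 == 0])

# vapply() fails with an error unless every result is a single number
means <- vapply(my_list, mean, FUN.VALUE = numeric(1))
```

Because the return type of sapply can change with the input, vapply is often the safer choice inside functions and scripts.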
mapply
mapply is more or less a multivariate version of sapply. It applies a function to the first elements of each argument, then to the second elements, and so on.
list_1 <- list(a = c(1:10), b = c(11:20))
list_1
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 11 12 13 14 15 16 17 18 19 20
list_2 <- list(c = c(21:30), d = c(31:40))
list_2
## $c
## [1] 21 22 23 24 25 26 27 28 29 30
##
## $d
## [1] 31 32 33 34 35 36 37 38 39 40
mapply(sum, list_1$a, list_1$b, list_2$c, list_2$d)
## [1] 64 68 72 76 80 84 88 92 96 100
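A few smaller cases may make the element-by-element pairing easier to see. This is only a sketch; the helper functions are anonymous and the scale argument is made up for illustration.

```r
# Three calls: rep(1, 3), rep(2, 2) and rep(3, 1); the lengths differ,
# so the result stays a list
reps <- mapply(rep, 1:3, 3:1)

# Element-wise over two vectors; the results simplify to a vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# MoreArgs holds arguments that stay fixed across all calls
powers <- mapply(function(x, p, scale) scale * x^p, 1:3, 1:3,
                 MoreArgs = list(scale = 10))
```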
tapply
tapply applies a function to subsets of a vector, where the subsets are defined by one or more grouping factors.
head(warpbreaks)
## breaks wool tension
## 1 26 A L
## 2 30 A L
## 3 54 A L
## 4 25 A L
## 5 70 A L
## 6 52 A L
with(warpbreaks, tapply(breaks, list(wool, tension), mean))
## L M H
## A 44.56 24.00 24.56
## B 28.22 28.78 18.78
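The result of tapply is an array whose dimensions are named after the factor levels, so individual cells can be looked up by name. A short sketch using the same dataset:

```r
# Two grouping factors give a matrix of group means
means <- with(warpbreaks, tapply(breaks, list(wool, tension), mean))

# Look up a single cell by its factor levels: wool A at tension L
a_low <- means["A", "L"]

# A single grouping factor gives a named one-dimensional result;
# length counts the observations in each group
counts <- tapply(warpbreaks$breaks, warpbreaks$tension, length)
```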
by
by applies a function to subsets of a data frame.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
by(iris[, 1:2], iris[, "Species"], summary)
## iris[, "Species"]: setosa
## Sepal.Length Sepal.Width
## Min. :4.30 Min. :2.30
## 1st Qu.:4.80 1st Qu.:3.20
## Median :5.00 Median :3.40
## Mean :5.01 Mean :3.43
## 3rd Qu.:5.20 3rd Qu.:3.67
## Max. :5.80 Max. :4.40
## --------------------------------------------------------
## iris[, "Species"]: versicolor
## Sepal.Length Sepal.Width
## Min. :4.90 Min. :2.00
## 1st Qu.:5.60 1st Qu.:2.52
## Median :5.90 Median :2.80
## Mean :5.94 Mean :2.77
## 3rd Qu.:6.30 3rd Qu.:3.00
## Max. :7.00 Max. :3.40
## --------------------------------------------------------
## iris[, "Species"]: virginica
## Sepal.Length Sepal.Width
## Min. :4.90 Min. :2.20
## 1st Qu.:6.22 1st Qu.:2.80
## Median :6.50 Median :3.00
## Mean :6.59 Mean :2.97
## 3rd Qu.:6.90 3rd Qu.:3.17
## Max. :7.90 Max. :3.80
by(iris[, 1:2], iris[, "Species"], sum)
## iris[, "Species"]: setosa
## [1] 421.7
## --------------------------------------------------------
## iris[, "Species"]: versicolor
## [1] 435.3
## --------------------------------------------------------
## iris[, "Species"]: virginica
## [1] 478.1
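The object returned by by prints nicely but is awkward to compute with. One common idiom, sketched here, is to stack the per-group results with do.call and rbind.

```r
# Apply colMeans() to the four measurement columns within each species
means_by_species <- by(iris[, 1:4], iris$Species, colMeans)

# The result behaves like a list with one element per group, so the
# pieces can be stacked into a matrix with one row per species
means_matrix <- do.call(rbind, means_by_species)
```

means_matrix has three rows (one per species) and four columns (one per measurement).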
replicate
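replicate evaluates an expression repeatedly, which makes it extremely useful for generating datasets for simulation purposes. Its final argument, simplify, turns the result into a vector or matrix if possible; simplification is on by default, which is why both calls below return a matrix.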
replicate(10, rnorm(10))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] -0.7645 0.3185895 -0.5689 -0.319964 1.49882 -0.1593 -0.5763
## [2,] -1.4341 -0.6606781 0.1524 -1.198101 0.66228 -2.5825 -1.4917
## [3,] -0.4506 -1.2797852 0.8286 0.996100 0.85089 -1.4561 -1.8850
## [4,] -1.1801 -0.4159011 -0.2935 2.993512 0.75832 1.1798 -1.4708
## [5,] 0.8747 0.2859087 0.3850 0.135864 -0.06600 -1.1847 -0.6427
## [6,] 0.5061 -1.7811872 0.4246 -0.150876 -0.41276 -1.8907 -1.4703
## [7,] -1.2202 -0.0261400 -1.2307 -0.002978 -0.05067 -0.8345 0.5299
## [8,] -1.9122 -0.0492036 -0.2750 1.169480 -0.07576 -0.4210 -0.2340
## [9,] 0.7003 -0.0005288 0.4755 -1.018941 0.71602 -0.4772 1.0729
## [10,] -0.9529 0.4603188 -0.8582 0.276076 1.89744 -1.4785 -0.8311
## [,8] [,9] [,10]
## [1,] 1.4185 1.11500 0.69105
## [2,] -1.0606 0.59550 1.17109
## [3,] -1.4184 0.03055 0.38472
## [4,] -0.5362 -0.16319 0.09477
## [5,] -1.1632 0.23351 -2.82787
## [6,] 0.6561 1.05047 -0.45404
## [7,] -1.9730 -0.42788 -0.52041
## [8,] 0.2142 0.06984 -0.48828
## [9,] -0.8232 0.18984 1.66038
## [10,] -0.3286 -0.23661 -0.14971
replicate(10, rnorm(10), simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.40054 0.03782 0.78123 0.5410 -1.91830 1.28884 0.13196
## [2,] -0.48184 0.45810 -1.28727 1.9267 2.04608 -0.44330 -0.68003
## [3,] -0.70475 -0.03546 -0.87877 -0.6897 -0.17008 -0.03965 0.19166
## [4,] 0.55422 1.91311 -0.30243 0.6063 0.08633 0.05310 1.73896
## [5,] 0.07958 1.85288 -0.18641 0.5110 0.43457 -0.66518 3.29714
## [6,] 1.13159 0.47895 -1.34840 -0.1361 -0.23868 0.80694 -0.58947
## [7,] 1.50151 -1.70485 0.18591 1.3231 0.02367 0.48548 0.17218
## [8,] -0.75878 0.25700 0.04495 1.1295 -0.06657 -0.68207 -1.44063
## [9,] 1.22445 -0.62720 0.61522 -0.6100 0.11585 -0.21544 -0.34115
## [10,] -1.38435 1.02435 -0.37469 -0.4636 -0.07225 -1.53654 0.04528
## [,8] [,9] [,10]
## [1,] -0.04667 0.1478 -1.5232
## [2,] 0.13871 0.2219 -0.9314
## [3,] -0.71036 -0.4076 0.3054
## [4,] -0.53646 -1.4261 0.4189
## [5,] -1.44751 1.9548 1.3638
## [6,] -0.64740 -0.3214 -0.3939
## [7,] -1.34268 1.3749 0.7859
## [8,] -0.04138 -0.9589 0.6699
## [9,] 1.64503 0.9598 1.3884
## [10,] -0.93664 0.4841 0.7852
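As a small simulation sketch, replicate can collect one summary value per run, and simplify = FALSE keeps the raw results as a list. The seed value is arbitrary and only makes the run reproducible.

```r
set.seed(42)  # arbitrary seed so the random draws are reproducible

# Each replication draws 10 standard normal values and returns their mean,
# so the result simplifies to a numeric vector of length 1000
sample_means <- replicate(1000, mean(rnorm(10)))

# With simplify = FALSE the three draws are kept as a list of vectors
draws <- replicate(3, rnorm(5), simplify = FALSE)
```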
While these functions in base R get the job done, their inconsistent syntax often trips up users. Over the years, several alternatives have emerged that aim to provide a simpler and more consistent syntax for split-apply-combine. In the next lesson, we will explore three packages in particular: plyr, data.table and dplyr.