Minimizing repetition with further replication
When you first started in R you likely were writing simple code to generate one outcome.
 "Hello world!"
5 * 6
x <- c(1, 2, 3, 4, 5)
 1 2 3 4 5
This is great: you are learning about strings, math, and vectors in R!
Then you get started with some basic analyses. You want to see if you can find the mean of some numbers.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))

# form dataframe and take mean of salary column
employ_data <- data.frame(employee, salary, startdate)
mean(employ_data$salary)
Eventually you hopefully get exposed to the tidyverse, and you find how this “opinionated collection of R packages designed for data science” makes data analysis in R easier and more readable!
mtcars %>% group_by(cyl) %>% summarize(mean(mpg))
# A tibble: 3 x 2
    cyl `mean(mpg)`
  <dbl>       <dbl>
1     4        26.7
2     6        19.7
3     8        15.1
Everything is going great! You’ve likely replaced Excel at this point, and potentially SPSS or some other statistical software suite! But then you run into a problem where you need to use a function repeatedly.
You could use something like the following code to calculate one-way ANOVAs for some dependent variables and a set independent variable:
aov_mpg <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_mpg)

aov_disp <- aov(disp ~ factor(cyll), data = mtcars)
summary(aov_disp)

aov_hp <- aov(hp ~ factor(cyl), data = mrcars)
summry(aov_hpp)

aov_wt <- aov(wt ~ factor(cyl), datas = mtcars)
summary(aov_wt)
But you copy-pasted the code 3x, and oops, you made some minor spelling mistakes that throw errors! (The above code leads to errors!)
Also, what if you realized that you actually wanted to run these ANOVAs for number of gears instead of number of cylinders? You would have to go back and change the factor(cyl) call to factor(gear) 4x! This is not very efficient, and you’re more likely to end up with mistakes as you have to type everything multiple times!
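One way to avoid retyping the same call is to wrap it in a small function and apply that function to each dependent variable. The sketch below uses base R only; the helper name run_aov and its arguments are hypothetical, chosen for illustration, and this is just one possible approach rather than the definitive one.

```r
# Hypothetical helper: one-way ANOVA of `dv` on factor(`iv`)
run_aov <- function(dv, iv = "cyl", data = mtcars) {
  # reformulate() builds a formula such as mpg ~ factor(cyl) from strings
  form <- reformulate(sprintf("factor(%s)", iv), response = dv)
  aov(form, data = data)
}

dep_vars <- c("mpg", "disp", "hp", "wt")

# The grouping variable is named exactly once here
aov_models <- lapply(setNames(dep_vars, dep_vars), run_aov, iv = "cyl")

# Print the ANOVA table for each dependent variable
lapply(aov_models, summary)
```

With this layout, switching from cylinders to gears means changing iv = "cyl" to iv = "gear" in a single place, and each variable name is typed only once.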
How about another example?
Let’s calculate the R-squared values for the linear relationship between Weight and Miles per Gallon, according to the number of Cylinders.
I have written code below that does this for 4 cylinder cars from the mtcars dataset. This is a worst-case scenario: you know some dplyr code (dplyr::filter), but are not comfortable with the pipe. That’s fine, you accomplish your goal, but with a lot of code! You would have to duplicate this code for 6 cylinder and 8 cylinder cars, for even more code…
library(tidyverse)

# create df for 4 cylinder cars
cyl_4 <- filter(mtcars, cyl == 4)

# create a linear model on 4 cyl cars
lm_4 <- lm(mpg ~ wt, data = cyl_4)

# get the summary
lm_4_summary <- summary(lm_4)

# get the r.squared value
lm_4cyl_r_squared <- lm_4_summary["r.squared"]

# check the value
lm_4cyl_r_squared
$r.squared
[1] 0.5086326
Alternatively, you could do the same thing with the pipe. That is a lot less typing, but doing this for all 3 subsets still means copy-pasting multiple times. If you also wanted a linear model of mpg ~ disp in addition to mpg ~ wt, you would have to duplicate the code 3 more times and change it 3 more times. This may not seem like a big deal, but it becomes a huge deal once you start to scale up the code (say 10+ times or 100+ times, etc).
# piped analysis
lm_4cyl_r_squared <- mtcars %>%
  filter(cyl == 4) %>%
  lm(mpg ~ wt, data = .) %>%
  summary() %>%
  .$"r.squared"

# check output
lm_4cyl_r_squared
[1] 0.5086326
To solve this issue of minimizing repetition with further replication, we can dive straight into purrr! To read more about purrr, Hadley Wickham recommends the iteration chapter from “R for Data Science”; alternatively, you can look at the purrr documentation. Jenny Bryan also has a great purrr tutorial. You can load purrr by itself, but it is also loaded as part of the tidyverse library.
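As a first taste, here is a minimal sketch of how the R-squared example might look with purrr. It assumes purrr is installed and uses base R's split() to create one data frame per cylinder group; the object names (by_cyl, models, r_squared) are made up for this illustration.

```r
library(purrr)

# Split mtcars into one data frame per cylinder count (4, 6, 8)
by_cyl <- split(mtcars, mtcars$cyl)

# Fit the same linear model to each subset
models <- map(by_cyl, function(df) lm(mpg ~ wt, data = df))

# Summarise each model, then extract r.squared as a named numeric vector
r_squared <- map_dbl(map(models, summary), "r.squared")

r_squared
```

The model formula now appears once, so trying mpg ~ disp instead of mpg ~ wt is a single edit, and the result for 4 cylinder cars should match the 0.5086326 computed by hand above.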