I wanted to know if there was a more efficient way to add a tally to a dataset in R.
Using the mpg dataset, this is how I do it using the mpg dataset as an example.
mpg %>%
group_by(manufacturer) %>%
count() %>%
right_join(
mpg
)
So essentially, I want a count of the number of unique observations in the manufacturer column. It works fine as this is quite a small dataset, but I'm working with datasets with over 100k observations and wanted to find a better way to do it than to join in this way.
To give context, the number of unique observations are used as denominators for subsequent analyses.
If you want to go fast, you can try data.table
:
library(data.table)
res <- data.table(mpg)[,':='(cnt = .N), by = manufacturer]
res
manufacturer model displ year cyl trans drv cty hwy fl class cnt
1: audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 18
2: audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 18
3: audi a4 2.0 2008 4 manual(m6) f 20 31 p compact 18
4: audi a4 2.0 2008 4 auto(av) f 21 30 p compact 18
5: audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 18
---
230: volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize 27
231: volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize 27
232: volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize 27
233: volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize 27
234: volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize 27
Benchmark (using @phiver nice solution) :
library(dplyr)
library(microbenchmark)
microbenchmark(dplyr = mpg %>% group_by(manufacturer) %>% add_tally() ,
data.table = data.table(mpg)[,':='(cnt = .N), by = manufacturer])
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 8.201807 8.557434 9.599122 9.018660 9.922339 17.425479 100
data.table 1.245440 1.370666 1.615039 1.470719 1.691733 6.391889 100