I've been wanting to demonstrate to a friend the elegance and speed of using dplyr's join verbs (e.g. inner_join()) over base R and simple subsetting. I took a big dataset (from the nycflights13 package), started with a simple task, and to my surprise base R with simple subsetting was up to 10 times faster! So I could only really demonstrate the elegance, not the speed.

My question is: what am I missing? When do dplyr's join verbs surpass base R and simple subsetting in performance? Do they ever?

(P.S.: I know about data.table's excellent performance; I'm asking about dplyr.)
My Demo:
library(tidyverse)
library(nycflights13)
library(microbenchmark)
dim(flights)
[1] 336776 19
dim(airports)
[1] 1458 8
Task is: get the unique tailnums of all planes in flights whose destination airport's tzone was "America/New_York":
base_no_join <- function() {
  unique(flights$tailnum[flights$dest %in% airports$faa[airports$tzone == "America/New_York"]])
}

dplyr_no_join <- function() {
  flights %>%
    filter(dest %in% (airports %>%
                        filter(tzone == "America/New_York") %>%
                        pull(faa))) %>%
    pull(tailnum) %>%
    unique()
}

dplyr_join <- function() {
  flights %>%
    inner_join(airports, by = c("dest" = "faa")) %>%
    filter(tzone == "America/New_York") %>%
    pull(tailnum) %>%
    unique()
}
See that they give the same results:
all.equal(dplyr_join(), dplyr_no_join())
[1] TRUE
all.equal(dplyr_join(), base_no_join())
[1] TRUE
Now benchmark:
microbenchmark(base_no_join(), dplyr_no_join(), dplyr_join(), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
base_no_join() 9.7198 10.1067 13.16934 11.19465 13.4736 24.2831 10
dplyr_no_join() 21.2810 22.9710 36.04867 26.59595 34.4221 108.0677 10
dplyr_join() 60.7753 64.5726 93.86220 91.10475 119.1546 137.1721 10
Please help me find an example that shows this join's superiority, if it exists.
I will build on top of @Gregor's comments and @Maurits's answer to write my own answer:

First, I think @Maurits's answer somewhat misses the point by comparing base R's merge() using a join with multiple columns, because dplyr's join is much faster even when joining on a single column, as I will show below.
Second, I understand that in my original use case the answer was simply not to use a join (!). But I wish to show that a more manual approach would be slower than an actual join, so I've implemented another manual join:
library(tidyverse)
library(microbenchmark)

set.seed(2018)
nr <- 10^3
lst <- replicate(2, data.frame(
  key1 = sample(letters[1:5], nr, replace = T),
  key2 = sample(LETTERS[6:10], nr, replace = T),
  value = runif(nr)), simplify = F)
without_merge <- function() {
  # manual join: build the full Cartesian product of the two tables,
  # then keep only the rows where the keys match
  expanded <- cbind(lst[[1]][rep(1:nrow(lst[[1]]), times = nrow(lst[[2]])), ],
                    lst[[2]][rep(1:nrow(lst[[2]]), each = nrow(lst[[1]])), ])
  colnames(expanded) <- c("key1.x", "key2.x", "value.x", "key1.y", "key2.y", "value.y")
  joined <- expanded[expanded$key1.x == expanded$key1.y, -1]
  colnames(joined) <- c("key2.x", "value.x", "key1", "key2.y", "value.y")
  joined
}
res <- microbenchmark(
  dplyr_join = inner_join(lst[[1]], lst[[2]], by = c("key1")),
  base_R_with_merge = merge(lst[[1]], lst[[2]], by = c("key1")),
  base_R_without_merge = without_merge(),
  times = 20
)
res
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_join 5.7251 6.1108 22.35497 6.32195 7.3038 239.7486 20
# base_R_with_merge 716.2633 743.8472 813.63495 812.86165 862.3884 1024.9836 20
# base_R_without_merge 1848.9711 2009.4264 2100.98077 2097.19790 2174.8663 2365.7716 20
autoplot(res)
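To get a feel for how the gap behaves as the tables grow, one could rerun the same single-key comparison for a few table sizes. Below is a minimal sketch under the same setup as above; bench_one and nr_values are my own illustrative names, not part of the original benchmark, and the exact timings will of course depend on your machine:

# Sketch: repeat the single-key dplyr vs. merge() comparison for several
# table sizes; bench_one and nr_values are illustrative names only.
bench_one <- function(nr) {
  lst <- replicate(2, data.frame(
    key1 = sample(letters[1:5], nr, replace = TRUE),
    key2 = sample(LETTERS[6:10], nr, replace = TRUE),
    value = runif(nr)), simplify = FALSE)
  mb <- microbenchmark(
    dplyr_join = inner_join(lst[[1]], lst[[2]], by = "key1"),
    base_R_with_merge = merge(lst[[1]], lst[[2]], by = "key1"),
    times = 10
  )
  # keep only the median timing per expression, tagged with the table size
  summary(mb)[, c("expr", "median")] %>% mutate(n_rows = nr)
}

nr_values <- c(10^2, 10^3, 2 * 10^3)
map_dfr(nr_values, bench_one)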
Now back to my data. This is the time to move to joining on multiple columns, and for that I'll use the weather dataset, which joins to flights on 5 keys: origin, year, month, day and hour.

Task is: get the unique tailnums of all planes in flights where the temperature was above 90 degrees Fahrenheit.

Now dplyr really shows its strength, both in elegance and in speed.

First, there is the naive way of doing this, and a smarter way, which is filtering and selecting the necessary columns before joining:
library(nycflights13)

join_naive <- function() {
  flights %>%
    inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
    filter(temp > 90) %>%
    select(tailnum) %>%
    drop_na() %>%
    pull(tailnum) %>%
    unique()
}

join_smarter <- function() {
  flights %>%
    select(origin, year, month, day, hour, tailnum) %>%
    inner_join(weather %>%
                 select(origin, year, month, day, hour, temp) %>%
                 filter(temp > 90),
               by = c("origin", "year", "month", "day", "hour")) %>%
    select(tailnum) %>%
    drop_na() %>%
    pull(tailnum) %>%
    unique()
}
all.equal(join_naive(), join_smarter())
# TRUE
Next there's the base R merge() way. Say we've filtered the data beforehand to give it some edge, keeping only the minimum necessary rows and columns (note that the column-wise %in% filter below keeps a superset of the truly matching rows; merge() still does the exact matching on all 5 keys):
weather_over_90 <- weather[weather$temp > 90, c("origin", "year", "month", "day", "hour")]
weather_over_90 <- weather_over_90[complete.cases(weather_over_90), ]
flights_minimum <- flights[
flights$origin %in% weather_over_90$origin &
flights$year %in% weather_over_90$year &
flights$month %in% weather_over_90$month &
flights$day %in% weather_over_90$day &
flights$hour %in% weather_over_90$hour,
c("origin", "year", "month", "day", "hour", "tailnum")
]
flights_minimum <- flights_minimum[complete.cases(flights_minimum), ]
with_merge <- function() {
unique(merge(weather_over_90, flights_minimum, by = c("origin", "year", "month", "day", "hour"))$tailnum)
}
all.equal(sort(with_merge()), sort(join_smarter()))
# TRUE
And finally, the most manual way I could think of without an actual for loop, similar to the manual single-key join I implemented above:
without_merge <- function(df1, df2) {
colnames(df1) <- c("origin.x", "year.x", "month.x", "day.x", "hour.x")
colnames(df2) <- c("origin.y", "year.y", "month.y", "day.y", "hour.y", "tailnum")
expanded <- cbind(df1[rep(1:nrow(df1), times = nrow(df2)), ],
df2[rep(1:nrow(df2), each = nrow(df1)), ])
joined <- expanded[
expanded$origin.x == expanded$origin.y &
expanded$year.x == expanded$year.y &
expanded$month.x == expanded$month.y &
expanded$day.x == expanded$day.y &
expanded$hour.x == expanded$hour.y,
-c(1:5)
]
unique(joined$tailnum)
}
all.equal(sort(without_merge(weather_over_90, flights_minimum)),
          sort(join_smarter()))
# TRUE
Comparing:
res <- microbenchmark(
  dplyr_join_smart = join_smarter(),
  dplyr_join_naive = join_naive(),
  base_R_with_merge = with_merge(),
  base_R_without_merge = without_merge(weather_over_90, flights_minimum),
  times = 20
)
res
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_join_smart 19.6140 20.08890 21.71460 20.6103 21.79740 30.3105 20
# dplyr_join_naive 65.6180 69.08685 71.14589 71.1451 72.47165 81.9732 20
# base_R_with_merge 189.1192 193.81325 201.46174 197.7632 207.40575 231.0595 20
# base_R_without_merge 1763.1307 1814.44825 1871.61599 1840.4509 1884.59005 2162.0949 20
autoplot(res)
As can be seen, even the naive approach to dplyr's join is faster than base R's merge(), and it can be improved about 3x further by filtering and selecting the necessary columns before joining. And keep in mind that dplyr's runtime here includes the filtering!
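To make that last point concrete, here is a minimal sketch that does the filtering and selecting once, outside the timed expression, so that only the five-key inner_join() itself is measured. The names flights_small and weather_hot are mine, purely for illustration:

# Sketch: time only the join step itself, with the filtering/selecting
# already done; flights_small and weather_hot are illustrative names.
flights_small <- flights %>%
  select(origin, year, month, day, hour, tailnum)
weather_hot <- weather %>%
  select(origin, year, month, day, hour, temp) %>%
  filter(temp > 90)

microbenchmark(
  join_only = inner_join(flights_small, weather_hot,
                         by = c("origin", "year", "month", "day", "hour")),
  times = 20
)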