
How to demonstrate the power of dplyr's join verbs?


I've been wanting to demonstrate to a friend the elegance and speed of dplyr's join verbs (e.g. inner_join()) over base R and simple subsetting. I took a big dataset (from the nycflights13 package), started with a simple task, and to my surprise base R and simple subsetting were up to 10 times faster! In the end I could only really demonstrate the elegance, not the speed.

The question is: what am I missing? When do dplyr's join verbs surpass base R and simple subsetting in performance? Do they ever?

(P.S.: I know about data.table's excellent performance; I'm asking about dplyr here.)

My Demo:

library(tidyverse)
library(nycflights13)
library(microbenchmark)

dim(flights)

[1] 336776 19

dim(airports)

[1] 1458 8

The task is: get the unique tailnums of all planes in flights whose destination airport's tzone is "America/New_York":

# base R, no join: subset airports, then subset flights with %in%
base_no_join <- function() {
  unique(flights$tailnum[flights$dest %in% airports$faa[airports$tzone == "America/New_York"]])
}

# dplyr, no join: the same logic with filter() and pull()
dplyr_no_join <- function() {
  flights %>%
    filter(dest %in% (airports %>%
                        filter(tzone == "America/New_York") %>%
                        pull(faa))) %>%
    pull(tailnum) %>%
    unique()
}

# dplyr, with a join: inner_join() flights to airports, then filter
dplyr_join <- function() {
  flights %>%
    inner_join(airports, by = c("dest" = "faa")) %>%
    filter(tzone == "America/New_York") %>%
    pull(tailnum) %>%
    unique()
}

Verify that they give the same results:

all.equal(dplyr_join(), dplyr_no_join())

[1] TRUE

all.equal(dplyr_join(), base_no_join())

[1] TRUE

Now benchmark:

microbenchmark(base_no_join(), dplyr_no_join(), dplyr_join(), times = 10)
Unit: milliseconds
            expr     min      lq     mean   median       uq      max neval
  base_no_join()  9.7198 10.1067 13.16934 11.19465  13.4736  24.2831    10
 dplyr_no_join() 21.2810 22.9710 36.04867 26.59595  34.4221 108.0677    10
    dplyr_join() 60.7753 64.5726 93.86220 91.10475 119.1546 137.1721    10

Please help me find an example that shows this join's superiority, if it exists.


Solution

  • I will build on top of @Gregor's comments and @Maurits's answer to write my own answer:

    First, I think @Maurits's answer somewhat misses the point by comparing base R's merge() using a join on multiple columns, because dplyr's join is much faster even when joining on a single column, as I will show below.

    Second, I understand that in my original use case the answer was simply not to use a join at all (!). But I wish to show that a more manual approach would be slower still, so I've implemented another manual join to show it:

    library(tidyverse)
    library(microbenchmark)
    
    set.seed(2018)
    nr <- 10^3
    # two random data.frames sharing the key columns key1 and key2
    lst <- replicate(2, data.frame(
      key1 = sample(letters[1:5], nr, replace = T),
      key2 = sample(LETTERS[6:10], nr, replace = T),
      value = runif(nr)), simplify = F)
    
    # manual join on key1: build the full cross product of the two
    # tables, then keep only the rows where the keys match
    without_merge <- function() {
      expanded <- cbind(lst[[1]][rep(1:nrow(lst[[1]]), times = nrow(lst[[2]])), ],
                        lst[[2]][rep(1:nrow(lst[[2]]), each = nrow(lst[[1]])), ])
      colnames(expanded) <- c("key1.x", "key2.x", "value.x", "key1.y", "key2.y", "value.y")
      # drop the now-redundant key1.x column after filtering
      joined <- expanded[expanded$key1.x == expanded$key1.y, -1]
      colnames(joined) <- c("key2.x", "value.x", "key1", "key2.y", "value.y")
      joined
    }
    
    res <- microbenchmark(
      dplyr_join = inner_join(lst[[1]], lst[[2]], by = c("key1")),
      base_R_with_merge = merge(lst[[1]], lst[[2]], by = c("key1")),
      base_R_without_merge = without_merge(),
      times = 20
    )
    res
    # Unit: milliseconds
    #                 expr       min        lq       mean     median        uq       max neval
    #           dplyr_join    5.7251    6.1108   22.35497    6.32195    7.3038  239.7486    20
    #    base_R_with_merge  716.2633  743.8472  813.63495  812.86165  862.3884 1024.9836    20
    # base_R_without_merge 1848.9711 2009.4264 2100.98077 2097.19790 2174.8663 2365.7716    20
    
    autoplot(res)
    

    [autoplot of the microbenchmark results]
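
    The manual approach is slow mainly because of the intermediate cross product: before any filtering happens it materializes one row for every pair of rows from the two tables. A quick check of the blow-up:

    # the cross product holds nrow(lst[[1]]) * nrow(lst[[2]]) rows
    # before the key filter is applied: 10^6 rows for nr = 10^3
    nrow(lst[[1]]) * nrow(lst[[2]])
    # [1] 1e+06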

    Now back to my data. This is the time to move on to joining on multiple columns, and for that I'll use the weather dataset, which joins to flights on five keys: origin, year, month, day and hour.
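
    (A quick way to see the candidate join keys, assuming nycflights13 is loaded as below, is to intersect the column names of the two tables:)

    # columns shared by flights and weather: the five keys above, plus
    # time_hour, which combines year, month, day and hour into a
    # single timestamp and could serve as a join key on its own
    intersect(names(flights), names(weather))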

    The task is: get the unique tailnums of all planes in flights where the temperature at the origin airport was above 90 degrees Fahrenheit.

    Now dplyr really shows its strength, both in elegance and in speed.

    First, there is the naive way of doing this, and then the smarter way: filtering and selecting the necessary columns before joining:

    library(nycflights13)
    
    join_naive <- function() {
      flights %>%
        inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
        filter(temp > 90) %>%
        select(tailnum) %>%
        drop_na() %>%
        pull(tailnum) %>%
        unique()
    }
    
    join_smarter <- function() {
      flights %>%
        select(origin, year, month, day, hour, tailnum) %>%
        inner_join(weather %>%
                     select(origin, year, month, day, hour, temp) %>%
                     filter(temp > 90),
                   by = c("origin", "year", "month", "day", "hour")) %>%
        select(tailnum) %>%
        drop_na() %>%
        pull(tailnum) %>%
        unique()
    }
    
    all.equal(join_naive(), join_smarter())
    
    # TRUE
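
    Why is the second version smarter? The naive version matches all of weather against flights and carries every column of both tables through the join, filtering on temp only afterwards; the smarter version pre-filters weather and joins only the columns it actually needs. The sizes involved can be checked directly (a quick sketch, assuming the datasets are loaded):

    # hot hours are a tiny fraction of weather's hourly observations,
    # so pre-filtering shrinks one side of the join drastically
    nrow(filter(weather, temp > 90))
    nrow(weather)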
    

    Next there's the base R merge() way. Say we've filtered the data beforehand to give it some edge, keeping only the minimum necessary:

    # keep only the hot hours of weather, on the five key columns
    weather_over_90 <- weather[weather$temp > 90, c("origin", "year", "month", "day", "hour")]
    weather_over_90 <- weather_over_90[complete.cases(weather_over_90), ]
    # rough pre-filter of flights: each key is checked independently
    # with %in%, so this over-selects; merge() does the exact matching
    flights_minimum <- flights[
      flights$origin %in% weather_over_90$origin &
        flights$year %in% weather_over_90$year &
        flights$month %in% weather_over_90$month &
        flights$day %in% weather_over_90$day &
        flights$hour %in% weather_over_90$hour,
      c("origin", "year", "month", "day", "hour", "tailnum")
      ]
    flights_minimum <- flights_minimum[complete.cases(flights_minimum), ]
    
    with_merge <- function() {
      unique(merge(weather_over_90, flights_minimum, by = c("origin", "year", "month", "day", "hour"))$tailnum)
    }
    
    all.equal(sort(with_merge()), sort(join_smarter()))
    
    # TRUE
    

    And finally, the most manual way I could think of without an actual for loop, similar to my manual implementation of the single-key join above:

    # manual 5-key join: full cross product, then keep rows where all
    # five key pairs match, dropping the now-redundant .x key columns
    without_merge <- function(df1, df2) {
      colnames(df1) <- c("origin.x", "year.x", "month.x", "day.x", "hour.x")
      colnames(df2) <- c("origin.y", "year.y", "month.y", "day.y", "hour.y", "tailnum")
      expanded <- cbind(df1[rep(1:nrow(df1), times = nrow(df2)), ],
                        df2[rep(1:nrow(df2), each = nrow(df1)), ])
      joined <- expanded[
        expanded$origin.x == expanded$origin.y &
          expanded$year.x == expanded$year.y &
          expanded$month.x == expanded$month.y &
          expanded$day.x == expanded$day.y &
          expanded$hour.x == expanded$hour.y,
        -c(1:5)
        ]
      unique(joined$tailnum)
    }
    
    all.equal(sort(without_merge(weather_over_90, flights_minimum)),
              sort(join_smarter()))
    
    # TRUE
    

    Comparing:

    res <- microbenchmark(
      dplyr_join_smart = join_smarter(),
      dplyr_join_naive = join_naive(),
      base_R_with_merge = with_merge(),
      base_R_without_merge = without_merge(weather_over_90, flights_minimum),
      times = 20
    )
    res
    # Unit: milliseconds
    #                 expr       min         lq       mean    median         uq       max neval
    #     dplyr_join_smart   19.6140   20.08890   21.71460   20.6103   21.79740   30.3105    20
    #     dplyr_join_naive   65.6180   69.08685   71.14589   71.1451   72.47165   81.9732    20
    #    base_R_with_merge  189.1192  193.81325  201.46174  197.7632  207.40575  231.0595    20
    # base_R_without_merge 1763.1307 1814.44825 1871.61599 1840.4509 1884.59005 2162.0949    20
    autoplot(res)
    

    [autoplot of the benchmark results]

    As can be seen, even the naive dplyr join approach is faster than base R's merge(), and it can be improved roughly threefold by filtering and selecting the columns before joining. And keep in mind that dplyr's runtime includes the filtering!
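
    As a final aside: since none of weather's columns are needed in the output, dplyr's semi_join(), a filtering join that keeps the rows of flights with a match in weather but adds no columns, is arguably the most idiomatic verb for this task. A minimal sketch (not benchmarked here) that should return the same set of tail numbers:

    join_semi <- function() {
      flights %>%
        semi_join(weather %>% filter(temp > 90),
                  by = c("origin", "year", "month", "day", "hour")) %>%
        filter(!is.na(tailnum)) %>%
        pull(tailnum) %>%
        unique()
    }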