Search code examples
sqlrdplyrstringrdbplyr

Using string matching like grepl in a dbplyr pipeline


dbplyr is very handy as it convert dplyr code into SQL. This works really well except when it doesn't. For example i am trying to subset rows by partially matching a string against values in a column. With exception of postgres, it appears as though this isn't yet implemented in dbplyr. Am I missing some {stringr} function that would accomplish the below:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
data("flights", package = "nycflights13")

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", flights)

## works in dplyr
flights %>%
  filter(grepl("N", tailnum))
#> # A tibble: 334,264 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # … with 334,254 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

## no function translation to grepl
tbl(con, "flights") %>%
  filter(grepl("N", tailnum)) %>%
  collect()
#> Error: no such function: grepl

## Also not implemented for stringr
library(stringr)
tbl(con, "flights") %>%
  filter(str_detect(tailnum, "N")) %>%
  collect()
#> Error: str_detect() is not available in this SQL variant

dbDisconnect(con)

Solution

  • We may use %like%

    tbl(con, "flights") %>%
      dplyr::filter(tailnum %like% "%N%") %>%
      collect()
    

    -output

    # A tibble: 334,264 × 19
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
     1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227
     2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227
     3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160
     4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183
     5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116
     6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150
     7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158
     8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53
     9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140
    10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138
    # … with 334,254 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>