Search code examples
rtokenizepurrrstrsplitdbplyr

Split string into rows with dbplyr


So I would like to split a string, in this instance on a space, into rows. I would like to do this using dbplyr, which is proving to be an issue. Obviously, if I collect there area tons of ways to do this. Although I need to do this without collecting. Here is some dummy data, one way to do with collecting, and the same way not working without collecting

 library(dbplyr)
 library(dplyr)
 library(tidyr)

 con <- DBI::dbConnect(...)

 df <- tbl(
   src = con,
   from = sql(
   "SELECT 1 AS ID, 'I LIKE CATS' AS TEXT FROM DUAL
   UNION ALL
   SELECT 2 AS ID, 'I LIVE IN A BIG HOUSE' AS TEXT FROM DUAL
   UNION ALL
   SELECT 3 AS ID, 'ARE THERE MANY ORANGES IN THE CUPBOARD' AS TEXT FROM DUAL")
 )

 df %>% 
   collect() %>% 
   mutate(WORD = strsplit(TEXT, " ")) %>% 
   unnest(WORD)

 df %>% 
   mutate(WORD = strsplit(TEXT, " "))

There was a similar (ish) question that went unanswered here: How to split one SQL column into multiple columns in r

There is the use of purrr::map and purrr::reduce with dbplyr to union function outputs that I've tried to use, but still to no avail: Using purrr::map2() with dbplyr

Any help would be massively appreciated, thanks!


Solution

  • I recommend starting by locating the spaces " " in your text. If you know the location of the spaces in each string, then it is straightforward to fetch just the text in between the spaces using substr which dbplyr can translate into SQL.

    Finding the location of all the spaces in your text, when there is a non-constant number, is non-trivial to do in SQL. The best way appears to via nested SQL query (as per this answer) which you can not write using dbplyr translation.

    One alternative that you can use is to find the first space, trim the string and iterate:

    library(dplyr)
    library(dbplyr)
    library(stringr)
    
    out1 <- df %>%
      mutate(first_space = str_locate(' ', TEXT)) %>%
      mutate(first_word = substr(TEXT, 1, first_space - 1),
             remaining_text = substr(TEXT, first_space + 1, nchar(TEXT)))
    
    keep <- out %>%
      select(ID, first_word) %>%
      rename(WORD = first_word)
    
    remainder <- out %>%
      select(ID, remaining_text) %>%
      filter(!is.na(remaining_text)) %>%
      rename(TEXT = remaining_text)
    

    I would then append keep to an output table, check the number of rows in remainder, replace df with remainder and iterate.

    Note: str_locate finds the index of the first space. In my testing, not every flavor of SQL had a translation defined for this. So this approach may fail depending on your database.

    However, every flavor of SQL should have an equivalent function. You should be able to enter this function directly, and because dbplyr does not have a translation defined for it the function should be passed untranslated to the server.

    For example, in SQL server CHARINDEX has the same purpose as str_locate. Hence you could use CHARINDEX(' ', TEXT) instead of str_locate(' ', TEXT).