
R: Self-created function with tokenization and %like% works only on first token


I have a data frame with two columns. The second column (unit) mostly contains the first word of the first column (str), as shown below:

> df <- data.frame(str = c("cups vegetable soup", "cup brown lentils", "carrot", "stalks celery"), unit = c("cups", "cup", NA, "stalks"), stringsAsFactors = FALSE)

> df
                  str   unit
1 cups vegetable soup   cups
2   cup brown lentils    cup
3              carrot   <NA>
4       stalks celery stalks

I want to remove the first word of df$str when it matches the value of df$unit in the same row.

To do that, I wrote the function DelFunction, shown below:

DelFunction <- function(x, y) {
  tokens_x <- x[[1]]
  tokens_y <- y[[1]]
  if ((tokens_x %like% tokens_y) == TRUE) {
    regmatches(tokens_x, regexpr("[a-z]+", tokens_x)) <- ""
  }
  tokens_x
}

I then applied it to the str column with sapply:

df$str<- sapply(df$str, DelFunction, df$unit)

I get the result below. The code only works for the first row, where the word "cups" is deleted:

> df
                str   unit
1    vegetable soup   cups
2 cup brown lentils    cup
3            carrot   <NA>
4     stalks celery stalks

The goal is the following result:

> df
             str   unit
1 vegetable soup   cups
2  brown lentils    cup
3         carrot   <NA>
4         celery stalks

Does anyone know how to approach this problem?

Thanks!


Solution

  • Possible answer:

    library(stringr)
    library(dplyr, warn.conflicts = FALSE)
    
    df <-
      data.frame(
        str = c(
          "cups vegetable soup",
          "cup brown lentils",
          "carrot",
          "stalks celery"
        ),
        unit = c("cups", "cup", NA, "stalks"),
        stringsAsFactors = FALSE
      )
    
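    # Remove the first match of `unit` from `str`. Rows where `unit` is NA
    # come back as NA, so the second mutate() restores their original text.
    df %>%
      mutate(str = trimws(str_replace(str, unit, ''))) %>%
      mutate(str = if_else(is.na(unit), df$str, str)) -> df2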
    
    df2
    #>              str   unit
    #> 1 vegetable soup   cups
    #> 2  brown lentils    cup
    #> 3         carrot   <NA>
    #> 4         celery stalks
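
Note that str_replace(str, unit, '') treats unit as a regular expression and removes its first match anywhere in the string, not only at the start. If the match has to be the leading word, a minimal base R sketch could look like the one below. It assumes the unit is always followed by a single space, and has_unit is a name made up for this illustration.

```r
df <- data.frame(
  str = c("cups vegetable soup", "cup brown lentils", "carrot", "stalks celery"),
  unit = c("cups", "cup", NA, "stalks"),
  stringsAsFactors = FALSE
)

# TRUE only where unit is present and str begins with "<unit> "
# (has_unit is a hypothetical name; the single-space separator is an assumption)
has_unit <- !is.na(df$unit) & startsWith(df$str, paste0(df$unit, " "))

# Keep everything after the unit and the space that follows it
df$str[has_unit] <- substring(df$str[has_unit], nchar(df$unit[has_unit]) + 2)

df
```

Because the comparison is vectorized, each str is only checked against the unit in its own row.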
    

    Another possible answer without changing (much) your original code:

    
    library(DescTools)
    
    df <-
      data.frame(
        str = c(
          "cups vegetable soup",
          "cup brown lentils",
          "carrot",
          "stalks celery"
        ),
        unit = c("cups", "cup", NA, "stalks"),
        stringsAsFactors = FALSE
      )
    
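    DelFunction <- function(x, y) {
      tokens_x <- x
      # y is the whole df$unit vector: sapply() passes it unchanged on every call.
      # In DescTools' %like%, a trailing "%" is a wildcard, so "cups%" matches
      # strings that start with "cups".
      tokens_y <- paste0(y, "%")
    
      if ((tokens_x %like% tokens_y) == TRUE) {
        # Blank out the first run of lowercase letters (the leading word)
        regmatches(tokens_x, regexpr("[a-z]+", tokens_x)) <- ""
      }
      # Remove the space left behind by the deleted word
      trimws(tokens_x)
    }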
    
    df$str <- sapply(df$str, DelFunction, df$unit)
    df
    #>              str   unit
    #> 1 vegetable soup   cups
    #> 2  brown lentils    cup
    #> 3         carrot   <NA>
    #> 4         celery stalks
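
As for why the original version only changed the first row: sapply(df$str, DelFunction, df$unit) iterates over df$str but hands the entire df$unit vector to every call as y, so y[[1]] is always "cups". A minimal sketch of a row-wise pairing with mapply from base R follows. The helper name drop_leading_unit is made up for this illustration, and the sketch assumes words are separated by single spaces.

```r
df <- data.frame(
  str = c("cups vegetable soup", "cup brown lentils", "carrot", "stalks celery"),
  unit = c("cups", "cup", NA, "stalks"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the first word of x only when it equals y
drop_leading_unit <- function(x, y) {
  if (is.na(y)) {
    return(x)                                  # no unit on this row: keep x
  }
  words <- strsplit(x, " ", fixed = TRUE)[[1]] # assumes single-space separators
  if (isTRUE(words[1] == y)) {
    paste(words[-1], collapse = " ")           # rebuild the string without word 1
  } else {
    x
  }
}

# mapply() walks both columns in parallel, so each str meets its own unit
df$str <- mapply(drop_leading_unit, df$str, df$unit, USE.NAMES = FALSE)

df
```

Unlike a prefix match, this compares whole words, so a unit of "cup" would not remove anything from a string that starts with "cups".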