
R - partial string matching for new variable


I have a fairly large dataset with two text variables, A and B, where length(A) <= length(B). B is either A with some extra characters added (in no particular order) or something completely different from A. I need to create a new variable in my data table under this condition: if B contains A, then C = TRUE. I believe partial string matching suits this better than ordinary string comparison.

My dataframe example:

Home      Pick  
Barc      Barcelona 0  
F Munch   FC munchen   
Lakers    Portland

I need to add new variable Side in this way:

Home     Pick         Side    
Barc     Barcelona 0  True  
F Munch  FC munchen   True  
Lakers   Portland     False  

I am trying to solve it with this:

data_n$Side <- stringMatch(data_n$Home, data_n$Pick, normalize = "YES")

but it gives all negative results.
However,

stringMatch('barcel', 'Barcelona 0', normalize='YES')    

gives the needed answer. Any hints on where I am making a mistake?


Solution

  • I'm not sure of its reliability, but agrepl, the approximate (fuzzy) pattern matching function, seems to work on your data. Assuming dat is your original data:

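    ## read in the original data; strip.white = TRUE drops the indentation
    ## inside the text block so it does not end up in the Home column
    > txt <- "Home\tPick
      Barc\tBarcelona 0
      F Munch\tFC munchen
      Lakers\tPortland"
    > dat <- read.table(text = txt, sep = '\t', header = TRUE,
                        strip.white = TRUE, stringsAsFactors = FALSE)
    > dat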
    ##      Home        Pick
    ## 1    Barc Barcelona 0
    ## 2 F Munch  FC munchen
    ## 3  Lakers    Portland
    

    Then apply agrepl row by row:

    > d1 <- dat[,1]
    > d2 <- dat[,2]
    ## agrepl takes a single pattern, so compare Home and Pick one row at a time
    > dat$Side <- sapply(seq_len(nrow(dat)), function(i){
          agrepl(d1[i], d2[i], ignore.case = TRUE)
          })
    > dat
    ##      Home        Pick  Side
    ## 1    Barc Barcelona 0  TRUE
    ## 2 F Munch  FC munchen  TRUE
    ## 3  Lakers    Portland FALSE
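
    The same row-wise comparison can also be sketched with mapply, which pairs the two columns directly instead of indexing by row number. This is only an illustration, not part of the original answer: fuzzy_contains and its max_dist argument are hypothetical names, and max_dist is simply passed through to agrepl's max.distance (0.1, i.e. 10% of the pattern length, is agrepl's default). Raising it makes the match more tolerant but also more likely to flag unrelated strings.

```r
# Hypothetical helper (not from the original answer): TRUE where x[i]
# approximately contains pattern[i], ignoring case.
fuzzy_contains <- function(pattern, x, max_dist = 0.1) {
  # agrepl() accepts only one pattern per call, so pair the vectors with mapply()
  unname(mapply(
    function(p, s) agrepl(p, s, max.distance = max_dist, ignore.case = TRUE),
    as.character(pattern), as.character(x)
  ))
}

# Same sample data as in the question
dat <- data.frame(
  Home = c("Barc", "F Munch", "Lakers"),
  Pick = c("Barcelona 0", "FC munchen", "Portland"),
  stringsAsFactors = FALSE
)

dat$Side <- fuzzy_contains(dat$Home, dat$Pick)
print(dat)
```

    On the three sample rows this should reproduce the Side column shown above; on real data it is worth spot-checking a handful of rows before trusting the chosen distance.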