Tags: regex, r, regex-lookarounds, lookbehind, grepl

grepl in R: matching impeded by intra-word dashes


I have three words, x, y, and z, from which two compound words can be built: x-y and y-z.

In naturally occurring text, x, y, and z can follow each other. In the first case, I have:

text="x-y z"

I want to detect "x-y" but not "y z". If I do:

v=c("x-y","y z")
vv=paste("\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

I get c(TRUE,TRUE). In other words, grepl does not capture the fact that y is already linked to x via the intra-word dash, and that therefore, "y z" is not actually there in the text. So I use a lookbehind after adding whitespace at the beginning of the text:

text=paste("",text,sep=" ")
vv=paste("(?<= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

This time, I get what I want: c(TRUE, FALSE). Now, in the second case, I have:

text="x y-z"

and I want to detect "y-z" but not "x y". Adopting a symmetrical approach with a lookahead this time, I tried:

text=paste(text,"",sep=" ")
v=c("x y","y-z")
vv=paste("(?= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

But this time I get c(FALSE,FALSE) instead of c(FALSE,TRUE) as I was expecting. The FALSE in first position is expected (the lookahead detected the presence of the intra-word dash after y and prevented matching with "x y"). But I really do not understand what is preventing the matching with "y-z".

Thanks a lot in advance for your help.


Solution

  • I think this matches the description in your comment of what you want to accomplish.

    spaceInvader <- function(a, b, text) {
      # look ahead of `a` to see if there is a space
      hasa <- grepl(paste0(a, '(?= )'), text, perl = TRUE)
      # look behind `b` to see if there is a space 
      hasb <- grepl(paste0('(?<= )', b), text, perl = TRUE)
    
      result <- c(hasa, hasb)
      names(result) <- c(a, b)
      cat('In: "', text, '"\n', sep = '')
      return(result)
    }
    
    spaceInvader('x-y', 'y z', 'x-y z')
    # In: "x-y z"
    #   x-y   y z 
    #  TRUE FALSE 
    spaceInvader('x y', 'y-z', 'x y-z')
    # In: "x y-z"
    #   x y   y-z 
    # FALSE  TRUE 
    spaceInvader('x-y', 'y z', 'x y-z')
    # In: "x y-z"
    #   x-y   y z 
    # FALSE FALSE 
    spaceInvader('x y', 'y-z', 'x-y z')
    # In: "x-y z"
    #   x y   y-z 
    # FALSE FALSE 
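
    As for why the lookahead attempt in the question returns c(FALSE, FALSE): `(?= )` is zero-width, and it sits at the start of the pattern. It therefore asserts that the next character is a space at the very position where the first letter of the term has to match, and no position can satisfy both. Placing the lookahead after the term, mirroring the lookbehind placed before it, appears to give the expected result. A minimal sketch (the names `leading` and `trailing` are mine, not from the question):

    ```r
    # Same padded text as in the question: "x y-z "
    text <- paste("x y-z", "", sep = " ")
    v <- c("x y", "y-z")

    # Lookahead at the start: requires the next character to be a space
    # and, at the same position, the first letter of the term.
    leading <- paste0("(?= )\\b", v, "\\b")

    # Lookahead at the end: requires a space right after the term.
    trailing <- paste0("\\b", v, "\\b(?= )")

    res_leading  <- vapply(leading,  grepl, logical(1), x = text,
                           perl = TRUE, USE.NAMES = FALSE)
    res_trailing <- vapply(trailing, grepl, logical(1), x = text,
                           perl = TRUE, USE.NAMES = FALSE)

    print(res_leading)   # [1] FALSE FALSE
    print(res_trailing)  # [1] FALSE  TRUE
    ```

    With the trailing lookahead, "x y" is rejected because it is followed by "-", while "y-z" is accepted because it is followed by the padding space.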
    

    Is this a problem? With a fully hyphenated string, neither compound is detected:

    spaceInvader('x-y', 'y-z', 'x-y-z')
    # In: "x-y-z"
    #   x-y   y-z 
    # FALSE FALSE
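
    If that behaviour is acceptable, one possible generalisation is to check both sides of every term at once. `spaceInvader` looks for a space after `a` and before `b` only, so a term at the very start or end of the text depends on padding. The negative lookarounds `(?<!\\S)` and `(?!\\S)` require that no non-whitespace character touches the term, and they also succeed at the string boundaries, so no padding is needed. This is only a sketch: `detect_terms` is a hypothetical helper, and it assumes the terms contain no regex metacharacters and that tokens are separated by whitespace only (punctuation glued to a term would block the match).

    ```r
    # Hypothetical helper: TRUE for each term that occurs in `text`
    # with whitespace (or the string boundary) on both sides.
    detect_terms <- function(text, terms) {
      # Assumption: terms contain no regex metacharacters.
      patterns <- paste0("(?<!\\S)", terms, "(?!\\S)")
      hits <- vapply(patterns, grepl, logical(1), x = text,
                     perl = TRUE, USE.NAMES = FALSE)
      names(hits) <- terms
      hits
    }

    case1 <- detect_terms("x-y z", c("x-y", "y z"))   # x-y: TRUE,  y z: FALSE
    case2 <- detect_terms("x y-z", c("x y", "y-z"))   # x y: FALSE, y-z: TRUE
    case3 <- detect_terms("x-y-z", c("x-y", "y-z"))   # both FALSE, as above
    ```

    The order of the terms no longer matters here, since every term gets the same two-sided check.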