
How can one take a subset of words that do not have letters in specific locations?


Having trouble getting my head around this one, but I sense the answer uses stringr::str_subset.

Here's an example of what I'm trying to achieve:

word_list <- c("amber", "flora", "glide", "quake", "slant")
word_neg <- "aside"
word_list_pruned <- some_function(word_list, word_neg)

> word_list_pruned
> c("flora", "slant")

I want to take a list of words, word_list, and a word, word_neg (here, "aside"), and remove every word in word_list that has a letter in the same position as in word_neg.

Any ideas?


Solution

  • One option would be to use a regex approach. Given the negative word aside, we can build the following regex alternation:

    ^(?:a....|.s...|..i..|...d.|....e)$
    

    Any word which does not match this alternation shares no letter position with the negative word and should be retained.

    word_list <- c("amber", "flora", "glide", "quake", "slant")
    word_neg <- "aside"
    
    patterns <- sapply(seq_len(nchar(word_neg)), function(x) {
        paste0(strrep(".", x - 1), substr(word_neg, x, x), strrep(".", nchar(word_neg) - x))
    })
    pattern <- paste0("^(?:", paste(patterns, collapse="|"), ")$")
    word_list_pruned <- word_list[!grepl(pattern, word_list)]
    word_list_pruned
    
    [1] "flora" "slant"
    

    The string manipulation inside the call to sapply() generates the regex alternation. For each position, we start off with ....., and then put back the one letter of the negative input word that sits at that position.
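
Since the question asks for a `some_function(word_list, word_neg)`, the approach above could be wrapped up roughly as follows. This is only a sketch: the name `prune_words` is made up for illustration, `perl = TRUE` is added so the `(?:...)` group is handled by PCRE, and it assumes the words contain plain letters only (no regex metacharacters).

```r
# Hypothetical helper: drop every word that shares a letter, at the same
# position, with word_neg. Assumes words are plain letters only.
prune_words <- function(word_list, word_neg) {
  n <- nchar(word_neg)
  # One sub-pattern per position: the letter at position i, padded with "."
  patterns <- vapply(seq_len(n), function(i) {
    paste0(strrep(".", i - 1), substr(word_neg, i, i), strrep(".", n - i))
  }, character(1))
  # Join the sub-patterns into a single anchored alternation
  pattern <- paste0("^(?:", paste(patterns, collapse = "|"), ")$")
  # Keep only the words that do NOT match the alternation
  word_list[!grepl(pattern, word_list, perl = TRUE)]
}

word_list <- c("amber", "flora", "glide", "quake", "slant")
word_list_pruned <- prune_words(word_list, "aside")
word_list_pruned
```

For the example data this returns "flora" and "slant". Because the pattern is anchored and every alternative has exactly as many characters as word_neg, words of a different length never match and are therefore kept. If you prefer stringr, the final filtering step should be expressible as `stringr::str_subset(word_list, pattern, negate = TRUE)`.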