
How do I run a matrix regex or grep on the outer 'product' of two string vectors in R without a nested sapply?


Let's say I have a vector of strings, and a second vector of standard words that I'm interested in finding inside those strings. For example:

 a = c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
 b = c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

I want to get back a TRUE/FALSE matrix showing which of the standard substrings in b occur in each string of a, ideally with dimensions length(a) x length(b). What I naively thought would work is:

 outer(a, b, grepl)

I know that I could create a function that does a nested sapply, e.g.

 sapply(a, function(x) sapply(b, function(y) grepl(y,x)))

...but I feel like R should have something simpler that is related to the outer command. mapply feels stupid because I'd have to rep() the inputs and wrap the output back into a matrix.


Solution

  • I am not sure you need to nest your sapply() statements. grepl() is already vectorised over the strings being searched, so a single sapply() over the patterns in b is enough:

    sapply(b, \(x) grepl(x, a))
    #      aspirin acetaminophen morphine ibuprofen warfarin
    # [1,]    TRUE         FALSE    FALSE     FALSE    FALSE
    # [2,]   FALSE         FALSE    FALSE      TRUE    FALSE
    # [3,]   FALSE         FALSE    FALSE     FALSE    FALSE
    # [4,]   FALSE         FALSE    FALSE     FALSE    FALSE
    

    Admittedly, it is then a little cumbersome to show which string each row corresponds to:

    sapply(b, \(x) grepl(x, a))  |>
        data.frame()  |>
        cbind(a)
    #   aspirin acetaminophen morphine ibuprofen warfarin                   a
    # 1    TRUE         FALSE    FALSE     FALSE    FALSE        aspirin 20mg
    # 2   FALSE         FALSE    FALSE      TRUE    FALSE     ibuprofen 200mg
    # 3   FALSE         FALSE    FALSE     FALSE    FALSE diclofenac 50mg x 2
    # 4   FALSE         FALSE    FALSE     FALSE    FALSE phenobarbital 100mg
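
    If a logical matrix is preferred over a data frame, the rows can be labelled directly. This is only a minimal base-R sketch: the name hits is made up for illustration, and fixed = TRUE is an assumption that the entries of b should be matched as literal text rather than as regular expressions.

    ```r
    a <- c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
    b <- c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

    # One grepl() call per pattern in b; each call is vectorised over a.
    # fixed = TRUE (an assumption) treats the patterns as literal strings.
    hits <- sapply(b, grepl, x = a, fixed = TRUE)

    # sapply() already names the columns after b; label the rows with a.
    rownames(hits) <- a
    hits
    ```

    The result stays a length(a) x length(b) logical matrix, so it can be indexed by name, e.g. hits["ibuprofen 200mg", "ibuprofen"].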
    

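    However, I like the idea of using outer(). It needs a function that is vectorised over both arguments, such as stringi::stri_count_fixed, and setNames() supplies the row and column labels. Note that this returns match counts rather than TRUE/FALSE: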

    outer(
        setNames(a, a),
        setNames(b, b),
        stringi::stri_count_fixed
    )
    #                     aspirin acetaminophen morphine ibuprofen warfarin
    # aspirin 20mg              1             0        0         0        0
    # ibuprofen 200mg           0             0        0         1        0
    # diclofenac 50mg x 2       0             0        0         0        0
    # phenobarbital 100mg       0             0        0         0        0
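
    As for why the plain outer(a, b, grepl) from the question does not work: outer() calls the function once on the fully expanded vectors, whereas grepl() is not vectorised over its pattern argument and also expects the pattern first. A base-R sketch that stays close to the original idea wraps grepl() in Vectorize(). The names contains and m are hypothetical, and fixed = TRUE is my assumption that literal matching is wanted.

    ```r
    a <- c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
    b <- c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

    # Hypothetical helper: does string x contain the literal substring y?
    # Vectorize() makes it loop over pairs of elements, which is what outer() needs.
    contains <- Vectorize(function(x, y) grepl(y, x, fixed = TRUE))

    # setNames() gives outer() the labels it uses for the dimnames.
    m <- outer(setNames(a, a), setNames(b, b), contains)
    m
    ```

    This is not faster than the single sapply() above, since Vectorize() still loops over every pair, but it returns the length(a) x length(b) logical matrix directly. With the stringi version, comparing the counts with > 0 should give the same logical matrix.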