Tags: r, list, optimization, indexing, vectorization

Fast way of getting index of match in list


Given a list a containing vectors of unequal length and a vector b containing some elements from the vectors in a, I want a vector of the same length as b that gives, for each element of b, the index of the vector in a that contains it (this is a bad explanation, I know).

The following code does the job:

a <- list(1:3, 4:5, 6:9)
b <- c(2, 3, 5, 8)

sapply(b, function(x, list) which(unlist(lapply(list, function(y, z) z %in% y, z=x))), list=a)
[1] 1 1 2 3

Replacing the sapply with a for loop achieves the same result, of course.

The problem is that this code will be used with lists and vectors with a length above 1000. On a real-life data set the function takes around 15 seconds (with both the for loop and the sapply).

Does anyone have an idea how to speed this up, save for a parallel approach? I have failed to find a vectorized approach (and I cannot program in C, though that would probably be the fastest).

Edit:

I just want to emphasize Aaron's elegant solution using match(), which gave a speedup on the order of 1667 times (from 15 seconds to 0.009 seconds).
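A comparison like this can be reproduced roughly as follows. This is only a sketch on synthetic data: the sizes, the seed, and the helper names lookup_nested and lookup_match are assumptions made for illustration, not the real data set or code behind the timings quoted here, so the absolute numbers will differ.

```r
# Synthetic data, assumed for illustration only (not the asker's real set):
# 1000 disjoint integer vectors of 1 to 10 elements each.
set.seed(42)
n <- 1000
sizes <- sample(1:10, n, replace = TRUE)
vals <- sample(sum(sizes))                        # unique values, shuffled
a <- unname(split(vals, rep(seq_len(n), sizes)))  # list of n vectors
b <- sample(vals, 500)                            # values to look up

# Nested-apply lookup, same idea as the original code (hypothetical helper name).
lookup_nested <- function(a, b) {
  sapply(b, function(x) which(vapply(a, function(y) x %in% y, logical(1))))
}

# match()-based lookup (hypothetical helper name).
lookup_match <- function(a, b) {
  g <- rep(seq_along(a), lengths(a))  # list index of every element of unlist(a)
  g[match(b, unlist(a))]
}

# Both must agree before the timings mean anything.
stopifnot(identical(lookup_nested(a, b), lookup_match(a, b)))

system.time(lookup_nested(a, b))
system.time(lookup_match(a, b))
```

The nested version calls %in% once per pair of list element and lookup value, while the match() version flattens the list once and does a single vectorized lookup, which is where the difference comes from.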

I expanded on it a bit to allow multiple matches (the return value is then a list):

a <- list(1:3, 3:5, 3:7)
b <- c(3, 5)
g <- rep(seq_along(a), sapply(a, length))
sapply(b, function(x) g[which(unlist(a) %in% x)])
[[1]]
[1] 1 2 3

[[2]]
[1] 2 3

The runtime for this was 0.169 seconds, which is considerably slower, but on the other hand it is more flexible.
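One caveat with the multiple-match snippet: sapply() simplifies its result when every element has the same length, so the return type depends on the data. A minimal sketch of a variant that always returns a list, using lapply() and flattening a only once (match_all is a hypothetical helper name, not part of the original code):

```r
a <- list(1:3, 3:5, 3:7)
g <- rep(seq_along(a), lengths(a))  # list index of each element of unlist(a)
aa <- unlist(a)                     # flatten once, outside the loop

# Hypothetical helper: always returns a list, one element per entry of b.
match_all <- function(b) lapply(b, function(x) g[aa == x])

res1 <- match_all(c(3, 5))  # 3 occurs in vectors 1, 2, 3; 5 occurs in vectors 2, 3
res2 <- match_all(c(4, 5))  # both occur in vectors 2 and 3; sapply() would give a matrix here
```

Hoisting unlist(a) out of the anonymous function also avoids rebuilding the flattened vector for every element of b.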


Solution

  • Here's one possibility using match:

    a <- list(1:3, 4:5, 6:9)
    b <- c(2, 3, 5, 8)
    g <- rep(seq_along(a), sapply(a, length))
    g[match(b, unlist(a))]
    #> [1] 1 1 2 3
    

    findInterval is another option:

    findInterval(match(b, unlist(a)), cumsum(c(0, sapply(a, length))) + 1)
    #> [1] 1 1 2 3
    

    For returning a list, try this:

    a <- list(1:3, 4:5, 5:9)
    b <- c(2, 3, 5, 8, 5)
    g <- rep(seq_along(a), sapply(a, length))
    aa <- unlist(a)
    au <- unique(aa)
    af <- factor(aa, levels = au)
    gg <- split(g, af)
    gg[match(b, au)]
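To make the list-returning version easier to follow, here is the same computation with the intermediate values written out as comments. This is a sketch: it uses lengths(a) as a shorthand for sapply(a, length), and the unname() call and the name res are additions for illustration.

```r
a <- list(1:3, 4:5, 5:9)
b <- c(2, 3, 5, 8, 5)

g  <- rep(seq_along(a), lengths(a))  # 1 1 1 2 2 3 3 3 3 3
aa <- unlist(a)                      # 1 2 3 4 5 5 6 7 8 9
au <- unique(aa)                     # 1 2 3 4 5 6 7 8 9

# One list element per distinct value, holding every list index it occurs in.
gg <- split(g, factor(aa, levels = au))

# Look up each element of b; unname() drops the value names for a plain list.
res <- unname(gg[match(b, au)])
# res[[3]] and res[[5]] are both c(2L, 3L), because 5 occurs in a[[2]] and a[[3]]
```

The split is done once over the distinct values, so repeated values in b (such as the two 5s here) reuse the same precomputed entry.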