r, loops, equals-operator

Vectorizing a for loop of which() equality matches in R


I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.

Below is the very inefficient for loop I currently use. Note that X has many elements, and each element of "match.ind" can hold zero or more matches. Also, dat.fram has over 1 million rows. Is there a vectorized function in R that would make this process more efficient?

Ultimately, I need a list, since I will pass it to another function that retrieves the appropriate values from dat.fram.

Code:

match.ind=list()

for(i in 1:150000){
    match.ind[[i]]=which(dat.fram[,3]==X[i])
}

Solution

  • UPDATE:

    Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!

    ### define v as a sample column of data - you should define v to be 
    ### the column in the data frame you mentioned (dat.fram[,3]) 
    
    v = sample(1:150000, 1500000, replace=TRUE)
    
    ### now here's the trick: collect the indices for each possible value of v
    ### to form mybiglist - the names of mybiglist give you the possible values
    ### of v, and the entries of mybiglist give you the index points
    
    mybiglist = tapply(seq_along(v),v,c)
    
    ### now you just want the parts of this that intersect with X... again I'll
    ### generate a random X but use whatever X you need to
    
    X = sample(1:200000, 150000)
    mylist = mybiglist[names(mybiglist) %in% X]
    

    And that's it! As a check, let's look at the first 3 elements of mylist:

    > mylist[1:3]
    
    $`1`
    [1]  401143  494448  703954  757808 1364904 1485811
    
    $`2`
    [1]  230769  332970  389601  582724  804046  997184 1080412 1169588 1310105
    
    $`4`
    [1]  149021  282361  289661  456147  774672  944760  969734 1043875 1226377
    

    There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the numbers listed against 4 are the index points in v where 4 appears:

    > which(X==3)
    integer(0)
    
    > which(v==3)
    [1]  102194  424873  468660  593570  713547  769309  786156  828021  870796  883932 1036943 1246745 1381907 1437148
    
    > which(v==4)
    [1]  149021  282361  289661  456147  774672  944760  969734 1043875 1226377
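
The same kind of spot check can be automated on a smaller sample. This is only a sketch: the names v_small, X_small, big_small and my_small, the sizes and the seed are all made up for illustration, and simplify = FALSE is added as a guard so that tapply always returns list entries.

```r
# Hypothetical small-scale check; sizes and seed are arbitrary choices.
set.seed(1)
v_small <- sample(1:50, 500, replace = TRUE)  # stands in for dat.fram[, 3]
X_small <- sample(1:80, 30)                   # some values never occur in v_small

# Same construction as above, on the smaller data.
big_small <- tapply(seq_along(v_small), v_small, c, simplify = FALSE)
my_small  <- big_small[names(big_small) %in% X_small]

# Reference: one which() call per matched value, as in the original loop.
reference <- lapply(as.integer(names(my_small)),
                    function(x) which(v_small == x))

# Compare entry by entry, since my_small is a 1-d array of mode list
# and carries attributes that a plain list does not.
same <- vapply(seq_along(my_small),
               function(i) identical(my_small[[i]], reference[[i]]),
               logical(1))
all(same)
```

If all(same) is TRUE, every entry holds exactly the indices that which() would have returned for that value.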
    

    Finally, note that values that appear in X but not in v get no entry in the list at all. That is presumably what you want anyway, since there are no matching indices to return for them!

    Extra note: You can use the code below to create an NA entry for each member of X not in v...

    blanks = sort(setdiff(X,names(mylist)))
    mylist_extras = rep(list(NA),length(blanks))
    names(mylist_extras) = blanks
    mylist_all = c(mylist,mylist_extras)
    mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
    

    Fairly self-explanatory: mylist_extras is a list holding the extra entries you need (its names are the values of X that do not appear in names(mylist), and each entry is simply NA). The final two lines first merge mylist and mylist_extras, then reorder the result so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
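
If what you need is the same shape the original loop produced (one entry per element of X, in the order of X, with an empty result where nothing matches), a variant using split() might be simpler than padding and reordering. This is a sketch of a swapped-in technique, not the method above: v_demo, X_demo, idx_by_value and match_ind are hypothetical names, and it assumes the values are integers (or strings), because the lookup goes through as.character().

```r
# Assumed stand-ins for dat.fram[, 3] and X; sizes and seed are arbitrary.
set.seed(42)
v_demo <- sample(1:20, 200, replace = TRUE)
X_demo <- c(5L, 25L, 3L, 5L)  # 25 never occurs in v_demo; 5 is repeated

# split() groups the row indices by value and returns a plain named list.
idx_by_value <- split(seq_along(v_demo), v_demo)

# Look up each element of X by name, keeping the order of X.
# Values absent from v_demo come back as NULL entries.
match_ind <- unname(idx_by_value[as.character(X_demo)])

# Replace the NULLs with integer(0), which is what which() returns
# when nothing matches.
match_ind[lengths(match_ind) == 0] <- list(integer(0))
```

The data is scanned once by split(), and each element of X then costs only a lookup by name, so repeated values in X are handled without extra passes over v_demo.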

    Cheers! :)


    ORIGINAL POST BELOW... superseded by the above, obviously!

    Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:

    X = 3:7
    n = 100
    d = data.frame(a = sample(1:10,n,replace=TRUE), b = sample(1:10,n,replace=TRUE), 
                   c = sample(1:10,n,replace=TRUE), stringsAsFactors = FALSE)
    
    tapply(X,X,function(x) {which(d[,3]==x)})
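
To see what this toy call computes, it can be compared with a plain lapply() over X. The sketch below uses hypothetical names (X_toy, d_toy, by_tapply, by_lapply) and an arbitrary seed. It assumes X holds unique, sorted values: tapply groups X by its own values, so duplicates would be merged into one group and unsorted values would be reordered.

```r
set.seed(7)  # arbitrary seed, only for reproducibility
X_toy <- 3:7
d_toy <- data.frame(a = sample(1:10, 100, replace = TRUE),
                    b = sample(1:10, 100, replace = TRUE),
                    c = sample(1:10, 100, replace = TRUE))

# The toy call from above; simplify = FALSE keeps list entries
# even if every value happened to match exactly one row.
by_tapply <- tapply(X_toy, X_toy,
                    function(x) which(d_toy[, 3] == x),
                    simplify = FALSE)

# The same work written as lapply(): one which() scan per value of X.
by_lapply <- lapply(X_toy, function(x) which(d_toy[, 3] == x))

# Entry-by-entry comparison (by_tapply is a 1-d array of mode list).
same <- vapply(seq_along(X_toy),
               function(i) identical(by_tapply[[i]], by_lapply[[i]]),
               logical(1))
all(same)
```

Both versions scan column 3 once per value of X, which is one reason the index-grouping approach in the update scales better for large X.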