Replace apply function with lapply


I am building a dataset that holds aggregate counts for different combinations of words, matched using regex. Each row has a unique regex pattern, and I want to check it against another dataset and count the number of times it matches there.

The first dataset (df1) looks like this :

   word1    word2               pattern
   air      10     (^|\\s)air(\\s.*)?\\s10($|\\s)
 airport    20   (^|\\s)airport(\\s.*)?\\s20($|\\s)
   car      30     (^|\\s)car(\\s.*)?\\s30($|\\s)

The other dataset (df2) from which I want to match this looks like

   sl_no    query
   1      air 10     
   2    airport 20   
   3    airport 20
   3    airport 20
   3      car 30

The final output I want should look like this:

   word1    word2   total_occ
   air      10      1
   airport  20      3
   car      30      1

I am able to do this by using apply in R

process <- 
function(x) 
{
  length(grep(x[["pattern"]], df2$query))
}           

df1$total_occ=apply(df1,1,process)

but this is slow because my dataset is pretty big.

I found out that the "mclapply" function of the "parallel" package can be used to run such things on multiple cores, so I am trying to get it working with lapply first. It gives me the following error:

lapply(df,process)

Error in x[, "pattern"] : incorrect number of dimensions

Please let me know what changes I should make to run lapply correctly.


Solution

  • Why not just lapply() over the pattern?

    Here I've just pulled out your patterns, but this could just as easily be df1$pattern

    pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
                 "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
                 "(^|\\s)car(\\s.*)?\\s30($|\\s)")
    

    Using your data for df2

    txt <- "sl_no    query
       1      'air 10'     
       2    'airport 20'   
       3    'airport 20'
       3    'airport 20'
       3      'car 30'"
    df2 <- read.table(text = txt, header = TRUE)
    

    Just iterate on pattern directly

    > lapply(pattern, grep, x = df2$query)
    [[1]]
    [1] 1
    
    [[2]]
    [1] 2 3 4
    
    [[3]]
    [1] 5
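
    For context on the original error: lapply() on a data frame walks over its columns, not its rows, so the function receives a plain vector each time. The sketch below illustrates this on a df1 rebuilt by hand from the table in the question (an assumption, since the real data isn't shown), and assumes, going by the error message, that process indexed its argument with x[, "pattern"].

    ```r
    # df1 rebuilt by hand from the question's table (assumed layout).
    df1 <- data.frame(word1 = c("air", "airport", "car"),
                      word2 = c(10, 20, 30),
                      pattern = c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
                                  "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
                                  "(^|\\s)car(\\s.*)?\\s30($|\\s)"),
                      stringsAsFactors = FALSE)

    # lapply() visits one column at a time; the result is named by column.
    col_names <- names(lapply(df1, class))

    # A column is a plain vector, so two-subscript indexing fails on it.
    err <- tryCatch(df1$word1[, "pattern"],
                    error = function(e) conditionMessage(e))
    ```

    Here col_names holds the three column names, and err captures the message raised when a single column is indexed as if it were a row of the data frame.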
    

    If you want more compact output, as suggested in your question, you'll need to run lengths() over the returned list. (Thanks to @Frank for pointing out the new function lengths().) E.g.

    lengths(lapply(pattern, grep, x = df2$query))
    

    which gives

    > lengths(lapply(pattern, grep, x = df2$query))
    [1] 1 3 1
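
    As a possible alternative, the counts can be produced in one step with vapply() and grepl(), which skips the intermediate list and guarantees an integer vector. This is only a sketch: count_matches is a hypothetical helper name, and query is typed in by hand to stand in for df2$query.

    ```r
    pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
                 "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
                 "(^|\\s)car(\\s.*)?\\s30($|\\s)")

    # Stand-in for df2$query (assumed values, copied from the question).
    query <- c("air 10", "airport 20", "airport 20", "airport 20", "car 30")

    # Hypothetical helper: number of elements of x that match pattern p.
    count_matches <- function(p, x) sum(grepl(p, x))

    # vapply() checks that every result is a single integer.
    counts <- vapply(pattern, count_matches, integer(1),
                     x = query, USE.NAMES = FALSE)
    ```

    With this toy data, counts should agree with the lengths() result above.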
    

    You can add this to the original data via

    dfnew <- cbind(df1[, 1:2],
                   Count = lengths(lapply(pattern, grep, x = df2$query)))
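
    Since the goal in the question was to move on to mclapply(), the same call shape should carry over, because mclapply() takes the vector and the function in the same positions as lapply(). A minimal sketch, assuming a hand-typed query in place of df2$query and an arbitrary choice of 2 cores (forking is not available on Windows, so mc.cores must be 1 there):

    ```r
    library(parallel)

    pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
                 "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
                 "(^|\\s)car(\\s.*)?\\s30($|\\s)")

    # Stand-in for df2$query (assumed values, copied from the question).
    query <- c("air 10", "airport 20", "airport 20", "airport 20", "car 30")

    # Assumed core count; Windows cannot fork, so fall back to 1 there.
    n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

    # One grep() per pattern, spread across the worker processes.
    matches <- mclapply(pattern, grep, x = query, mc.cores = n_cores)
    counts <- lengths(matches)
    ```

    Whether this is actually faster depends on how many patterns there are and how large df2 is; for a handful of patterns the forking overhead will likely outweigh any gain.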