I am building a dataset of aggregate counts for different combinations of words using regex. Each row holds a unique regex pattern, and I want to count how many entries in a second dataset match it.
The first dataset (df1) looks like this:
word1 word2 pattern
air 10 (^|\\s)air(\\s.*)?\\s10($|\\s)
airport 20 (^|\\s)airport(\\s.*)?\\s20($|\\s)
car 30 (^|\\s)car(\\s.*)?\\s30($|\\s)
The second dataset (df2), which I want to match the patterns against, looks like this:
sl_no query
1 air 10
2 airport 20
3 airport 20
3 airport 20
3 car 30
The final output I want should look like this:
word1 word2 total_occ
air 10 1
airport 20 3
car 30 1
I am able to do this by using apply in R
process <- function(x) {
  length(grep(x[["pattern"]], df2$query))
}
df1$total_occ <- apply(df1, 1, process)
but this is slow because my dataset is large.
I found that the mclapply function from the parallel package can run this kind of task on multiple cores, so I am trying to get it working with lapply first. It gives me this error:
lapply(df,process)
Error in x[, "pattern"] : incorrect number of dimensions
Please let me know what changes I should make to run lapply correctly.
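On the error itself: lapply() treats a data frame as a list of columns, so the function is called once per column rather than once per row, and each column is a plain vector that cannot be indexed with [, "pattern"]. The following is a minimal sketch of that behaviour; df1 is rebuilt here from the table in the question, and column_classes and res are names made up for the illustration.

```r
# df1 rebuilt from the question's table (an assumption about how it was created)
df1 <- data.frame(
  word1 = c("air", "airport", "car"),
  word2 = c(10, 20, 30),
  pattern = c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
              "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
              "(^|\\s)car(\\s.*)?\\s30($|\\s)"),
  stringsAsFactors = FALSE
)

# lapply() on a data frame visits each column in turn, not each row
column_classes <- lapply(df1, class)
names(column_classes)

# Each column is a one-dimensional vector, so two-dimensional indexing fails.
# try() captures the error instead of stopping the script.
res <- try(lapply(df1, function(x) x[, "pattern"]), silent = TRUE)
inherits(res, "try-error")
```

apply(df1, 1, process) works because it passes one row at a time, whereas lapply() needs to be given the thing to iterate over, here the pattern column.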
Why not just lapply() over the pattern?
Here I've just pulled out your pattern, but this could just as easily be df1$pattern:
pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
"(^|\\s)airport(\\s.*)?\\s20($|\\s)",
"(^|\\s)car(\\s.*)?\\s30($|\\s)")
Using your data for df2
txt <- "sl_no query
1 'air 10'
2 'airport 20'
3 'airport 20'
3 'airport 20'
3 'car 30'"
df2 <- read.table(text = txt, header = TRUE)
Just iterate over pattern directly:
> lapply(pattern, grep, x = df2$query)
[[1]]
[1] 1
[[2]]
[1] 2 3 4
[[3]]
[1] 5
If you want more compact output, as suggested in your question, you'll need to run lengths() over the output returned (thanks to @Frank for pointing out the new function lengths()). E.g.
lengths(lapply(pattern, grep, x = df2$query))
which gives
> lengths(lapply(pattern, grep, x = df2$query))
[1] 1 3 1
You can add this to the original data via
dfnew <- cbind(df1[, 1:2],
Count = lengths(lapply(pattern, grep, x = df2$query)))
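Since the question mentions mclapply, here is a rough sketch of how the same per-pattern iteration might be moved to multiple cores. It assumes a Unix-alike system (mclapply relies on forking, so on Windows it only runs with one core), and the choice of two cores is arbitrary. The names query, n_cores and counts are made up for the illustration, and sum(grepl()) is used in place of lengths() over grep() only to keep each worker's result a single number.

```r
library(parallel)

pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
             "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
             "(^|\\s)car(\\s.*)?\\s30($|\\s)")

# Stand-in for df2$query from the example above
query <- c("air 10", "airport 20", "airport 20", "airport 20", "car 30")

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per pattern: count how many queries match it
counts <- unlist(mclapply(pattern,
                          function(p) sum(grepl(p, query)),
                          mc.cores = n_cores))
counts
```

Whether this is actually faster depends on the number of patterns and the size of the query column; for small inputs the overhead of forking can outweigh the gain, so it is worth timing both versions on a sample of the real data.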