Tags: r, nlp, sapply, parallel.foreach, lemmatization

Use sapply/lapply or foreach to access data attributes in R


This could be a very basic question, but honestly, I tried several solutions from similar questions and could not get any of them to work on my data. It could be because of my data, or I am just having a hard day and can't figure it out. :(

I have a vector of sentences

vec = c("having many items", "have an apple", "item")

Also, I have a data frame to lemmatize the data

lem = data.frame(pattern = c("(items)|(item)", "(has)|(have)|(having)|(had)"), replacement = c("item", "have"))
lem$pattern = as.character(lem$pattern)
lem$replacement = as.character(lem$replacement)

I want to go through each row of the lem data frame and apply its pattern/replacement pair to vec.

Option 1:

library(stringr) # said to be quicker than gsub, and my data has 3 million sentences
vec <- sapply(lem, function(x) str_replace_all(vec, pattern=x$pattern, replacement = x$replacement))

Error in x$pattern : $ operator is invalid for atomic vectors 

Option 2:

library(doPar)
vec <- foreach(i = 1:nrow(lem)) %dopar% {
  str_replace_all(vec, pattern = lem[i, "pattern"], replacement = lem[i, "replacement"])
}

Option 2 returns a list of 2 vectors: the first one is what I want, but the second one is the original, and I don't know why. Also, when I tested on my machine, doPar (despite running in parallel) was not as fast as sapply.

Since my data is quite big (3 million sentences), could somebody recommend an efficient method to lemmatize the text data?


Solution

  • Another option is to store the patterns and replacements in a named vector instead of a data frame (the names are the patterns, the values are the replacements) and pass it to str_replace_all directly, like this:

    library(stringr)
    
    vec <- c("having many items", "has an apple", "items")
    
    lem <- c("item", "have")
    names(lem) <- c("(items)|(item)", "(has)|(have)|(having)|(had)")
    
    str_replace_all(vec, lem)
    
    ## "have many item" "have an apple"  "item"
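
    If you would rather keep the lem data frame, here is a minimal base R sketch of the same idea. It is only an illustration, not a benchmarked solution: the helper name lemmatize is made up for this example, and gsub is used in place of str_replace_all so that the snippet runs without extra packages. The point is that sapply(lem, ...) walks over the columns of the data frame, so each x is a plain character vector and x$pattern fails. Iterating over the row indices, and feeding each result into the next replacement, avoids that:

```r
vec <- c("having many items", "have an apple", "item")

lem <- data.frame(
  pattern     = c("(items)|(item)", "(has)|(have)|(having)|(had)"),
  replacement = c("item", "have"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply every rule in `rules` to `x`, one after another.
# Reduce() starts from `x` and passes the running result (`acc`) to the next
# rule, so the replacements accumulate instead of each starting from scratch.
lemmatize <- function(x, rules) {
  Reduce(
    function(acc, i) gsub(rules$pattern[i], rules$replacement[i], acc, perl = TRUE),
    seq_len(nrow(rules)),
    init = x
  )
}

lemmatize(vec, lem)
## "have many item" "have an apple"  "item"
```

    This also shows why the foreach version returns a list: every iteration applies one rule to the original vec on its own, so you get one separately modified copy per rule rather than a single vector with all rules applied.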