this could be a very basic question but honestly, I tried a few solutions on those similar questions but was unable to drive success on my data. It could be because of my data or I am having a hard day and couldn't figure out anything. :(
I have a vector of sentences
vec = c("having many items", "have an apple", "item")
Also, I have a data frame to lemmatize the data
lem = data.frame(pattern = c("(items)|(item)", "(has)|(have)|(having)|(had)"), replacement = c("item", "have"))
lem$pattern = as.character(lem$pattern)
lem$replacement = as.character(lem$replacement)
I want to go through each row in the lem
data frame to form a replacement command.
Option 1:
library(stringr) #this is said to be quicker than gsub and my data has 3 mil sentences
vec <- sapply(lem, function(x) str_replace_all(vec, pattern=x$pattern, replacement = x$replacement))
Error in x$pattern : $ operator is invalid for atomic vectors
Option 2:
library(doPar)
vec <- foreach(i = 1:nrow(lem)) %dopar% {
str_replace_all(vec, pattern = lem[i, "pattern"], replacement = lem[i, "replacement"])
}
Option 2 returns a list of 2 vectors: the first one is what I want, the second one is the original, which I don't know why. Also, I tested on my machine, doPar
(though using parallel programming) is not as fast as sapply
.
Since my data is quite big (3 mil sentences), could somebody recommend an effective method to lemmatize the text data?
Another option is to create a named vector from your pattern and replacement vectors instead of a data frame, and then use str_replace_all
directly, like this:
library(stringr)
vec <- c("having many items", "has an apple", "items")
lem <- c("item", "have")
names(lem) <- c("(items)|(item)", "(has)|(have)|(having)|(had)")
str_replace_all(vec, lem)
## "have many item" "have an apple" "item"