I have the following term-document matrix and data frame.
tdm <- c('Free', 'New', 'Limited', 'Offer')
Subject                                                           Free  New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                      0     0    0        0
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'  0     0    0        0
I want to derive the following dataframe as the output
Subject                                                           Free   New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                      1,2,3  8    NA       NA
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'  NA     3    12       1,13
I tried the following code and got a result, but it only gives me the character position of each word within the string. I need the position of the word in the sequence of words instead, as in Hello - 1, new - 2.
for (i in 1:length(tdm)) {
  word.locations <- gsub(")", "",
                         gsub("c(", "",
                              unlist(paste(gregexpr(pattern = tdm[i], DF$Subject))),
                              fixed = TRUE),
                         fixed = TRUE)
  df <- cbind(DF, word.locations)
}
colnames(DF) <- c("text", word)
Could someone please help?
Given the inputs:
tdm <- c('Free', 'New', 'Limited', 'Offer')
subject <- c("Free Free Free! Clear Cover with New Phone",
"Offer ! Buy New phone and get earphone at 1000. Limited Offer!")
I'd do something like:
sapply(tolower(tdm), function(x) {
  lapply(strsplit(tolower(subject), "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE),
         function(y) {
           y <- y[nzchar(y)]
           toString(grep(x, y))
         })
})
##      free      new limited offer
## [1,] "1, 2, 3" "8" ""      ""
## [2,] ""        "4" "12"    "1, 13"
What's going on:
tolower() on both the strings to match against and the terms being matched, so the comparison is case-insensitive.
strsplit() to split words and punctuation into separate items in a list element.
nzchar() to drop the empty strings that the split leaves behind.
grep() to find the location of the matches.
toString() to paste the locations together as a comma-separated string.
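
If a data frame with NA for missing terms is wanted, as in the question, the same steps could be wrapped up roughly as follows. This is only a sketch: word_positions() is a hypothetical helper name, not part of the code above, and it swaps grep() for an exact comparison with which() so that a term such as "new" would not also match a token like "renew".

```r
# Hypothetical helper (an assumption, not from the original answer):
# returns one column per term holding comma-separated word positions,
# or NA when the term does not occur in that subject.
word_positions <- function(subject, terms) {
  # Same tokenisation as above: split on whitespace and in front of
  # punctuation, then drop the empty strings strsplit() leaves behind.
  tokens <- lapply(
    strsplit(tolower(subject), "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE),
    function(y) y[nzchar(y)]
  )
  positions <- lapply(tolower(terms), function(term) {
    vapply(tokens, function(y) {
      hits <- which(y == term)  # exact token match instead of grep()
      if (length(hits) == 0) NA_character_ else toString(hits)
    }, character(1))
  })
  names(positions) <- terms
  data.frame(Subject = subject, positions, stringsAsFactors = FALSE)
}

tdm <- c("Free", "New", "Limited", "Offer")
subject <- c("Free Free Free! Clear Cover with New Phone",
             "Offer ! Buy New phone and get earphone at 1000. Limited Offer!")
result <- word_positions(subject, tdm)
print(result)
```

Because punctuation marks are kept as separate tokens, the "!" after "Offer" counts as position 2, which is why "New" lands at position 4 in the second subject.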