I have a resulting data frame which has the following data:
                   word freq
credit           credit  790
account         account  451
xxxxxxxx       xxxxxxxx  430
report           report  405
information information  368
reporting     reporting  345
consumer       consumer  331
accounts       accounts  300
debt               debt  170
company         company  152
xxxxxx           xxxxxx  147
I want to remove the unwanted masked words (the runs of "x" characters). I am using tm_map to remove them as stopwords, but it seems it didn't work: I still get the unwanted words in the data frame above.
library(tm)

myCorpus <- Corpus(VectorSource(df$txt))
# Only lower-case variants are needed, since tolower runs before removeWords
myStopwords <- c(stopwords("english"),
                 "xx", "xxx", "xxxx", "xxxxx", "xxxxxx", "xxxxxxx", "xxxxxxxx")
# Wrap base functions in content_transformer so the corpus structure is preserved
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing = TRUE)
FreqMat <- data.frame(word = names(v), freq = v, row.names = NULL)
head(FreqMat, 10)
The code above didn't remove the unwanted words from the corpus. Is there any alternative way to deal with this issue?
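One likely reason the tm approach misses these tokens is that removeWords only deletes exact, whole-word matches, so every length of the masked token ("xx", "xxx", ...) has to be enumerated explicitly. A regex-based transformer avoids the enumeration; here is a minimal sketch against the corpus above (removeMasked is a helper name introduced here for illustration):

library(tm)

# removeMasked strips any standalone run of two or more x's, regardless of
# length, in a single pass instead of relying on an enumerated stopword list
removeMasked <- content_transformer(function(x) gsub("\\bx{2,}\\b", "", x, ignore.case = TRUE))
myCorpus <- tm_map(myCorpus, removeMasked)

If you prefer to clean the frequency table directly rather than the corpus, the following approaches work on the data frame itself.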
One possibility involving dplyr and stringr could be:
library(dplyr)
library(stringr)

df %>%
  mutate(word = tolower(word)) %>%
  # Keep only words containing at most one literal "x"
  filter(str_count(word, fixed("x")) <= 1)
         word freq
1      credit  790
2     account  451
3      report  405
4 information  368
5   reporting  345
6    consumer  331
7    accounts  300
8        debt  170
9     company  152
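Note that counting x's also drops any legitimate word containing two or more x's. If that is a concern, a stricter variant (assuming, as the sample data suggests, that the masked tokens consist of nothing but x's) keeps such words:

library(dplyr)
library(stringr)

df %>%
  # Drop only tokens made up entirely of x's; keep everything else
  filter(!str_detect(tolower(word), "^x+$"))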
Or a base R possibility using similar logic:
# Keep rows whose first column contains at most one literal "x"
df[sapply(df[, 1],
          function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1,
          USE.NAMES = FALSE), ]