I want to remove all punctuation from a character vector except for these four characters: +, ., -, /
I am aware that there are similar questions, but I have tried the corresponding solutions and did not get the result I was looking for.
The current character vector, item, has a lot of round and square brackets that I would like to get rid of.
Here is an example of what the item variable looks like:
item
BOYS S SLV MOCK LAYER TEE
BOYS S SLV PRINTED TEE
CHEAP MONDAY TEE (SAD TOP)
LOPPAN S SLV TEE (STRIPE)
FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE
LST-[REVISED]
Ultimately, I would like to generate a frequency count of the unique words in the variable item:
word freq
boys 2
s 3
slv 4
tee 4
tee-zebralogo 1
mock 1
layer 1
printed 2
cheap 1
... ...
This is my current code using the tm package:
library(tm)
item_names <- df1$item
item_names <- tolower(item_names)
item_names <- removePunctuation(item_names)
myCorpus <- Corpus(VectorSource(item_names))
myTDM <- TermDocumentMatrix(myCorpus)
findFreqTerms(myTDM)
m <- as.matrix(myTDM)
v <- sort(rowSums(m), decreasing = TRUE)
df4 <- data.frame(word = names(v), freq = v)
With the above code I am able to remove all the punctuation. However, I would like to preserve the four punctuation characters listed above, and I have not been able to do that satisfactorily.
I have also tried R's base functions:
item_names <- df1$item
item_names <- tolower(item_names)
item_names <- gsub(pattern = "[^[:alnum:][:space:][-\\.\\+\\/]]", "",
                   item_names)
item_names <- gsub(pattern = "\\s+", " ", item_names)
table(do.call(c, lapply(item_names, function(x) unlist(strsplit(x, " ")))))
df4 <- as.data.frame(table(do.call(c, lapply(item_names, function(x)
  unlist(strsplit(x, c(" ")))))))
View(df4)
The code immediately above doesn't seem to work, as it still fails to remove punctuation characters such as ( and ).
In the end, I would like to remove all punctuation characters except for +, ., -, / and generate word frequencies using the two approaches above.
Any help would be appreciated.
Given this example vector:
item_names <- c(
"BOYS S SLV MOCK LAYER TEE",
"BOYS S SLV PRINTED TEE",
"CHEAP MONDAY TEE (SAD TOP)",
"LOPPAN S SLV TEE (STRIPE)",
"FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
"LST-[REVISED]",
"(lot of round and square brackets that I would like to get rid [of]. )"
)
We could do:
gsub("([-\\.\\+\\/])|[[:punct:]]", "\\1", item_names)
[1] "BOYS S SLV MOCK LAYER TEE"
[2] "BOYS S SLV PRINTED TEE"
[3] "CHEAP MONDAY TEE SAD TOP"
[4] "LOPPAN S SLV TEE STRIPE"
[5] "FREE PRINTED SLV LESS TEE-ZEBRALOGO  SNAKE"
[6] "LST-REVISED"
[7] "lot of round and square brackets that I would like to get rid of. "
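The question also asks for word frequencies, so here is a minimal base R sketch that builds on the gsub() call above. It assumes a "word" is any whitespace-delimited token; the names cleaned, words and freq are only illustrative and do not come from the original code.

```r
# Same example vector as above.
item_names <- c(
  "BOYS S SLV MOCK LAYER TEE",
  "BOYS S SLV PRINTED TEE",
  "CHEAP MONDAY TEE (SAD TOP)",
  "LOPPAN S SLV TEE (STRIPE)",
  "FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
  "LST-[REVISED]",
  "(lot of round and square brackets that I would like to get rid [of]. )"
)

# Drop every punctuation character except - . + / and lower-case the result.
cleaned <- tolower(gsub("([-\\.\\+\\/])|[[:punct:]]", "\\1", item_names))

# Split on runs of whitespace; trimws() avoids empty tokens at the edges.
words <- unlist(strsplit(trimws(cleaned), "\\s+"))

# Count each distinct token and sort by descending frequency.
freq <- sort(table(words), decreasing = TRUE)
df4 <- data.frame(word = names(freq), freq = as.integer(freq),
                  stringsAsFactors = FALSE)
head(df4)
```

Because the period is preserved, a token such as "of." is counted separately from "of"; strip trailing periods first if that is not what you want.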