Remove all punctuation except certain characters to generate word frequencies?


I want to remove all punctuation from a character vector except for these four characters: +, ., -, /

I am aware that there are similar questions, but I have tried the corresponding solutions and did not get the result I was looking for.

The current character vector, item, has a lot of round and square brackets that I would like to get rid of.

Here is an example of what the item variable looks like:

item
BOYS S SLV MOCK LAYER TEE
BOYS S SLV PRINTED TEE
CHEAP MONDAY TEE (SAD TOP)
LOPPAN S SLV TEE (STRIPE)
FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE
LST-[REVISED]

Ultimately, I would like to generate a frequency count of the unique words in the variable item.

word          freq
boys          2
s             3
slv           4
tee           4
tee-zebralogo 1
mock          1
layer         1
printed       2
cheap         1
...           ...

This is my current code using the tm package:

item_names <- df1$item
item_names <- tolower(item_names)
item_names <- removePunctuation(item_names)
myCorpus <- Corpus(VectorSource(item_names))
myTDM <- TermDocumentMatrix(myCorpus)
findFreqTerms(myTDM)

m <- as.matrix(myTDM)
v <- sort(rowSums(m),decreasing=TRUE)
df4 <- data.frame(word = names(v),freq=v)

With the above code I am able to remove all the punctuation. However, I would like to preserve the four punctuation characters listed above, and I have not been able to do that satisfactorily.

I have also tried R's base functions:

item_names <- df1$item
item_names <- tolower(item_names)
item_names <- gsub(pattern = "[^[:alnum:][:space:][-\\.\\+\\/]]", "", 
item_names)
item_names <- gsub(pattern = "\\s+", " ", item_names)

table(do.call(c, lapply(item_names, function(x) unlist(strsplit(x, " ")))))
df4 <- as.data.frame(table(do.call(c, lapply(item_names, function(x) 
unlist(strsplit(x, c(" ")))))))
View(df4)

The code immediately above doesn't seem to work: it still fails to remove punctuation characters such as ( and ).

Eventually, I would like to remove all punctuation characters except for +, ., -, / and generate word frequencies using either of the two approaches above.

Any help would be appreciated.


Solution

  • Given an example:

    item_names <- c(
      "BOYS S SLV MOCK LAYER TEE",
      "BOYS S SLV PRINTED TEE",
      "CHEAP MONDAY TEE (SAD TOP)",
      "LOPPAN S SLV TEE (STRIPE)",
      "FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
      "LST-[REVISED]",
      "(lot of round and square brackets that I would like to get rid [of]. )"
    )
    

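    We could capture the four characters to keep and substitute them back with "\\1", while any other punctuation character matches the second alternative and is replaced with nothing: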

    gsub("([-\\.\\+\\/])|[[:punct:]]", "\\1", item_names)
    [1] "BOYS S SLV MOCK LAYER TEE"                                         
    [2] "BOYS S SLV PRINTED TEE"                                            
    [3] "CHEAP MONDAY TEE SAD TOP"                                          
    [4] "LOPPAN S SLV TEE STRIPE"                                           
    [5] "FREE PRINTED SLV LESS TEE-ZEBRALOGO  SNAKE"                        
    [6] "LST-REVISED"                                                       
    [7] "lot of round and square brackets that I would like to get rid of. "
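
To get from the cleaned strings to the word frequencies the question asks for, a minimal base R sketch could look like the following. It assumes the six sample items from the question, and it writes the same pattern without the backslashes, since ., + and / need no escaping inside a bracket expression. The names cleaned, words and freq are only illustrative.

```r
# Sample items taken from the question
item_names <- c(
  "BOYS S SLV MOCK LAYER TEE",
  "BOYS S SLV PRINTED TEE",
  "CHEAP MONDAY TEE (SAD TOP)",
  "LOPPAN S SLV TEE (STRIPE)",
  "FREE PRINTED SLV LESS TEE-ZEBRALOGO & SNAKE",
  "LST-[REVISED]"
)

# Keep - . + / (captured and put back via \\1); drop all other punctuation
cleaned <- gsub("([-.+/])|[[:punct:]]", "\\1", tolower(item_names))

# Removing "&" leaves a double space, so collapse whitespace runs and trim
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# Split on single spaces and discard any empty tokens
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))
words <- words[nzchar(words)]

# Count each unique word, most frequent first
freq <- sort(table(words), decreasing = TRUE)
df4 <- data.frame(word = names(freq), freq = as.vector(freq),
                  stringsAsFactors = FALSE)
```

With these inputs, df4 should list tee and slv with a count of 4, s with 3, and keep hyphenated tokens such as tee-zebralogo as single words. Unlike TermDocumentMatrix from tm, this approach does not drop short words such as "s", which may or may not be what you want.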