
R tm package's `removeWords` not removing twitter hashtags from tweets due to #


I am trying to remove hashtags from tweets using tm's function removeWords. The hashtags start with # as you know, and I want to remove these tags in their entirety. However, removeWords doesn't remove them:

> library(tm)
> removeWords(x = "WOW it is cool! #Ht https://google.com", words = c("#Ht", "https://google.com"))

[1] "WOW it is cool! #Ht "

If I remove the # from the words argument, the tag is removed:

> removeWords(x = "WOW it is cool! #Ht https://google.com", words = c("Ht", "https://google.com"))
[1] "WOW it is cool! # "

This leaves an orphan # behind.

Why is this happening? Shouldn't the function simply remove the words as given, or am I missing something? The manual is not very helpful here.


Solution

  • Unfortunately, I can't think of a great way around it. What you're seeing happens because removeWords builds a regular expression that wraps the words in word boundaries (\b). A word boundary only exists between a word character (letter, digit, or underscore) and a non-word character. "#" is a non-word character, and so is the space in front of it, so there is no boundary before "#Ht" and the pattern never matches. "Ht" on its own does match, because there is a boundary between "#" and "H", which is why the "#" gets left behind. I hope to see a better answer with a nice workaround, but you might just need to do something simple: in an initial pass, replace "#" with some keyword, then use that keyword in place of "#" when building your list of words to remove.
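
    To make the boundary issue and the placeholder idea concrete, here is a minimal sketch in base R. It assumes removeWords builds a pattern of the form \b(word1|word2)\b, as described above, and imitates that with gsub so it runs without tm. The placeholder "hashtagmarker" and the variable names are made up for illustration; pick any keyword that cannot occur in your tweets.

```r
x <- "WOW it is cool! #Ht https://google.com"

# Assumed shape of the pattern removeWords builds: the words wrapped in \b.
# There is no word boundary between " " and "#", so "#Ht" never matches.
naive <- gsub("(*UCP)\\b(#Ht|https://google.com)\\b", "", x, perl = TRUE)
naive
# [1] "WOW it is cool! #Ht "

# Workaround from the answer: swap "#" for a placeholder made of word
# characters, then remove the placeholder-prefixed tag like any other word.
placeholder <- "hashtagmarker"  # hypothetical keyword
x2 <- gsub("#", placeholder, x, fixed = TRUE)
words <- c(paste0(placeholder, "Ht"), "https://google.com")
pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
cleaned <- gsub(pattern, "", x2, perl = TRUE)

# Alternative if every hashtag should go, not only the listed ones:
# match "#" followed by word characters directly, with no \b in front.
direct <- gsub("#\\w+", "", x, perl = TRUE)
```

    With tm loaded, removeWords(x2, words) should be able to stand in for the second gsub call. Both workarounds leave extra spaces where the words used to be, so a whitespace cleanup pass (for example tm's stripWhitespace) is a sensible follow-up.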