Tags: r, regex, tweets

Remove hashtags from beginning and end of tweets in R


I am trying to remove hashtags from the end of strings in R. For example:

 x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

I want to remove the hashtags at the end of the string, which here are #movie and #lateNightThoughts. Expected result:

 - "I didn't know it could be #boring. guess I need some fun"

I tried:

stringi::stri_replace_last_regex(x,'#\\S+',"")

but it removes only the very last hashtag.

- "I didn't know it could be #boring. guess I need some fun #movie "

Any idea how to get the expected result?

Edit:

How about removing a hashtag from the beginning of the text? E.g.:

x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

Solution

  • You may use

    >  x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
    > sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
    [1] "I didn't know it could be #boring. guess I need some fun"
    

    Or, if you do not care what comes right before the first # of the trailing run of hashtags, you may simply use

    sub("(?:\\s*#\\w+)+\\s*$", "", x)
    

    See the regex demo.

    Details

    • \s* - zero or more whitespaces
    • \B - a non-word boundary: since # is a non-word char, the char right before it must be either the start of string or another non-word char (this keeps the pattern from matching a # inside a "word", as in abc#tag; if you do not need that check, you may remove it)
    • # - a # char
    • \w+ - 1 or more word chars (letters, digits or _)
    • (?:\s*#\w+)* - zero or more occurrences of:
      • \s* - zero or more whitespaces
      • # - a # char
      • \w+ - 1+ word chars
    • \s* - zero or more whitespaces
    • $ - end of string.
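
The edit in the question also asks about a hashtag at the beginning of the text, which the patterns above do not handle. Here is a minimal sketch of one way to do it, assuming the same #\w+ idea of a hashtag. The helper names strip_leading_hashtags and strip_trailing_hashtags are made up for illustration, and perl = TRUE is an assumption added here to pin the regex flavor to PCRE. The last lines also show what \B changes in practice.

```r
# Hypothetical helpers: the names are illustrative, not part of the answer above.
# perl = TRUE is an assumption made here so the patterns run under PCRE.

# Remove one or more hashtags (and the whitespace around them) at the start.
strip_leading_hashtags <- function(x) {
  sub("^\\s*(?:#\\w+\\s*)+", "", x, perl = TRUE)
}

# Remove the run of hashtags at the end (the answer's first pattern, with \B).
strip_trailing_hashtags <- function(x) {
  sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x, perl = TRUE)
}

x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
res <- strip_trailing_hashtags(strip_leading_hashtags(x))
print(res)
# [1] "I didn't know it could be #boring. guess I need some fun"

# Effect of \B: a # glued to a word is left alone with it, removed without it.
y <- "see issue#42 #done"
with_b    <- strip_trailing_hashtags(y)
without_b <- sub("(?:\\s*#\\w+)+\\s*$", "", y, perl = TRUE)
print(with_b)     # [1] "see issue#42"
print(without_b)  # [1] "see issue"
```

Both helpers are built on sub(), so they are vectorized and can be applied to a whole character vector of tweets at once. Note that the leading pattern only checks for # at the very start, so a tweet beginning with something like "#boring. guess" would lose "#boring" and keep the ". guess" part.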