
How do I extract hashtags from tweets in R?


I know this question has been asked here and here, but I ran into a small problem when I tried it out:

x<- str_extract("Hello peopllz! My new home is #crazy gr8! #wow", "#\S+")
Error: '\S' is an unrecognized escape in character string starting "#\S"

I changed the regex to "#(.+) ?" and then to "#\\s", but neither of them extracted the hashtags.

I then tried the gsub way:

x<- gsub("[^#(.+) ?]","","Hello! #London is gr8. #Wow")

It gave: " # . #"

Any ideas where I am going wrong? I'd like my output as a vector/list of all the hashtags in the tweet (without the hashes!).

Edit: I would prefer not to tokenize the tweet, because: 1. I am not tokenizing the tweets for the rest of my program, and 2. it would become a very expensive step were I to scale it to handle large volumes of tweets.


Solution

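  • Use "#\\S+" instead of "#\S+", and str_extract_all instead of str_extract (which returns only the first match).

    library(stringr)  # provides str_extract_all
    str_extract_all("Hello peopllz! My new home is #crazy gr8! #wow", "#\\S+")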
    # [[1]]
    # [1] "#crazy" "#wow"  
    

    There are two levels of parsing going on here. Before the low-level regexp function inside str_extract receives the pattern you want to search for (i.e. "#\S+"), the string literal is first parsed by R. R does not recognize \S as a valid escape sequence and throws an error. By escaping the backslash as \\, you tell R to pass \ and S to the regexp function as two ordinary characters, instead of trying to interpret them as one escape sequence.

    Side track

    This can produce rather bizarre expressions. Imagine that you have a list of addresses of computers on a Windows network, of the form "\\computer". To search for them you would need to type str_extract(adr, "\\\\\\w+"), which R turns into "\\\w+" internally before handing it to the regexp function: an escaped (literal) backslash followed by \w+.
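
The two levels of parsing described above can be checked directly. The following is a minimal sketch rather than part of the original answer: it uses base R only (regmatches and gregexpr in place of stringr), the variable names are made up for illustration, and the raw-string line assumes R 4.0.0 or later. It also strips the leading "#", as the question asked.

```r
# Base R only, so no packages are assumed to be installed.
tweet <- "Hello peopllz! My new home is #crazy gr8! #wow"

# Level 1: R's parser turns the literal "#\\S+" into four characters: # \ S +
pattern <- "#\\S+"
nchar(pattern)       # 4
writeLines(pattern)  # shows the pattern as the regex engine receives it

# Level 2: the regex engine reads \S as "any non-whitespace character".
tags <- regmatches(tweet, gregexpr(pattern, tweet, perl = TRUE))[[1]]

# Drop the leading "#" to get the bare hashtags.
hashtags <- sub("^#", "", tags)
hashtags             # "crazy" "wow"

# Side track: six backslashes in the literal leave three in the string.
unc_pattern <- "\\\\\\w+"
nchar(unc_pattern)   # 5: three backslashes, then w and +

# Assumption: R >= 4.0.0, where raw strings avoid the doubled backslash.
raw_pattern <- r"(#\S+)"
identical(raw_pattern, pattern)  # TRUE
```

Note that \S+ keeps any punctuation attached to a tag (for example "#wow!" would yield "wow!"); if that matters, "#\\w+" is a stricter alternative.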