regex, r, tweets

Identifying End-of-Line in Regular Expressions in R


I wrote a short piece of R code to extract hashtags from tweets:

m<-c(paste("Hello! #London is gr8. #Wow"," ")) # My tweet
#m<- c("Hello! #London is gr8. #Wow")

x<- unlist(gregexpr("#(\\S+)",m))
#substring(m,x)[1]

subs<-function(x){
  return(substring(m,x+1,(x-2+regexpr(" |\\n",substring(m,x)[1]))))
}

tag<- sapply(x, subs)
#x
tag

This code only works if I append a space to the end of the tweet. Without it, the last hashtag is not extracted. What could be the reason? I also tried matching \n, but that did not help.


Solution

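  • Your pattern " |\\n" looks for a space or a newline character, but the tweet ends with neither, so regexpr() returns -1 for the last hashtag. $ matches the end of a string, so use it in place of \\n: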

    m<- c("Hello! #London is gr8. #Wow")
    
    subs<-function(x){
      return(substring(m,x+1,(x-2+regexpr(" |$",substring(m,x)[1]))))
    }
    

    With the rest of your code intact:

    > tag
    [1] "London" "Wow"
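
To see why the original pattern failed, it may help to look at what regexpr() returns for the final hashtag on its own. The sketch below is only an illustration (the names last_tag and tags are made up here, not taken from the question); it also shows one possible alternative that avoids searching for the end of each tag by hand, using regmatches() to pull out the matched text directly.

```r
# Hypothetical names for illustration: last_tag, tags.
m <- "Hello! #London is gr8. #Wow"
last_tag <- "#Wow"   # what substring(m, x) yields for the final hashtag

as.integer(regexpr(" |\\n", last_tag))  # -1: no space or newline to find
as.integer(regexpr(" |$", last_tag))    # 5: zero-width match at the end of the string

# Alternative sketch: extract the matched text instead of computing offsets.
tags <- regmatches(m, gregexpr("#\\S+", m))[[1]]
tags <- sub("^#", "", tags)   # drop the leading "#"
tags
```

Note that \\S+ takes every non-space character, so a tweet ending in "#Wow." would give "Wow." with the period attached. If that matters for your data, a narrower pattern such as "#\\w+" might be worth trying.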