
Get Twitter @Username with Regex in R


How can I use regex in R to extract Twitter usernames from a string of text?

I've tried:

    library(stringr)
    theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'
    str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')

But I end up with @foobar, @foo and (@bar, which contains an unwanted parenthesis.

How can I get just @foobar, @foo and @bar as output?


Solution

  • Here's one method that works in R:

    theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'
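    theString1 <- unlist(strsplit(theString, " "))  # split into space-separated tokens
    # "@" not preceded by a word character or another "@", then 1-15 word characters
    regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
    idx <- grep(regex, theString1, perl = TRUE)     # indices of tokens containing a username
    theString1[idx]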
    [1] "@foobar" "@foo"    "(@bar)"
    

    If you want to use @Jerry's answer in R:

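    # the lookahead rejects a name that is followed by a dot, such as the domain of an email address
    regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
    idx <- grep(regex, theString1, perl = TRUE)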
    theString1[idx]
    [1] "@foobar" "@foo"    "(@bar)" 
    

    Both of these methods include the parentheses that you don't want, however, because grep() selects whole elements of theString1 rather than just the text the pattern matched.
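
    To see where the parenthesis comes from, here is a minimal sketch in base R. The sample string is an assumption (it uses a placeholder address, someone@example.com), and the names tokens, whole_tokens and matched_text are illustrative only. It shows that grep() keeps the whole token, and that even extracting only the matched text keeps the character in front of the @, because the leading group of the pattern consumes it.

    ```r
    # Assumed sample text; the email address is a placeholder
    theString <- "@foobar Foobar! and @foo (@bar) but not someone@example.com"
    tokens <- unlist(strsplit(theString, " "))
    regex <- "(^|[^@\\w])@(\\w{1,15})\\b"

    # grep() answers "which elements contain a match?", so value = TRUE
    # returns each matching token in full, punctuation included.
    whole_tokens <- grep(regex, tokens, perl = TRUE, value = TRUE)
    whole_tokens
    # [1] "@foobar" "@foo"    "(@bar)"

    # regmatches() returns only the matched text, but the group (^|[^@\\w])
    # is part of the match, so the space or "(" before "@" comes along.
    matched_text <- unlist(regmatches(theString, gregexpr(regex, theString, perl = TRUE)))
    matched_text
    # [1] "@foobar" " @foo"   "(@bar"
    ```

    The second result has the same shape as the output described in the question, which suggests the cause is the consumed leading character rather than anything specific to stringr.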

    UPDATE: This gets you from start to finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames):

    theString <- '@foobar Foobar! and @fo_o (@bar) but not [email protected]'
    theString1 <- unlist(strsplit(theString, " "))
    regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
    regex2 <- "[^[:alnum:]@_]"             # remove all punctuation except _ and @
    users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = TRUE)])
    users
    
    [1] "@foobar" "@fo_o"   "@bar"
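
    As an alternative sketch, not part of the original answer, the split-and-clean steps can be folded into a single extraction by swapping the consuming group for a negative lookbehind. This assumes PCRE (perl = TRUE) and a placeholder email address in the sample string. The lookbehind checks the character before the @ without making it part of the match.

    ```r
    # Assumed sample text; the email address is a placeholder
    theString <- "@foobar Foobar! and @fo_o (@bar) but not someone@example.com"

    # (?<![\\w@]) requires that "@" is NOT preceded by a word character or
    # another "@", without consuming that character; then 1-15 word characters.
    regex <- "(?<![\\w@])@\\w{1,15}\\b"

    # gregexpr() finds every match; regmatches() pulls out the matched text
    users <- regmatches(theString, gregexpr(regex, theString, perl = TRUE))[[1]]
    users
    # [1] "@foobar" "@fo_o"   "@bar"
    ```

    Unlike the gsub() cleanup, this leaves the input untouched and never strips characters from inside a token. Note that a name longer than 15 word characters is skipped entirely rather than truncated, since \\b cannot match in the middle of a run of word characters.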