r string twitter stringr

How to extract characters following a pattern and remove the rest?


I'm trying to create a retweet network from raw tweet text I have. The text is formatted like this:

tweet_vector <- c("RT @person: tweet tweet tweet",
                  "RT @otherperson: tweet tweet",
                  "Tweet, this isn't a retweet, @3rdperson.",
                  "RT @4thperson: this retweet also has a mention, @mentioned")

I want to create a function that returns the following:

[1] "person"
[2] "otherperson"
[3] NA
[4] "4thperson"

I can't just use str_extract(tweet_vector, "@\\w+"), because that would also catch @3rdperson, which is a mention and not a retweet.


Solution

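  • # Lookarounds: word characters preceded by "@" and followed by ":"
    str_extract(tweet_vector, "(?<=@)\\w+(?=:)")
    [1] "person"      "otherperson" NA            "4thperson"

    # Lookbehind only: word characters preceded by "RT @"
    str_extract(tweet_vector, "(?<=RT @)\\w+")
    [1] "person"      "otherperson" NA            "4thperson"

    # Base R: capture the handle; non-retweets give "" rather than NA
    sub(".*?@(\\w+):.*|.*", "\\1", tweet_vector)
    [1] "person"      "otherperson" ""            "4thperson"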
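Since the question asks for a function, here is a minimal base R sketch that needs no packages. The name extract_retweeted is hypothetical, and the pattern assumes a retweet always begins with "RT @user:" at the very start of the tweet, which is stricter than the patterns above. Unlike the sub() approach, it returns NA for tweets that are not retweets.

```r
# Hypothetical helper (not part of the original answer): returns the retweeted
# handle, or NA when the tweet does not start with "RT @user:".
extract_retweeted <- function(tweets) {
  # regexec() keeps capture groups; a non-matching element yields character(0)
  matches <- regmatches(tweets, regexec("^RT @(\\w+):", tweets))
  vapply(
    matches,
    # element 1 is the full match, element 2 is the captured handle
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
}

tweet_vector <- c("RT @person: tweet tweet tweet",
                  "RT @otherperson: tweet tweet",
                  "Tweet, this isn't a retweet, @3rdperson.",
                  "RT @4thperson: this retweet also has a mention, @mentioned")

extract_retweeted(tweet_vector)
# [1] "person"      "otherperson" NA            "4thperson"
```

Anchoring with ^ is a design choice: it avoids treating a mid-tweet "@user:" as a retweet, but it will also miss quoted or manually prefixed retweets, so drop the anchor if your data contains those.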