r, regex, dplyr, tidytext

How to extract key phrases following specific characters using regex in R?


I have a dataframe that looks like so:

ID | Tweet_ID | Tweet
1    12345      @sprintcare I did.
2    SPRINT     @12345 Please send us a Private Message.
3    45678      @apple My information is incorrect.
4    APPLE      @45678 What information is incorrect.

I would like to use something like a case_when statement to create a new field that holds the handle whenever a tweet is addressed to a company name, and is NA when the handle is purely numerical.

This is the code I have been experimenting with, without success:

tweet_pattern <- " @[^0-9.-]\\w+"

Customer <- Customer %>% 
           mutate(Response_To_Comp = ifelse(str_detect(Tweet, tweet_pattern), 
                                str_extract(Tweet, tweet_pattern), 
                                NA_character_))

Desired output:

ID | Tweet_ID | Tweet                                    | Response_To_Comp
1    12345      @sprintcare I did.                         sprintcare
2    SPRINT     @12345 Please send us a Private Message.   NA
3    45678      @apple My information is incorrect.        apple
4    APPLE      @45678 What information is incorrect.      NA

Solution

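  • You can use a lookbehind regex to extract the text that immediately follows '@' and consists of one or more letters (A-Za-z). Numeric handles such as @12345 do not match, and str_extract() already returns NA when there is no match, so no ifelse() wrapper is needed.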

    library(dplyr)
    library(stringr)
    
    tweet_pattern <- "(?<=@)[A-Za-z]+"
    
    df %>% mutate(Response_To_Comp = str_extract(Tweet, tweet_pattern))
    
    #  ID Tweet_ID                                    Tweet Response_To_Comp
    #1  1    12345                       @sprintcare I did.       sprintcare
    #2  2   SPRINT @12345 Please send us a Private Message.             <NA>
    #3  3    45678      @apple My information is incorrect.            apple
    #4  4    APPLE    @45678 What information is incorrect.             <NA>
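
    For anyone who wants to run this end to end, here is a minimal self-contained sketch. The data frame is rebuilt by hand from the sample in the question. The handle_pattern variant and the "@apple_support2" example string are hypothetical and not part of the original question; they only illustrate how the pattern could be widened if real handles contain digits or underscores after the first letter.

```r
library(dplyr)
library(stringr)

# Sample data reconstructed by hand from the question
df <- data.frame(
  ID = 1:4,
  Tweet_ID = c("12345", "SPRINT", "45678", "APPLE"),
  Tweet = c(
    "@sprintcare I did.",
    "@12345 Please send us a Private Message.",
    "@apple My information is incorrect.",
    "@45678 What information is incorrect."
  ),
  stringsAsFactors = FALSE
)

# The original attempt requires a space before '@', so it cannot match
# a handle at the very start of the tweet
old_pattern <- " @[^0-9.-]\\w+"
old_result <- str_extract(df$Tweet, old_pattern)

# Lookbehind: match letters that directly follow '@', without capturing '@'
tweet_pattern <- "(?<=@)[A-Za-z]+"

result <- df %>%
  mutate(Response_To_Comp = str_extract(Tweet, tweet_pattern))

# Hypothetical variant (an assumption, not from the question): the handle
# must start with a letter but may continue with digits or underscores
handle_pattern <- "(?<=@)[A-Za-z]\\w*"
variant <- str_extract(c("@apple_support2 hello", "@12345 hello"), handle_pattern)

print(result)
```

    With the sample data, old_result is NA for every row, which is why the original attempt produced nothing useful. The letters-only pattern stops at the first non-letter, so a handle such as "@sprint123" would come back as "sprint"; the wider variant is one way to avoid that if it matters for your data.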