
Extracting Tweets in R Based on Content (keywords)


I have a bunch of tweets that I parsed into a CSV file (so I have fields for user/text/date/latitude/longitude, etc.).

I read these tweets into a dataframe in R and did some basic visualizations for fun (tweet frequency over time, etc.).

Now, I want to subset tweets in the dataframe that contain specific keywords. For example, for fun I wanted to be able to have one dataframe that was subset by having mentions related to "Hillary Clinton" and another for "Donald Trump" and yet another for "Drake" and "Meek Mill".

So for example, for Hillary/Trump, I would expect tweets containing the following phrases would be relevant:

"Hillary Clinton", "HillaryClinton", "hillary clinton", "hillaryclinton"

Similarly for Trump, if it contained

"Donald Trump", "DonaldTrump", "donald trump", "donaldtrump"

It'd probably grab most pertinent tweets (I assume the above filter criteria would pull things like mentions - e.g. @HillaryClinton - and hashtags - e.g. #HillaryClinton).

So, I need to subset the dataframe using different sets of keywords to pull pertinent tweets. My guess is probably to use grep but I'm not sure how to figure out the regular expression that goes into this for each of my use-cases.

Could anyone help me figure that out but also help me understand how they made the regular expression if that's possible at all :(? I don't want to come here and ask every time I need to use regex...

Thanks!

EDIT: Following the example from the first post, I tried:

hillary_df <- subset(tweets_df, grep("[hH]illary ?[Cc]linton", tweets_df$text, value=FALSE))

But this only returns the specific cells in the column "text" that match. I want every row of the initial df whose "text" value matches, with all of its columns.

EDIT2: D'oh, needed to use brackets to subset.

hillary_df <- tweets_df[grep("[hH]illary ?[Cc]linton", tweets_df$text, value=FALSE), ]

But the resulting df has a lot of values.


Solution

  • You can build a pattern for each keyword set along these lines:

    [hH]illary ?[Cc]linton
    

    Demo: https://regex101.com/r/tEcDNY/2
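
To read the pattern: [hH] matches either an upper- or lowercase "h", " ?" makes the space optional (so both "Hillary Clinton" and "HillaryClinton" match), and [Cc] does the same for the "c". Because the pattern is not anchored, it also matches inside "@HillaryClinton" and "#HillaryClinton". Below is a minimal sketch of how this could be applied to the other keyword sets. It is one possible approach, not the only one: the sample data and the helper subset_by_keyword are hypothetical, and it swaps the character classes for ignore.case = TRUE, which is slightly broader than the pattern above (it also matches e.g. "HILLARY CLINTON"). It uses grepl(), which returns TRUE/FALSE per row, instead of grep().

```r
# Hypothetical sample data standing in for the asker's tweets_df
tweets_df <- data.frame(
  user = c("a", "b", "c", "d", "e"),
  text = c("I saw Hillary Clinton speak today",
           "#HillaryClinton rally downtown",
           "Donald Trump on TV again",
           "@donaldtrump said something",
           "New Drake track is out"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep every row (with all columns) whose text matches
subset_by_keyword <- function(df, pattern) {
  # grepl() gives one TRUE/FALSE per row; ignore.case = TRUE covers all casings
  keep <- grepl(pattern, df$text, ignore.case = TRUE)
  df[keep, , drop = FALSE]
}

hillary_df <- subset_by_keyword(tweets_df, "hillary ?clinton")
trump_df   <- subset_by_keyword(tweets_df, "donald ?trump")
# "|" means "or", so one pattern can cover several names
rap_df     <- subset_by_keyword(tweets_df, "drake|meek ?mill")
```

Indexing with the logical vector from grepl() keeps whole rows, and a missing (NA) text value simply counts as a non-match rather than producing an NA row.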