Tags: r, dataframe, twitter, tweets

Removing retweets from data frame in R based on text column


I pulled tweets from Twitter using the academictwitteR package. I would now like to remove all retweets, i.e. tweets whose text in the first column, "text", starts with "RT" (e.g. the third row). You can download a similar data frame of Trump's tweets from GitHub: https://github.com/cbail/cbail.github.io/blob/master/Trump_Tweets.Rdata

Unlike that example, my data frame has no column called "is_retweet", which makes it more difficult.

The output from my data frame looks like this (I have removed some redundant columns to make it clearer):

[Image: screenshot of the data frame]

Thank you in advance for any suggestions


Solution

  • You can use a regular expression to keep only the rows whose text does not start with 'RT'. The pattern ^(?!RT) is a negative lookahead, so it needs perl = TRUE. If your data is in a data frame called tweets, something like this should work:

    tweets[grepl("^(?!RT)", tweets$text, perl = TRUE),]
    

    Or, if you're using the tidyverse:

    library(dplyr)

    # Keep rows whose text does not start with "RT"
    tweets %>% 
      filter(grepl("^(?!RT)", text, perl = TRUE))
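
As a variation on the above, here is a minimal base R sketch that matches the retweet prefix directly and negates the result, which avoids the lookahead. The toy data frame and the stricter pattern "^RT @" are assumptions for illustration: the sketch assumes retweets in your data begin with "RT @username", so that ordinary tweets that merely start with the letters "RT" are kept.

```r
# Toy data standing in for the real tweets (hypothetical values);
# the column name `text` follows the question.
tweets <- data.frame(
  id = 1:3,
  text = c("Great rally tonight!",
           "RT @someone: Great rally tonight!",
           "RTX cards are expensive"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose text starts with "RT @" (assumed retweet prefix)
is_retweet <- grepl("^RT @", tweets$text)

# Keep only the rows that are not retweets
originals <- tweets[!is_retweet, , drop = FALSE]
print(originals$text)
```

Note that grepl() returns FALSE for missing values, so rows with NA in text are kept by this version, whereas the lookahead version drops them.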