r, duplicates, unique

Using unique function on a specific column


I have a dataframe with Twitter data where the tweet message is in the first column (text) and the number of retweets is in the second column (retweetCount). I would like to remove rows where the tweet message is repeated.

In the past, I've used the unique function to remove duplicate observations from a dataframe, like so: df_no_duplicates <- unique(df). But for my Twitter data, this would only remove rows where both the exact text and the exact retweetCount are repeated. Can I tell the unique function to work only on the text column? If possible, I would also like to apply the following logic: IF text is repeated in the dataframe, THEN keep only the observation with the greatest retweetCount.
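To make the behaviour described above concrete, here is a minimal sketch on a made-up data frame (the name toy and its values are invented for illustration, not taken from the real data). It shows that unique() compares whole rows, so a repeated text with a different retweetCount survives:

```r
# Hypothetical toy data: "same tweet" appears three times,
# twice with an identical retweetCount and once with a different one.
toy <- data.frame(
  text = c("same tweet", "same tweet", "same tweet", "other tweet"),
  retweetCount = c(10, 10, 25, 3),
  stringsAsFactors = FALSE
)

# unique() on a data frame drops only rows that match in every column.
no_dups <- unique(toy)

nrow(no_dups)                        # 3: only the exact duplicate row is gone
sum(no_dups$text == "same tweet")    # 2: the text is still repeated
```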

Here's a reproducible sample of my data (although I'm not sure if there are any repeat messages in the first 50 rows):

dput(head(df, 50))

structure(list(text = c("as always making sense of it all for us ive never felt less welcome in this country brexit  ", 
"never underestimate power of stupid people in a democracy brexit", 
"a quick guide to brexit and beyond after britain votes to quit eu  ", 
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref  ", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit ", 
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal ", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"absolutely brilliant poll on brexit by ", "think the brexit campaign relies on the same sort of logic that drpepper does whats the worst that can happen thingsthatarewellbrexit", 
"am baffled by nigel farages claim that brexit is a victory for real people as if the 47 voting remain are fucking smu", 
"not one of the uks problems has been solved by brexit vote migration inequality the uks centurylong decline as", 
"scotland should never leave eu  calls for new independence vote grow  brexit", 
"the most articulate take on brexit is actually this ft reader comment today ", 
"david cameron has said he is set to resign as british prime minister after uk votes to leave eu brexit ", 
"im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain", 
"life is too short to wear boring shoes  brexit", "pm at buckingham palace for audience with the queen  brexit", 
"i hate people too but i dont think id vote for armageddon over it brexit", 
"text = when you send a message\r\n\r\nsext = when you send a sexy message\r\n\r\nbrexit = when you send an entire global economy to he", 
"i actually was pretty confident that the brits wouldnt vote for a brexit  didnt see this coming", 
"pm at buckingham palace for audience with the queen  brexit", 
"now just the time can say if it is the right decision brexit", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"that was whatever your view on brexit a superb speech hope next pm will be as good a statesman as david cameron ", 
"david cameron to step down as over 52pc of britains vote to leave the european union brexit", 
"between brexit and euro2016 england have got a few johnsons to worry about so heres a quick guideeurefresults ", 
"scotland voted overwhelmingly to remain in the eu  ", "brexit is great enough on the merits but watching the tears and tantrums is the icing on the cake ", 
"the nightmare has begun it will be a long one todays column on brexit ", 
"brexit why premier league clubs may be unable to sign foreign players under age of 18\r\n ", 
"brexit why premier league clubs may be unable to sign foreign players under age of 18\r\n ", 
"cant think about brexit without thinking about this ", "brexit likely to help rajoy win sundays election but could be nightmare for him if he gets to govern given economic fragil", 
"trump praises uk public for taking back control of country   brexit", 
"expert many feel globalisation isnt working for them yes mate thats the 999 of punters who it is not working for abc730 brexit", 
"cornwall votes against europe then expects to keep eu funding good luck with that ", 
"weve done it without a bullet being fired  nigel farage forgetting that a member of parliament was assassinated over b", 
"londoners call for capital to gain independence after brexit vote  ", 
"12 trump and brexit are direct results of pressure on working class when big companies bow down to", 
"just a reminder that the brexit newspapers were easily worth more than a 2 swing  none of the men who own them pay the", 
"i always loved gb  thought about moving there some day but the decision they made yesterday is really shocking  disa", 
"winter is coming gameofthrones brexit ", "the most articulate take on brexit is actually this ft reader comment today ", 
"aw\r\n\r\ni worry that the brexit thing will justaid tyrannys spread", 
"breaking brexit spain proposes shared sovereignty over gibraltar", 
"the entirety of scotland voted to remain you imbecile brexit ", 
"diane calling it right again \r\nthe dispossessed voted for brexit jeremy corbyn offers real change\r\nhttp"
), retweetCount = c(0, 251, 39, 0, 6462, 0, 1391, 31595, 15, 
6462, 20521, 0, 871, 10, 184, 1239, 143, 0, 0, 218, 0, 3482, 
0, 218, 0, 31595, 0, 25, 777, 14, 404, 6, 1, 0, 10756, 4, 198, 
0, 666, 12387, 609, 0, 237, 1, 0, 1239, 0, 2431, 6, 84)), .Names = c("text", 
"retweetCount"), row.names = c(NA, 50L), class = "data.frame")

Solution

  • The reprex data needs a little work, but I think this will work in general, using dplyr from the tidyverse:

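    library(tidyverse)
    
    df2 <- df %>%
      group_by(text) %>%                                # one group per distinct tweet text
      summarise(retweetCount = max(retweetCount)) %>%   # keep the highest retweet count in each group
      distinct()                                        # drop any remaining exact duplicate rows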
    

    I can't test this on your data. Since summarise already returns one row per distinct text, the final distinct call is probably not necessary.
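For comparison, the same "keep the row with the greatest retweetCount per text" logic can be sketched in base R without any packages. This is an alternative idiom, not the approach from the answer above, and the toy data frame and its values are made up for illustration. The idea is to sort by retweetCount in descending order and then keep the first occurrence of each text with duplicated():

```r
# Hypothetical toy data standing in for the real tweets.
toy <- data.frame(
  text = c("a brexit", "b brexit", "a brexit", "c brexit", "b brexit"),
  retweetCount = c(5, 0, 12, 3, 7),
  stringsAsFactors = FALSE
)

# Sort so that rows with the highest retweetCount come first.
sorted <- toy[order(toy$retweetCount, decreasing = TRUE), ]

# duplicated() flags every occurrence of a text after the first one,
# so negating it keeps the highest-count row for each text.
deduped <- sorted[!duplicated(sorted$text), ]
deduped
```

Because this subsets whole rows, any additional columns in the data frame are carried along with the retained observation, whereas a summarise call returns only the grouping column and the summarised columns.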