Tags: r, dplyr, rtweet

Combine nest() and aggregate() in R?


Looking for some help and advice:

I harvested tweets with the rtweet package. That gave me a data frame with the observations (i.e. tweets) in the rows and the variables in the columns. Variables are both on the tweet level (e.g. text, likes, hashtags, etc.) and on the account level (number of followers, bio, etc.). I ran sentiment analysis on the tweets, which added tweet-level sentiment scores to the data frame.

To simulate what my data now looks like (in reality I have 100,000+ obs. and 115 vars):

df <- data.frame(users = c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1'),
                 text = c('this is u1 first tweet',
                          'this is another tweet',
                          'hello hello',
                          'hashtag tweettext',
                          'tweet text',
                          'this is u1 second tweet',
                          'this is u6 first tzeet',
                          'this is u6 second tweet',
                          'this is u6 third tweet',
                          'this is u1 third tweet'),
                 likes = sample(1:10, 10),
                 sentiment = rnorm(10, mean = 0, sd = 1),
                 followers = c(111, 200, 300, 400, 500, 111, 666, 666, 666, 111),
                 bio = paste0(rep('lorem ipsum', 10), " ", c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1')))
   users                    text likes   sentiment followers            bio
1     u1  this is u1 first tweet     1  0.96445407       111 lorem ipsum u1
2     u2   this is another tweet    10  1.03840459       200 lorem ipsum u2
3     u3             hello hello     7  1.76887362       300 lorem ipsum u3
4     u4       hashtag tweettext     5 -0.57165015       400 lorem ipsum u4
5     u5              tweet text     4 -1.47028289       500 lorem ipsum u5
6     u1 this is u1 second tweet     2 -1.11036644       111 lorem ipsum u1
7     u6  this is u6 first tzeet     3  0.25440339       666 lorem ipsum u6
8     u6 this is u6 second tweet     8  0.02334468       666 lorem ipsum u6
9     u6  this is u6 third tweet     9 -2.71592529       666 lorem ipsum u6
10    u1  this is u1 third tweet     6  1.18528925       111 lorem ipsum u1

Now I would like to work on the user account level. For this, I want to compute the mean likes and mean sentiment per user and, at the same time, combine all the tweet texts per user into one vector (one long string is fine too). The bios should not be combined.

In general, the aggregation is not a problem:

df %>%
  group_by(users) %>%
  summarise(meanlikes = mean(likes),
            meansentiment = mean(sentiment))

In terms of nesting the data, I got as far as this:

df %>%
  select(-likes, -sentiment) %>%
  nest(-users, -followers, -bio)

Combining the two together in one piece of code doesn't do anything meaningful. I ran the two operations separately and used inner_join() which seems to work fine, but this method is very cumbersome as I have 115 variables.

d1 <- df %>%
  select(-likes, -sentiment) %>%
  nest(-users, -followers, -bio)

d2 <- df %>%
  group_by(users) %>%
  summarise(meanlikes = mean(likes),
            meansentiment = mean(sentiment))

d1 <- d1 %>%
  inner_join(d2, by = "users")

Any suggestions?

To be clear, what I am looking for is a method or bit of code that gives me this data frame:

  users                                                                    text followers
1    u1 this is u1 first tweet, this is u1 second tweet, this is u1 third tweet       111
2    u2                                                   this is another tweet       200
3    u3                                                             hello hello       300
4    u4                                                       hashtag tweettext       400
5    u5                                                              tweet text       500
6    u6 this is u6 first tzeet, this is u6 second tweet, this is u6 third tweet       666
             bio meanlikes meansentiment
1 lorem ipsum u1  4.333333    -0.2846824
2 lorem ipsum u2  6.000000    -0.5443194
3 lorem ipsum u3  2.000000     1.8001123
4 lorem ipsum u4  4.000000     1.0114402
5 lorem ipsum u5  9.000000    -0.5637166
6 lorem ipsum u6  7.000000     1.2346833

Hope you can help me out here!


Solution

  • You can group_by users and keep the first value of bio and followers, since they are the same for every row of a given user. Take the mean of likes and sentiment, and collapse text into one comma-separated string using toString().

    library(dplyr)
    
    df %>%
      group_by(users) %>%
      summarise(across(c(bio, followers), first),
                across(c(likes, sentiment), mean), 
                text = toString(text))
    
    #  users bio      followers likes sentiment text             
    #  <chr> <chr>        <dbl> <dbl>     <dbl> <chr>            
    #1 u1    lorem i…       111  6.67    0.0870 this is u1 first…
    #2 u2    lorem i…       200  8      -0.945  this is another …
    #3 u3    lorem i…       300  6       0.225  hello hello      
    #4 u4    lorem i…       400  3       0.359  hashtag tweettext
    #5 u5    lorem i…       500  5      -0.664  tweet text       
    #6 u6    lorem i…       666  4.33    0.206  this is u6 first…
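
If you would rather get the column names from the question (meanlikes, meansentiment) and keep each user's tweets as a vector instead of one long string, a variant along these lines may work. This is only a sketch: it assumes dplyr >= 1.0 (for across() and its .names argument), and the small df below is made-up stand-in data, not the real tweets.

```r
library(dplyr)

# Stand-in data for illustration only (values are invented)
df <- data.frame(
  users     = c("u1", "u2", "u1", "u3", "u3"),
  text      = c("a", "b", "c", "d", "e"),
  likes     = c(1, 10, 3, 4, 6),
  sentiment = c(0.5, -1, 1.5, 0, 2),
  followers = c(111, 200, 111, 300, 300),
  bio       = c("bio u1", "bio u2", "bio u1", "bio u3", "bio u3"),
  stringsAsFactors = FALSE
)

by_user <- df %>%
  group_by(users) %>%
  summarise(
    # account-level columns are constant within a user, so keep the first value
    across(c(bio, followers), first),
    # tweet-level scores: average them and prefix the new names with "mean"
    across(c(likes, sentiment), mean, .names = "mean{.col}"),
    # keep the tweets as a list column: one character vector per user
    text = list(text),
    .groups = "drop"
  )

by_user
```

Each element of by_user$text is then a character vector with that user's tweets (for example, by_user$text[[1]] holds the tweets of u1). Swapping list(text) back to toString(text) gives the single comma-separated string from the answer above.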