
Aggregation and mean calculation with dplyr


I have a chunk of code that aggregates the timestamps of a large dataset (see below). Each timestamp represents a tweet. The code counts the tweets per week, and it works fine. I also have a column with the sentiment value of each tweet, and I would like to know whether it is possible to calculate the mean sentiment of the tweets per week. Ideally I would end up with one dataset containing the number of tweets per week and the mean sentiment of those tweets. Please let me know if you've got some hints :)

Kind regards, Daniel

weekly_counts_2 <- df_bw %>% 
  drop_na(Timestamp) %>%             
  mutate(weekly_cases = floor_date(   
    Timestamp,
    unit = "week")) %>%            
  count(weekly_cases) %>%
  tidyr::complete(                
    weekly_cases = seq.Date(          
      from = min(weekly_cases),      
      to = max(weekly_cases),         
      by = "week"),                   
    fill = list(n = 0))

Solution

  • It is difficult to verify the answer since no data has been shared, but based on the description, here is a solution you can try.

    library(dplyr)
    library(tidyr)
    library(lubridate)
    
    weekly_counts_2 <- df_bw %>%
      drop_na(Timestamp) %>%
      mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
      group_by(weekly_cases) %>%
      summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
                count = n()) %>%
      # fill must name the count column created above; weeks without tweets
      # get count = 0 and keep mean_sentiment = NA
      complete(weekly_cases = seq.Date(min(weekly_cases),
                                       max(weekly_cases), by = "week"),
               fill = list(count = 0))
    

    I have assumed that the sentiment column is called sentiment_value; change the name to match your data.
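
Since no data was shared, here is a minimal sketch of the same approach on a made-up data frame. The name `df_demo`, its values, and the column name `sentiment_value` are assumptions for illustration only. The example leaves one week without tweets on purpose, so you can see what `complete()` does with it: `fill` has to name the count column created in `summarise()` (here `count`), and a week with no tweets has no mean, so `mean_sentiment` stays `NA` there.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data: four dated tweets plus one with a missing timestamp.
# The week starting 2021-01-10 contains no tweets.
df_demo <- tibble(
  Timestamp = as.Date(c("2021-01-04", "2021-01-05", "2021-01-06",
                        "2021-01-19", NA)),
  sentiment_value = c(0.5, -0.5, 0.3, 1.0, 0.2)
)

weekly_demo <- df_demo %>%
  drop_na(Timestamp) %>%                        # drop tweets without a timestamp
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  group_by(weekly_cases) %>%
  summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
            count = n()) %>%                    # one row per week that has tweets
  complete(weekly_cases = seq.Date(min(weekly_cases),
                                   max(weekly_cases), by = "week"),
           fill = list(count = 0))              # add empty weeks with count = 0

print(weekly_demo)
```

If you would rather keep the column name `n` from the original `count()` call, write `n = n()` in `summarise()` and `fill = list(n = 0)` in `complete()`. By default `floor_date()` starts weeks on Sunday; it has a `week_start` argument if your weeks should start on another day.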