Search code examples
rdataframedplyrdata.tablesummary

Subsetting and Summarising (sum, mean, and standard deviation) Multi-level data by month over years using data.table() and dplyr() in R


Problem

I have a data frame called 'FID' (see below) that contains the frequency of FID per month over three years.

I would like to subset my data frame by calculating the total frequency of FID over 3 years using the packages data.table and dplyr, and then I want to summarise my data so it contains:-

Summarised Data frame

  1. Month
  2. Total frequency of FID per month over 3 years
  3. Mean frequency of FID per month over 3 years
  4. Standard deviation of FID per month over 3 years
  5. Standard error of FID per month over 3 years
  6. Lower and upper confidence levels per month over 3 years

I cannot figure out how to subset the data frame by adding the grand total frequencies of FID per month over 3 years. For instance, in this example below, the total frequency for the month of January over three years is - 86 + 66 + 56 = 208, and I would like to do the same for every month.

 ###What a section of my data frame looks like      ##Desired outcome
 Year     Month       FID                             Month       FID  
 2018    January       86                             January     208
 2019    January       66                             February    176
 2020    January       56
 2018    February      76
 2019    February      55
 2020    February      45

Desired Columns for the Summarised Dataframe

Month   Total_FID    Mean_FID   SD_FID   S.E_FID  Ci_Lower  Ci_Upper

I am not an advanced R user and I have read many Stack Overflow pages and tutorials online and I cannot figure out the right procedure (see my R-code below) to produce my desired summarised data frame. I also cannot find a method to produce the upper and lower confidence intervals using the package data.table. It would be really handy to know how to do this procedure using both data.table and dplyr since I use both packages most often.

If anyone is able to help, I would be deeply appreciative.

Many thanks in advance

R-code

Using data.table

   ##Reformat into a data.table object
   FID_Table<-data.table(FID)

   ##Summary statistics
   FID.Summarised=FID_Table[, sum(FID), 
                              Month=.N,
                              Mean_FID=mean(FID),
                              SD_FID=sd(FID),
                              S.E = std.error(FID),
                              by=Month]
##Error message
Error in `[.data.table`(FID_Table, , sum(FID), Month = .N, Mean_FID = mean(FID),  : 
  unused arguments (Month = .N, Mean_FID = mean(FID), SD_FID = sd(FID), S.E = std.error(FID))

Using dplyr

   ###Create a vector to ensure the dates are in the right order
library(tidyverse) #loads dplyr and ggplot2


month_levels = c('January', 'February', 'March', 'April', 'May', 'June', 
                 'July', 'August', 'September', 'October', 'November', 'December')


 p<-FID.Counts %>%
     dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
        group_by(Month) %>% 
             dplyr::summarise(Mean.Month=mean(FID), 
                   sd.month = sd(FID, na.rm = TRUE),
                      n_FID=n(),
                        sem=sd(FID)/sqrt(n()), 
                         ci_low=Mean.Month-1.96*sem, 
                            ci_hi=Mean.Month+1.96*sem) %>% 
                               ungroup()       

                        
##The frequency of FID per month over three years is missing (see FID.Count below)

 This code produces the summary statistics but does not subset the 
 data frame by the frequency counts of FID per month over the last 3 
 years.

 #This table below summed the FID frequencies for each month using 
  this code using data.table. 

 library(data.table)

 FID.Count=FID_Table[, sum(FID), by=(Month)]

      Month  V1
 1:   January 165
 2:  February 182
 3:     March 179
 4:     April 104
 5:       May 124
 6:      June  10
 7:      July  15
 8:    August 133
 9: September  97
10:   October  82
11:  November  75
12:  December 102

Summary Statistics produced by the R-code using dplyr

       Month Mean.Month  sd.month n_FID       sem      ci_low     ci_hi
1    January  55.000000 10.535654     3  6.082763  43.0777854 66.922215
2   February  60.666667 29.737743     3 17.169094  27.0152431 94.318090
3      March  59.666667 33.291641     3 19.220938  21.9936289 97.339704
4      April  34.666667 16.862186     3  9.735388  15.5853064 53.748027
5        May  41.333333 49.571497     3 28.620117 -14.7620965 97.428763
6       June   3.333333  5.773503     3  3.333333  -3.2000000  9.866667
7       July   5.000000  4.358899     3  2.516611   0.0674415  9.932558
8     August  44.333333 21.007935     3 12.128937  20.5606169 68.106050
9  September  32.333333 21.548395     3 12.440972   7.9490287 56.717638
10   October  27.333333 13.051181     3  7.535103  12.5645314 42.102135
11  November  25.000000 19.000000     3 10.969655   3.4994760 46.500524
12  December  34.000000  4.582576     3  2.645751  28.8143274 39.185673

Data frame: 'FID'

     structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L), Month = structure(c(5L, 4L, 8L, 1L, 9L, 
7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 
12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 
10L, 3L), .Label = c("April", "August", "December", "February", 
"January", "July", "June", "March", "May", "November", "October", 
"September"), class = "factor"), FID = c(65L, 88L, 43L, 54L, 
98L, 0L, 0L, 23L, 10L, 15L, 6L, 33L, 56L, 29L, 98L, 23L, 6L, 
10L, 7L, 65L, 53L, 41L, 25L, 30L, 44L, 65L, 38L, 27L, 20L, 0L, 
8L, 45L, 34L, 26L, 44L, 39L)), class = "data.frame", row.names = c(NA, 
-36L))

Solution

  • Answer: data.table

    ##Reformat into a data.table object
       FID_Table<-data.table(FID)
    
    ##Summarise Data
    Summarised.FID<-FID_Table[, .(FID.Freq=sum(FID),
                                  mean = mean(FID),
                                  sd=sd(FID),
                                  median=median(FID)), 
                                  by = .(Month)]
    
    ##data.table results
    
            Month FID.Freq      mean        sd median
     1:   January      165 55.000000 10.535654     56
     2:  February      182 60.666667 29.737743     65
     3:     March      179 59.666667 33.291641     43
     4:     April      104 34.666667 16.862186     27
     5:       May      124 41.333333 49.571497     20
     6:      June       10  3.333333  5.773503      0
     7:      July       15  5.000000  4.358899      7
     8:    August      133 44.333333 21.007935     45
     9: September       97 32.333333 21.548395     34
    10:   October       82 27.333333 13.051181     26
    11:  November       75 25.000000 19.000000     25
    12:  December      102 34.000000  4.582576     33
    

    Answer: dplyr

    #####Summarise Data###########
    
      ###Create a vector to ensure the dates are in the right order
         library(tidyverse) #loads dplyr and ggplot2
    
    
    month_levels = c('January', 'February', 'March', 'April', 'May', 'June', 
                     'July', 'August', 'September', 'October', 'November', 'December')
    
        ##Summarise data
    
        p<-FID.Counts %>%
             dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
                group_by(Month) %>% 
                     dplyr::summarise(Frequency = sum(FID),
                       Mean.Month=mean(FID), 
                           sd.month = sd(FID, na.rm = TRUE),
                              n_FID=n(),
                                sem=sd(FID)/sqrt(n()), 
                                 ci_low=Mean.Month-1.96*sem, 
                                    ci_hi=Mean.Month+1.96*sem) %>% 
                                       ungroup()  
    
    ##dplyr Results
    
          Month Frequency Mean.Month  sd.month n_FID       sem      ci_low     ci_hi
    1    January       165  55.000000 10.535654     3  6.082763  43.0777854 66.922215
    2   February       182  60.666667 29.737743     3 17.169094  27.0152431 94.318090
    3      March       179  59.666667 33.291641     3 19.220938  21.9936289 97.339704
    4      April       104  34.666667 16.862186     3  9.735388  15.5853064 53.748027
    5        May       124  41.333333 49.571497     3 28.620117 -14.7620965 97.428763
    6       June        10   3.333333  5.773503     3  3.333333  -3.2000000  9.866667
    7       July        15   5.000000  4.358899     3  2.516611   0.0674415  9.932558
    8     August       133  44.333333 21.007935     3 12.128937  20.5606169 68.106050
    9  September        97  32.333333 21.548395     3 12.440972   7.9490287 56.717638
    10   October        82  27.333333 13.051181     3  7.535103  12.5645314 42.102135
    11  November        75  25.000000 19.000000     3 10.969655   3.4994760 46.500524
    12  December       102  34.000000  4.582576     3  2.645751  28.8143274 39.185673