Search code examples
rstatisticsmeanstandard-deviation

How to calculate summary stats over multiple columns in a dataframe with date/time series?


I have a dataframe where the first column is a date/time series and the other 9 columns are sites with associated water temperature. I want to calculate the monthly summary stats for each column. The summary stats I want are, monthly mean with the +/- sd temp, monthly proportion of time > 20C & 23C, monthly maximum temperature. Here is a sample df of my data

# Create a, b, c, d variables
a <- c("06-25-20 08:00:00 AM","06-25-20 08:15:00 AM",
       "06-25-20 08:30:00 AM","06-25-20 08:45:00 AM",
       "07-25-20 08:45:00 AM", "07-25-20 08:45:00 AM",
       "08-25-20 08:45:00 AM", "08-25-20 08:45:00 AM",
       "09-25-20 08:45:00 AM","09-25-20 08:45:00 AM")
b <- c(4,5,8, "N/A", 4,5,"N/A",7,7,6)
c <- c(6,10,8, "N/A", 8,5,"N/A",8,7,2)
# Join the variables to create a data frame
df <- data.frame(a,b,c)
df$a = as.POSIXlt(df$a, format="%m-%d-%y%H:%M:%S", tz = 'EST')

I started just trying to get the monthly mean which works, but for the life of me I cannot get the table to also include Standard deviation, and all the other summary stats without making a bunch of extra code. Here is code I used for mean

Monthly_2020Temp = df %>% 
  group_by(a = format(as.Date(a), '%b-%Y')) %>%
  summarise_each(funs( if(length(na.omit(.))>=15)
    mean(., na.rm=TRUE) else NA_real_), 
    b:c)

Here is the code I tried to use when adding standard deviation

Monthly_2020Temp = df %>% 
  group_by(a = format(as.Date(a), '%b-%Y')) %>%
  summarise_each(funs( if(length(na.omit(.))>=15)
    mean(., na.rm=TRUE) else NA_real_), sd(., na.rm=TRUE) else NA_real_), 
    b:c)

but I get an error

Error: unexpected 'else' in:
"  summarise_each(funs( if(length(na.omit(.))>=15)
    mean(., na.rm=TRUE) else NA_real_), sd(., na.rm=TRUE) else"
>     b:c)
Error: unexpected ')' in "    b:c)"

Can someone help me make a nice table of summary stats for my time series data?


Solution

  • It's best to get the data into "long" format when using dplyr summarize. Here's one possible approach to get you started:

    df$b <- as.numeric(df$b)
    df$c <- as.numeric(df$c)
    
    df %>% pivot_longer(-c(a)) %>% 
      mutate(month = lubridate::month(a), year = lubridate::year(a)) %>% 
      group_by(month, year) %>% 
      summarize(avg = mean(value, na.rm = TRUE), sd =    sd(value, na.rm = TRUE))