Search code examples
rcumulative-sum

Looking for a way of creating a cumulative average for individuals within a dataset


I have a dataset with multiple individuals (Cow_ID) and their daily difference in temperature looking like this:

df<-data.frame("Cow_ID"=c("192","192","192","192","201","201","201","201","652","652","652","652"),"Day"=c("1","2","3","4","1","2","3","4","1","2","3","4"),"Temp_Diff"=c("0.2","0.1","na","0.8","0.4","0.1","0.7","na","0.1","0.3","na","0.8"))

I want to calculate a cumulative mean for each individual in a new column. I need it to exclude the na's too.

I managed to get some code working to run this for the whole data set, but I could not find a way to get it to stop at the end of an individual and start again for the next one.

df<-df%>%mutate(new=replace(Temp_Diff,!is.na(Temp_Diff),cummean(Temp_Diff[!is.na(Temp_Diff)])))%>%fill(new)

Please excuse the sloppy coding, I am very new to this!


Solution

  • If it is by 'Cow_ID', then do a group by on that column. Also, make sure to convert the "na" to NA before applying the is.na as well as convert the type as the 'Temp_Diff' column was character class as a result of "na" character element which doesn't work with cummean

    library(dplyr)
    df %>% 
      na_if("na") %>%
      type.convert(as.is = TRUE) %>% 
      group_by(Cow_ID) %>% 
      mutate(new = replace(Temp_Diff, !is.na(Temp_Diff), 
          cummean(Temp_Diff[!is.na(Temp_Diff)]))) %>%
      ungroup()
    

    -output

    # A tibble: 12 × 4
       Cow_ID   Day Temp_Diff    new
        <int> <int>     <dbl>  <dbl>
     1    192     1       0.2  0.2  
     2    192     2       0.1  0.15 
     3    192     3      NA   NA    
     4    192     4       0.8  0.367
     5    201     1       0.4  0.4  
     6    201     2       0.1  0.25 
     7    201     3       0.7  0.4  
     8    201     4      NA   NA    
     9    652     1       0.1  0.1  
    10    652     2       0.3  0.2  
    11    652     3      NA   NA    
    12    652     4       0.8  0.4  
    

    Or using data.table, convert the 'data.frame' to 'data.table' (setDT), conver the 'Temp_Diff' to numeric column with as.numeric returns NA for "na" while doing the force conversion, then create the 'new' column by specifying a logical condition in i to subset only the non-NA elements, do the cummean on 'j', grouped by 'Cow_ID' and assign (:=) to a new column 'new'

    library(data.table)
     setDT(df)[, Temp_Diff := as.numeric(Temp_Diff)][!is.na(Temp_Diff), 
          new := cummean(Temp_Diff), by = Cow_ID]