Search code examples
rpartition

Average by Prod if null in R


df <- structure(list(Month = c(1, 2, 3, 1, 2, 3), 
Product = c('A', 'A','A','B','B','B'),
Amount1 = c(100,NA,200,200,400,NA),
calculated1 = c(100,150,200,200,400,300),
Amount2 = c(100,NA,300,200,NA,500),
calculated2 = c(100,200,300,200,350,500)),         
.Names=c("Month","Product","Amount1","Calculated1",
"Amount2","Calculated2"),row.names = c(NA, 6L), class = "data.frame")

I need your help on the above sample problem. I have month, product, Amount1 and Amount2 columns in R data table. Not all the months have amount information for all products as shown in above dataframe. if NA found(missing value) for any product, i want to calculate average of remaining months for respective products as shown in "Calculated" columns. Any help on this is highly appreciated, i am new learner.

Thank you!

I have tried filtering based on Amount1 and then calculate average and assign values to different data frame. Later use join to bring the values for nulls. But since i have 2 different Amount columns to work on filtering on one column would not help.


Solution

  • I used this sample data:

    df <- data.frame(Month = c(1, 2, 3, 1, 2, 3), 
                         Product = c('A', 'A','A','B','B','B'),
                         Amount1 = c(100,NA,200,200,400,NA),
                         calculated1 = c(100,150,200,200,400,300),
                         Amount2 = c(100,NA,300,200,NA,500),
                         calculated2 = c(100,200,300,200,350,500))
    

    The processing is as follows:

    library(dplyr)
    df %>% 
      group_by(Month, Product) %>% # grouping using both variables
      mutate(Calc1_M_P = mean(Amount1), # temporary variables (columns) are created
             Calc2_M_P = mean(Amount2)) %>% # mean is calculated so that if there are NAs the function returns NA
      ungroup() %>% 
      group_by(Product) %>% # The data.frame is grouped again using only Product
      mutate(Calc1_P = mean(Amount1, na.rm = T), # New tmp variables are created. NAs skipped by mean()
             Calc2_P = mean(Amount2, na.rm = T)) %>% 
      mutate(Calc1 = coalesce(Calc1_M_P, Calc1_P), # final variables are constructed. If first is present the econd is skipped otherwise the second is taken
             Calc2 = coalesce(Calc2_M_P, Calc2_P),
             Calc1_P = NULL, # drop tmp variables
             Calc2_P = NULL,
             Calc1_M_P = NULL,
             Calc2_M_P = NULL
                   )
    

    The above code returns:

    # A tibble: 6 × 8
    # Groups:   Product [2]
      Month Product Amount1 calculated1 Amount2 calculated2 Calc1 Calc2
      <dbl> <chr>     <dbl>       <dbl>   <dbl>       <dbl> <dbl> <dbl>
    1     1 A           100         100     100         100   100   100
    2     2 A            NA         150      NA         200   150   200
    3     3 A           200         200     300         300   200   300
    4     1 B           200         200     200         200   200   200
    5     2 B           400         400      NA         350   400   350
    6     3 B            NA         300     500         500   300   500
    

    You see that Calc1 == calculated1 and Calc2 == calculated2. So the result is as expected.