Search code examples
raggregateproportions

Proportion of non-zero values by multiple columns/groups


I have a question similar to Calculate proportion of positives values by group, but for grouping the average fraction by many columns. I'd like to get the proportion of non-zero values in "num" by "Year" and "season". Something that works for n# of columns, no matter where they are in the df in relation to each other.

My data:

> head(df)
# A tibble: 6 x 6
  Year  Month   Day Station   num season
  <fct> <dbl> <dbl>   <dbl> <dbl> <fct> 
1 2017      1     3     266     4 DRY   
2 2018      1     3     270     2 DRY   
3 2018      1     3     301     1 DRY   
4 2018      1     4     314     0 DRY   
5 2018      2     4     402     0 DRY   
6 2018      1     4     618     0 WET 

I thought something like this would work, but I get a warning message:

> aggregate(df$num>0~df[,c(1,6)],FUN=mean) # Average proportion of num > 0 per year & season

Error in model.frame.default(formula = env_subset$den > 0 ~ env_subset[,  : 
  invalid type (list) for variable 'env_subset[, c(1, 6)]'

Solution

  • With dplyr, I think this is what you want:

    library(dplyr)
    df %>% group_by(Year, season) %>%
      summarize(prop_gt_0 = mean(num > 0), .groups = "drop")
    # # A tibble: 3 × 3
    #    Year season prop_gt_0
    #   <int> <chr>      <dbl>
    # 1  2017 DRY          1  
    # 2  2018 DRY          0.5
    # 3  2018 WET          0  
    

    It's usually better to refer to columns by name rather than by number, so, as you say it works "no matter where they are in the df".

    You can still use aggregate--I prefer the formula interface for working with column names:

    aggregate(num ~ Year + season, data = df, FUN = \(x) mean(x > 0))
    #   Year season num
    # 1 2017    DRY 1.0
    # 2 2018    DRY 0.5
    # 3 2018    WET 0.0