Search code examples
rdplyrgroup-bycountmode

R number of grouped observations equal to the mode (by group) over time


I'm looking at how value X changes in grouped data over time (t). X starts at zero for most observations and then changes randomly as time progresses. At each time-point, I'd like to find out how many obeservations from each group have the mode value for that group - ideally excluding the zero values. The data looks something like below but with more, bigger groups and more t columns.

     group_name t1 t10 t50 t100
1            s3  0 259 187  122
2            s1 29  25  23   15
3            s3  0 259  23  122
4            s2  0  36  24   15
5            s1 29  25  23   15
6            s2  0  32  24   15

Ultimately, I would like to plot how many observations have the mode value for their respective group as a function of t but I don't know how to crunch the data for this with efficient R code.

I've seen there are several ways to calculate the mode for each group at a single time-point (e.g. here) but I don't know how to adapt these to calculate the number equal to that mode, or which would be the most efficient method to scale up for multiple t columns.

Thanks for any suggestions!


Solution

  • We could group by 'group_name', and summarise across the rest of the columns (everything()) by applying the Mode function on a subset of rows by excluding the 0 values (.[. != 0]), create a logical vector (==) with the elements of the column and get the sum to find the frequency for each column by the grouping variable

    library(dplyr)
    df1 %>%
        group_by(group_name) %>%
        summarise(across(everything(), ~ sum(Mode(.[. !=0]) == ., na.rm = TRUE)))
    # A tibble: 3 x 5
    #  group_name    t1   t10   t50  t100
    #  <chr>      <int> <int> <int> <int>
    #1 s1             2     2     2     2
    #2 s2             0     1     2     2
    #3 s3             0     2     1     2
    

    Or using data.table

    library(data.table)
    setDT(df1)[, lapply(.SD, function(x) sum(Mode(x[x != 0]) == x, na.rm = TRUE)),
                 by = group_name]
    

    where

    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }
    

    If we need to calculate across the 't' columns, reshape to 'long' format (pivot_longer), filter out the 0 values, grouped by 'group_name', summarise with the frequency of 'Mode' values

    library(tidyr)
    df1 %>% 
      pivot_longer(cols = starts_with('t')) %>%
      filter(value != 0) %>% 
      group_by(group_name) %>% 
      summarise(n_Mode = sum(Mode(value) == value))