Search code examples
rsumfrequency

R: Count number of rows per model excluding certain variables if present


I have a table which looks like this:

modelsummary <- data.frame(term = c("(Intercept)", "month1", "month2", "RateDiff", "var1", "var2", "var3", "(Intercept)", "month1", "var1", "var2", "var3"), mod_id = c(1,1,1,1,1,1,1,2,2,2,2,2))

I want to count number of variables apart from intercept, month, ratediff in each model. My desired output is:

modelsummary <- data.frame(term = c("(Intercept)", "month1", "month2", "RateDiff", "var1", "var2", "var3", "(Intercept)", "month1", "var1", "var2", "var3"), mod_id = c(1,1,1,1,1,1,1,2,2,2,2,2), variables = c(3,3,3,3,3,3,3,3,3,3,3,3))

I tried to get a flag using:

modelsummary$dim <- apply(modelsummary[, "term"], MARGIN = 1, 
                  function(x) sum(!(x %in% c(grep("month", x), "RateDiff")), na.rm = T))

but the grep(month) doesn't work.

modelsummary$dim <- apply(modelsummary[, "term"], MARGIN = 1, 
                  function(x) sum(!(x %in% c("month", "RateDiff")), na.rm = T))

This works, but month followed by the suffix isn't captured.

I want something equivalent to ~ilike~ from sql on the variables intercept, month and RateDiff, as I do not want it to be case sensitive and want to allow for suffix and prefix on the variables. How could I achieve this?


Solution

  • Here's one way with dplyr -

    modelsummary %>% 
      mutate(
        variables = term[!grepl(pattern = "intercept|month|ratediff", tolower(term))] %>% 
          n_distinct()
      )
    
              term mod_id variables
    1  (Intercept)      1         3
    2       month1      1         3
    3       month2      1         3
    4     RateDiff      1         3
    5         var1      1         3
    6         var2      1         3
    7         var3      1         3
    8  (Intercept)      2         3
    9       month1      2         3
    10        var1      2         3
    11        var2      2         3
    12        var3      2         3
    

    Or with dplyr and stringr:

    modelsummary %>%
      mutate(
        variables = str_subset(tolower(term), "intercept|month|ratediff", TRUE) %>% 
          n_distinct()
      )
    

    Add group_by(mod_id) before the mutate if you want to count the number of variables per mod_id.

    In base R -

    modelsummary$variables <- with(modelsummary, 
                   term[!grepl(pattern = "intercept|month|ratediff", tolower(term))] %>% 
                   unique() %>% length()
                   )