Search code examples
rdplyrdata.tabledcast

Calculate different sets of functions passed as text strings for different columns by factor


In my R shiny App, I want to allow the user, by means of check boxes, to choose columns (instrument parameters in data), and for each of those which functions to apply (i.e. which statistics to apply):

Calculate for columns x,y,z

different functions passed as text strings, selected for that column,

do this per group (factor column with names, called 'category' here.) and

name the resulting columns (name + function-name) in the output df

Selected checkboxes will produce a named list, where the name is the column (parameter name in df), and each named list element contains a vector of text string names of the functions to apply

My list of function requests would look like this:

functionlist <- list(c1 = c('mean', 'sum'), 
                     c2 = 'Length', 
                     c3 = c('Min', 'Max'), 
                     c4 = c('mean', 'sd', 'sum'))
  • mean = average value
  • Length as in count, or .N
  • sd or colSds as in standard deviation
  • sum as in total

Therefor I'm looking for a fast was to apply different sets of functions to different columns in 1 go and get the results in the form of a dataframe

Expected result: enter image description here

The data to apply it to would be comparable to this dummy data:

library(data.table)
n = 100000
dt  = data.table(index=1:100000,
                 category = sample(letters[1:25], n, replace = T),
                 c1=rnorm(n,10000),
                 c2=rnorm(n,1000),
                 c3=rnorm(n,100),
                 c4 = rnorm(n,10)
)

p.s. it is a follow up of this question: SO question, but now with the extra complication of applying different functions to different columns

UPDATE it would be even nicer if the user can also choose which groups to calculate for.


Solution

  • First I convert to lowercase, because some of these functions as written don't exist (at least in base R). Then I set the names so the output columns will have names. Next, I lapply over the vector of function names and apply them to get(.y), which is the name of the current element in functionlist, i.e. the column name.

    Some explanations for the less standard stuff here:

    • imap from tidyverse (purrr specifically) is like lapply, but instead of writing function(x) and referring to the element in your function body as x, you can just write ~ and use .x to refer to the element in the list and .y for the name of that element.
    • imap_dfc is like imap, but it cbinds all the results together as on data frame.
    • get(x) will search for an object whose name is the character string assigned to x and return that object. So if f and x are assigned as f <- 'mean' and x = 'c1', then get(f)(get(x)) is equivalent to mean(c1).

    library(tidyverse)
    
    imap_dfc(functionlist, ~{
      .x <- tolower(.x)
      .x <- setNames(.x, paste0(.y, '_', .x))
      dt[, lapply(.x, function(f) get(f)(get(.y)))
         , by = category][,-'category']
    })[, category := unique(dt$category)]
    
    #       c1_mean   c1_sum c2_length   c3_min   c3_max   c4_mean     c4_sd   c4_sum category
    #  1:  9999.988 39689953      3969 96.32998 103.3013  9.999057 1.0047397 39686.26        i
    #  2:  9999.992 40749969      4075 96.45056 103.4090  9.990428 1.0018953 40710.99        e
    #  3:  9999.980 39769919      3977 96.93850 103.6276 10.014546 1.0231273 39827.85        p
    #  4:  9999.989 40379955      4038 96.04255 103.6632  9.999367 0.9892175 40377.45        y
    #  5: 10000.008 39550031      3955 96.25407 103.7432 10.003377 1.0169810 39563.36        d
    #  6:  9999.997 38809990      3881 96.29265 104.3206  9.993976 1.0026220 38786.62        q
    #  7: 10000.021 39910082      3991 96.50937 103.4453  9.989248 0.9877364 39867.09        o
    #  8: 10000.028 39860113      3986 96.18319 103.8271  9.996190 0.9688054 39844.81        w
    #  9: 10000.006 39860025      3986 96.89095 103.8927 10.004848 1.0097102 39879.32        n
    # 10: 10000.018 39700073      3970 96.36530 103.4446 10.012726 1.0140592 39750.52        s
    # 11:  9999.997 39939988      3994 96.40691 103.7046  9.983768 1.0036815 39875.17        u
    # 12:  9999.985 39809941      3981 96.22220 104.0094 10.026315 1.0019427 39914.76        g
    # 13: 10000.013 40330054      4033 96.29462 103.8641  9.983023 0.9943337 40261.53        x
    # 14: 10000.003 39060012      3906 96.42303 103.7972  9.997054 0.9981013 39048.49        j
    # 15: 10000.002 39640008      3964 96.30402 103.8221 10.003373 1.0198045 39653.37        h
    # 16: 10000.003 40860011      4086 96.54114 103.4493  9.978961 0.9921391 40774.04        l
    # 17:  9999.978 40829908      4083 96.18487 103.7403  9.997847 1.0126861 40821.21        f
    # 18:  9999.964 39249859      3925 96.22323 103.6110  9.994997 0.9965463 39230.36        m
    # 19:  9999.964 39959858      3996 96.37931 103.4358 10.020087 1.0149939 40040.27        v
    # 20: 10000.014 40760057      4076 96.22407 104.0107 10.015623 0.9981900 40823.68        b
    # 21:  9999.998 40019993      4002 96.77786 103.7248 10.000020 1.0099821 40020.08        k
    # 22:  9999.981 41449920      4145 96.73103 103.8441 10.032406 1.0161685 41584.32        t
    # 23: 10000.010 39320039      3932 95.83859 103.2523  9.970763 0.9953307 39205.04        a
    # 24:  9999.984 40769934      4077 96.05744 103.3219  9.995989 1.0073559 40753.65        r
    # 25: 10000.001 39720003      3972 96.51861 103.3922  9.945475 0.9816170 39503.43        c
    #       c1_mean   c1_sum c2_length   c3_min   c3_max   c4_mean     c4_sd   c4_sum category