Search code examples
rdataframefunctionfunctional-programmingiteration

Using apply for calculating subscale and total scores across multiple dataframes


I would like to have two functions for automatic calculation of subscale and total scores across multiple dataframes which resemble datasets for various timepoints. I have considered various similar questions on SO, but haven't found a proper solution yet.

I managed to do the calculations manually, however, I am struggling in upscaling the calculation of subscale scores and total scores (from the subscale scores) for other timepoints available from other dfs using the apply function - I hope lapply is the correct one for this purpose.

Some random data to demonstrate the problem:

set.seed(1)
df1 <- data.frame(matrix(sample(32), ncol = 8))
names(df1) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")

set.seed(2)
df2 <- data.frame(matrix(sample(32), ncol = 8))
names(df2) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")

To account for potential NAs and the varying corresponding number of valid data, the manual calculation of the subscale and total scores looks as follows. For the calculation of the total score, I am also referring to rowSums, since in the real data, there are more than two subscales which constitute the total score and the subscale scores are next to each other within each data.frame.

df1$sub1 <- rowSums(subset(df1, select=a1:a4), na.rm = TRUE) * ncol(subset(df1, select=a1:a4)) / 
rowSums(!is.na(subset(df1, select=a1:a4)))

df1$sub2 <- rowSums(subset(df1, select=b1:b4), na.rm = TRUE) * ncol(subset(df1, select=b1:b4)) / 
rowSums(!is.na(subset(df1, select=b1:b4)))

df1$total <- rowSums(subset(df1, select=sub1:sub2))

df1
df2

My idea on trying to iterate over multiple dataframes, was the following:

#Set up a list for the dfs 
dflist <- list(df1, df2)

#Define columns for subscale and total score calculation within each df 
subrange <- list(select(dflist, c(a1:a4, b1:b4)))
totalrange <- list(select(dflist, c(sub1, sub2)))

That’s where the trouble starts – it returns a request asking for a selection

#Set up functions for the subscale scores and total scores 
subscalefun <- function() { 
rowSums(subset(dflist, select=subrange), na.rm = TRUE) * ncol(subset(dflist, select= subrange)) / 
rowSums(!is.na(subset(dflist, select= subrange)))
}

totalfun <- function() {
rowSums(subset(dflist, select=totalrange))
}

These functions are just thought as an approach to show what I try to accomplish. I am sure there should also be a paste argument included to be writing the results to the respective df.

#Using lapply for calculation of subscale and total scores across dfs defined in dflist
lapply (dflist, subscalefun)
lapply (dflist, totalfun)

Some assistance on how to approach this task would be highly appreciated. Maybe someone also can give a good advice on how to improve in functional programming (i.e. getting from simple functions frequently introduced in tutorials to programming more complex, custom functions and obtaining the proper "vocabulary" for that).


Solution

  • Translating code to functions is easier for me to start off with mirroring the original code. So the code you would start off with would be:

    DF$sub1 <- rowSums(...)
    DF$sub2 <- rowSums(...)
    DF$total <- rowSums(...)
    

    You were on the right track with the idea of lapply(). I'm going to use an anonymous function within lapply():

    lapply(dflist
           , function(DF) {
             DF$sub1 <- rowSums(subset(DF, select = a1:a4), na.rm = TRUE)
             DF$sub2 <- rowSums(subset(DF, select = b1:b4), na.rm = TRUE)
             DF$total <- rowSums(subset(DF, select=sub1:sub2))
    
             return(DF)
           } 
           )
    
    [[1]]
      a1 a2 a3 a4 b1 b2 b3 b4 sub1 sub2 total
    1  9  6 16 14 31 24 13 21   45   89   134
    2 12 25  2  8 15  3 19 22   47   59   106
    3 18 29  5 20 28  7  1 30   72   66   138
    4 27 17  4 32 11 23 26 10   80   70   150
    
    [[2]]
      a1 a2 a3 a4 b1 b2 b3 b4 sub1 sub2 total
    1  6 27 12 16 20 30  3 14   61   67   128
    2 22 26 13 28 19 29 17 25   89   90   179
    3 18  4 23  8  7  9 31 24   53   71   124
    4  5 21 32 15  1  2 10 11   73   24    97
    

    This does not modify anything so you would have to do dflist <- lapply(dflist, ...) if you would want to keep it saved.

    One thing that isn't great about this approach is that we'd have to copy and paste a1:a4 for however many letters there are in your dataset. Since the pattern is [letter][number], we can look at the unique first characters in the dataset:

    starting_letters <- unique(substring(names(df2), 1, 1))
    starting_letters
    [1] "a" "b"
    

    And we can loop through the starting_letters vector to do get the subtotals with grep giving the column numbers that match the starting_letters:

    lapply(starting_letters, function(nam) rowSums(df2[, grep(nam, names(df2))], na.rm = T))
    
    [[1]]
    [1] 61 89 53 73
    
    [[2]]
    [1] 67 90 71 24
    

    We can also determine how many sub# there are going to be based on the length of the starting_letters vector:

    subm_names <- paste0("sub", seq_len(length(starting_letters)))
    subm_names
    [1] "sub1" "sub2
    

    And putting it all together:

    lapply(dflist
           , function(DF) {
             start_letters <- unique(substring(names(DF), 1, 1))
             sub_names <- paste0("sub", seq_len(length(start_letters)))
             DF[sub_names] <- lapply(start_letters
                                     , function(let) {
                                       match_names <- grep(let, names(DF))
                                       rowSums(DF[, match_names], na.rm = T) / length(match_names) * rowSums(!is.na(DF[, match_names]))
                                     }
             )
             # DF[sub_names] <- lapply(start_letters
                                    # , function(nam) rowSums(DF[, grep(nam, names(DF))], na.rm = T))
             DF$total <- rowSums(DF[sub_names])
    
             # DF$sub1 <- rowSums(subset(DF, select = a1:a4), na.rm = TRUE)
             # DF$sub2 <- rowSums(subset(DF, select = b1:b4), na.rm = TRUE)
             # DF$total <- rowSums(subset(DF, select=sub1:sub2))
             return(DF)
           } 
           )
    

    The advantages of this approach is that it is more dynamic. If one data.frame in the list only as the a group, it wouldn't error out. Similarly, it will scale up to data.frames with more letter groupings or number groupings.