Search code examples
rdataframeindexingsum

R: sum columns using conditionals and row, column indexes


I have a dataframe where there are many, many NA values interspersed among the columns of interest. Here is a sample:

dput(try[, 30:34])
structure(list(adj_avg_2016 = c(25641966.4, 28919805, 44152549.8, 
4341304.9, 10588244.8, 4928398.6, 8985362.4, 832000, 2674875.2, 
20416846.8), adj_avg_2017 = c(27693728, 29815155, 46187854.8, 
4594589.9, 11021124.8, 5075010, 9436240.8, 848640, 2782821.6, 
21439365.4), adj_avg_2018 = c(31554678.4, 30710505, 45952664, 
NA, NA, NA, NA, 865920, 2894427.2, 21556703.6), adj_avg_2019 = c(33454832, 
31605855, 45590832, NA, 12319764.8, NA, NA, 900480, NA, 
21556703.6), adj_avg_2020 = c(35122520, 32501205, 46495412, NA, 
NA, NA, NA, NA, 3130445.6, 22520553.1)), class = "data.frame", row.names = c(NA, 
10L))

What I would like to do is sum each of the columns in the dataframe try based on a condition using the column index.

More specifically, I would like to look at the previous year column and sum using only the elements that are non-NA in both the current and previous year column. For example

  • For col adj_avg_2016, I want to sum all elements because it is the first column in the group
  • For col adj_avg_2017, I want to sum all elements because there are no NA values in it or in the adj_avg_2016 column.
  • For col adj_avg_2018, I want to sum using only the corresponding entries that aren't NA in either adj_avg_2017 or adj_avg_2018 (so entries 1:3, 8:10)
  • For col adj_avg_2019, I want to sum using entries 1:3, 8 & 10 because entry 5 in adj_avg_2018 is NA and entry 9 in adj_avg_2019 is NA
  • For col adj_avg_2020 I would like to sum using only entries 1:3 & 10 because entry 9 is empty in adj_avg_2019.

My desired result are columns for each year, agg_pass1_2016...agg_pass1_2020 containing the specified calculations.

Since the dataframe will continue to grow each year and I will be doing this quite a bit a function seems like the way to go. Here is what I came up with so far:

str <- function(df, overwrite = TRUE){
    nms <- grep("adj_avg_[0-9]+$", names(df), value = TRUE) # get the columns
    nms_col_idx <- grep("adj_avg_[0-9]+$", names(df), value = FALSE) # get the column indexes
    revnms <- gsub("adj_avg", "agg_pass1", nms) # names of new columns I want to add
    if(!overwrite) revnms <- setdiff(revnms, names(df))
    df[, revnms] <- lapply(df[, nms], sum, ....) # This is where I get confused.
}

My problems/questions:

  1. What is the best way to incorporate the row and column indexes into df[, revnms] in the above function to achieve the desired calculations? I know one way would involve some combination of which(!is.na(.)) but am unsure of how to put that into my function in the last row.

  2. Is there another function that could achieve my results as quickly and simply as a function?

All comments/suggestions are welcome. Thanks a lot.


Solution

  • Based on the logic, we may need to multiply with the logical matrix created with the subset of data without the first column and the last column to and get the colSums

    colSums(cbind(try[1], try[-1] * (!is.na(try[-1]) & 
             !is.na(try[-ncol(try)]))), na.rm = TRUE)
    

    -output

    adj_avg_2016 adj_avg_2017 adj_avg_2018 adj_avg_2019 adj_avg_2020 
       151481354    158894530    133534898    133108703    136639690 
    

    Or using tidyverse

    library(dplyr)
    library(stringr)
    try %>%
        summarise(across(everything(),
         ~ if(cur_column() == 'adj_avg_2016') sum(., na.rm = TRUE) 
         else sum(.x[!is.na(.x) &  
          !is.na(get(str_replace(cur_column(), "(\\d+)$",
           function(x) as.numeric(x) - 1)))])))
      adj_avg_2016 adj_avg_2017 adj_avg_2018 adj_avg_2019 adj_avg_2020
    1    151481354    158894530    133534898    133108703    136639690
    

    If we want to create new columns, use mutate instead of summarise and also make use of .names

    try %>%
        mutate(across(everything(),
         ~ if(cur_column() == 'adj_avg_2016') sum(., na.rm = TRUE) 
         else sum(.x[!is.na(.x) &  
          !is.na(get(str_replace(cur_column(), "(\\d+)$",
           function(x) as.numeric(x) - 1)))]), 
          .names = "{str_replace(.col, 'avg', 'pass')}"))
    

    -output

       adj_avg_2016 adj_avg_2017 adj_avg_2018 adj_avg_2019 adj_avg_2020 adj_pass_2016 adj_pass_2017 adj_pass_2018 adj_pass_2019 adj_pass_2020
    1      25641966     27693728     31554678     33454832     35122520     151481354     158894530     133534898     133108703     136639690
    2      28919805     29815155     30710505     31605855     32501205     151481354     158894530     133534898     133108703     136639690
    3      44152550     46187855     45952664     45590832     46495412     151481354     158894530     133534898     133108703     136639690
    4       4341305      4594590           NA           NA           NA     151481354     158894530     133534898     133108703     136639690
    5      10588245     11021125           NA     12319765           NA     151481354     158894530     133534898     133108703     136639690
    6       4928399      5075010           NA           NA           NA     151481354     158894530     133534898     133108703     136639690
    7       8985362      9436241           NA           NA           NA     151481354     158894530     133534898     133108703     136639690
    8        832000       848640       865920       900480           NA     151481354     158894530     133534898     133108703     136639690
    9       2674875      2782822      2894427           NA      3130446     151481354     158894530     133534898     133108703     136639690
    10     20416847     21439365     21556704     21556704     22520553     151481354     158894530     133534898     133108703     136639690