Search code examples
rlogical-operatorscountingboolean-logic

Creating the equivalent of a SUMIF function in Excel


I have a set of regressors selected by an optimal variable selection algorithm (out of a total of 30 candidate regressor columns) run on a dataset that looks like the following:

> IVs_Selected_by_LASSO
$coefficients
[1] "X4"  "X10" "X19" "X20" "X22" "X25" "X27" "X28"

And, I also have the set of actual regressors included in the structural equation describing that dataset, which initially looks like the following:

True_IVs
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
1  0  0  0  1  0  0  0  0  0   1   0   0   1   0   0   0   0   0   1   1
  X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
1   0   1   0   0   1   1   1   1   0   1


dput(True_IVs)
structure(list(X1 = "0", X2 = "0", X3 = "0", X4 = "1", X5 = "0", 
    X6 = "0", X7 = "0", X8 = "0", X9 = "0", X10 = "1", X11 = "0", 
    X12 = "0", X13 = "1", X14 = "0", X15 = "0", X16 = "0", X17 = "0", 
    X18 = "0", X19 = "1", X20 = "1", X21 = "0", X22 = "1", X23 = "0", 
    X24 = "0", X25 = "1", X26 = "1", X27 = "1", X28 = "1", X29 = "0", 
    X30 = "1"), row.names = 1L, class = "data.frame")

However, using the answer to an intermediate question I asked whose answer I was hoping would allow me to finally ask this current question I am asking, I was able to transform the true structural regressors into a format that is able to be logically compared with the format IVs_Selected_by_LASSO is already in as follows:

True_Regressors <- names(True_IVs)[True_IVs == 1]
> True_Regressors 
[1] "X4"  "X10" "X13" "X19" "X20" "X22" "X25" "X26" "X27" "X28" "X30"

Me being an experienced Excel user, definitely with more experience there than in R, if I had columns or rows with this information in them, I would probably use either a SUMIF or an IF function (with an AUTOSUM at the bottom of the column as you do) to solve this, but I can't figure out how to do the same in R for the life of me.

For example, if I were to do the IF+AUTOSUM solution and pretending that the output of printing True_IVs is in my A column in Excel with IVs_Selected_by_LASSO in the B column, I would use

=IF(A1 = B1, 1, 0)

And since A1 would be X4 and so would B1, if I put this IF function next to them in the C column, I would get a 1 in cell C1. And after applying the AUTOSUM function at the bottom of column C, I would get 8.

Thus, for this example, I would want to get a function which returns 8 because True_IVs contains all 8 of the variables selected by the LASSO Regression. Furthermore, for this example, the maximum number possible would be 11 since True_IVs contains 11 variable names, so clearly, IVs_Selected_by_LASSO can only hope to obtain a maximum of 11 matches.


Solution

  • You have multiple options as mentioned by others and myself in the comments, but the issue here seems to be a disconnect between Excel logic and R logic, so I will try to provide more detailed insight as a formal answer.

    As you know, your excel code:

    =IF(A1 = B1, 1, 0)
    

    Will check the single cell in A1 to see if it is equal to the single cell B1, and that can be copied down the line.

    If take the vector of column names of your data frame True_IVs (which has 30 columns) via names(True_IVs) and think of it as all 30 column names a being copied and pasted into a column in excel, we can check those names against another vector of names - here IVs_Selected_by_LASSO$coefficients. Except instead of one-cell-vs-one-cell we can check if the name appears in the whole vector by using %in%:

    names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients
    

    Which checks if each value in names(True_IVs) is in any of the values in IVs_Selected_by_LASSO$coefficients. So, for instance, if the fist value in names(True_IVs) is anywhere in IVs_Selected_by_LASSO$coefficients, it will return TRUE, else it will return FALSE. Running this across the whole vector returns a boolean vector of length 30:

    # [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
    # [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE
    # [25]  TRUE FALSE  TRUE  TRUE FALSE FALSE
    

    The values of which corresponds to if each element in the names(True_IVs) is found anywhere in the IVs_Selected_by_LASSO$coefficients vector.

    Secondly, as noted in the comments, when summing R assumes TRUE == 1 and FALSE == 0, so wrapping this in a sum will return the total number included:

    sum(names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients)
    # [1] 8
    

    This provides more flexibility, particularly if the values are out of order. For instance, if we scramble the values to a random order:

    IVs_Selected_by_LASSO$coefficients <- sample(IVs_Selected_by_LASSO$coefficients, length(IVs_Selected_by_LASSO$coefficients))
    

    We get the same result:

    sum(names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients)
    # [1] 8
    

    There are of course multiple other ways to do this, for instance @G. Grothendieck mentions elegantly using length and intersect:

    length(intersect(IVs_Selected_by_LASSO$coefficients, names(True_IVs))) # Thanks G. Grothendieck
    # [1] 8
    

    Or more inelegantly using length with names(True_IVs) instead of sum:

    length(names(True_IVs)[names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients])
    # [1] 8
    

    Hope this helps!