I have a set of regressors selected by an optimal variable selection algorithm (out of a total of 30 candidate regressor columns) run on a dataset that looks like the following:
> IVs_Selected_by_LASSO
$coefficients
[1] "X4" "X10" "X19" "X20" "X22" "X25" "X27" "X28"
And, I also have the set of actual regressors included in the structural equation describing that dataset, which initially looks like the following:
True_IVs
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1
X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
1 0 1 0 0 1 1 1 1 0 1
dput(True_IVs)
structure(list(X1 = "0", X2 = "0", X3 = "0", X4 = "1", X5 = "0",
X6 = "0", X7 = "0", X8 = "0", X9 = "0", X10 = "1", X11 = "0",
X12 = "0", X13 = "1", X14 = "0", X15 = "0", X16 = "0", X17 = "0",
X18 = "0", X19 = "1", X20 = "1", X21 = "0", X22 = "1", X23 = "0",
X24 = "0", X25 = "1", X26 = "1", X27 = "1", X28 = "1", X29 = "0",
X30 = "1"), row.names = 1L, class = "data.frame")
However, using the answer to an intermediate question I asked whose answer I was hoping would allow me to finally ask this current question I am asking, I was able to transform the true structural regressors into a format that is able to be logically compared with the format IVs_Selected_by_LASSO is already in as follows:
True_Regressors <- names(True_IVs)[True_IVs == 1]
> True_Regressors
[1] "X4" "X10" "X13" "X19" "X20" "X22" "X25" "X26" "X27" "X28" "X30"
Me being an experienced Excel user, definitely with more experience there than in R, if I had columns or rows with this information in them, I would probably use either a SUMIF or an IF function (with an AUTOSUM at the bottom of the column as you do) to solve this, but I can't figure out how to do the same in R for the life of me.
For example, if I were to do the IF+AUTOSUM solution and pretending that the output of printing True_IVs is in my A column in Excel with IVs_Selected_by_LASSO in the B column, I would use
=IF(A1 = B1, 1, 0)
And since A1 would be X4 and so would B1, if I put this IF function next to them in the C column, I would get a 1 in cell C1. And after applying the AUTOSUM function at the bottom of column C, I would get 8.
Thus, for this example, I would want to get a function which returns 8 because True_IVs contains all 8 of the variables selected by the LASSO Regression. Furthermore, for this example, the maximum number possible would be 11 since True_IVs contains 11 variable names, so clearly, IVs_Selected_by_LASSO can only hope to obtain a maximum of 11 matches.
You have multiple options as mentioned by others and myself in the comments, but the issue here seems to be a disconnect between Excel logic and R logic, so I will try to provide more detailed insight as a formal answer.
As you know, your excel code:
=IF(A1 = B1, 1, 0)
Will check the single cell in A1
to see if it is equal to the single cell B1
, and that can be copied down the line.
If take the vector of column names of your data frame True_IVs
(which has 30 columns) via names(True_IVs)
and think of it as all 30 column names a being copied and pasted into a column in excel, we can check those names against another vector of names - here IVs_Selected_by_LASSO$coefficients
. Except instead of one-cell-vs-one-cell we can check if the name appears in the whole vector by using %in%
:
names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients
Which checks if each value in names(True_IVs)
is in any of the values in IVs_Selected_by_LASSO$coefficients
. So, for instance, if the fist value in names(True_IVs)
is anywhere in IVs_Selected_by_LASSO$coefficients
, it will return TRUE
, else it will return FALSE
. Running this across the whole vector returns a boolean vector of length 30:
# [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
# [25] TRUE FALSE TRUE TRUE FALSE FALSE
The values of which corresponds to if each element in the names(True_IVs)
is found anywhere in the IVs_Selected_by_LASSO$coefficients
vector.
Secondly, as noted in the comments, when summing R assumes TRUE == 1
and FALSE == 0
, so wrapping this in a sum
will return the total number included:
sum(names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients)
# [1] 8
This provides more flexibility, particularly if the values are out of order. For instance, if we scramble the values to a random order:
IVs_Selected_by_LASSO$coefficients <- sample(IVs_Selected_by_LASSO$coefficients, length(IVs_Selected_by_LASSO$coefficients))
We get the same result:
sum(names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients)
# [1] 8
There are of course multiple other ways to do this, for instance @G. Grothendieck mentions elegantly using length
and intersect
:
length(intersect(IVs_Selected_by_LASSO$coefficients, names(True_IVs))) # Thanks G. Grothendieck
# [1] 8
Or more inelegantly using length
with names(True_IVs)
instead of sum
:
length(names(True_IVs)[names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients])
# [1] 8
Hope this helps!