I have a data frame df and want to use the function range_frac to perform an operation.
set.seed(137)
df <- data.frame(col1 = sample(LETTERS, 100, TRUE),
                 col2 = sample(-75:75, 100, TRUE),
                 col3 = sample(-75:75, 100, TRUE))
df$col2[c(23, 48, 78)] <- NA
df$col3[c(37, 68, 81)] <- NA
range_frac <- function(n, my_df, my_var) {
  len = sum(my_df[my_var] < n, na.rm = TRUE)
  len
}
I want to know the number of rows satisfying this condition (value less than n) in col2 and col3 separately. Since I could not get passing the column name to work, I passed the column indices (2, 3) instead. However, when I pass a vector for my_var, the result is the sum of the outputs for the individual values. Why does this happen?
sapply(1:3, range_frac, my_df = df, my_var = 2)
[1] 57 57 57
sapply(1:3, range_frac, my_df = df, my_var = 3)
[1] 51 51 52
sapply(1:3, range_frac, my_df = df, my_var = 2:3)
[1] 108 108 109
Could someone provide an explanation behind the result from the third operation (i.e., 57+51, 57+51, 57+52)?
(Basically, I am trying to achieve the following output in a dplyr-summarise way, but I am stuck at this point and thought I would first clear up my understanding of this concept.)
n col2 col3
1 57 51
2 57 51
3 57 52
Update: My original question was unclear, so I am adding more information. The explanation is as follows: for each n, the result can be understood as the evaluation of the expression
sum(df[,2:3] < n, na.rm = TRUE)
over both columns together, and not separately for columns 2 and 3.
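To see why the counts are pooled, it may help to look at what the comparison returns on a small example. The sketch below uses a hypothetical three-row data frame `toy` (not part of the original data): comparing several columns against a number yields a logical matrix, and `sum()` counts every `TRUE` in it regardless of which column it came from.

```r
# Hypothetical toy data, used only to illustrate the mechanism
toy <- data.frame(id = c("a", "b", "c"),
                  x  = c(-1, 5, NA),
                  y  = c(-2, -3, 4))

# Comparing two columns at once returns a 3 x 2 logical matrix
m <- toy[2:3] < 0

# sum() treats the matrix as one long vector, so the columns are pooled
total <- sum(m, na.rm = TRUE)                  # 1 (from x) + 2 (from y) = 3

# Counting each column on its own gives the separate values
per_col <- c(sum(toy[2] < 0, na.rm = TRUE),    # 1
             sum(toy[3] < 0, na.rm = TRUE))    # 2
```

Here `total` equals the sum of the two entries in `per_col`, which is the same pattern as 57+51 in the question.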
If you pass 2:3 to my_var, range_frac() actually executes
sum(df[2:3] < n, na.rm = TRUE)
for each n. Naturally, you get the total number of elements less than n in the second and third columns combined. One solution is to vectorize the argument my_var, i.e.
sapply(1:3, Vectorize(range_frac, "my_var"), my_df = df, my_var = 2:3)
#      [,1] [,2] [,3]
# [1,]   48   48   48
# [2,]   49   51   51