
sapply with multiple sets of arguments to a user-defined function


I have a data frame df and want to use the function range_frac to count the values that fall below a threshold n.

set.seed(137)
df <- data.frame(col1 = sample(LETTERS, 100, TRUE), 
                 col2 = sample(-75:75, 100, TRUE), 
                 col3 = sample(-75:75, 100, TRUE))

df$col2[c(23, 48, 78)] <- NA
df$col3[c(37, 68, 81)] <- NA


# Count the entries of my_df[my_var] that are less than n, ignoring NAs
range_frac <- function(n, my_df, my_var) {

  len <- sum(my_df[my_var] < n, na.rm = TRUE)
  len
}

I want to count the rows that satisfy this condition in col2 and col3 separately. Because I could not get it to work with the column name, I passed the column indices (2 and 3) instead. However, when I pass a vector for my_var, the result is the sum of the outputs for the individual columns. Why does this happen?

sapply(1:3, range_frac, my_df = df, my_var = 2) 
[1] 57 57 57

sapply(1:3, range_frac, my_df = df, my_var = 3) 
[1] 51 51 52

sapply(1:3, range_frac, my_df = df, my_var = 2:3) 
[1] 108 108 109

Could someone provide an explanation behind the result from the third operation (i.e., 57+51, 57+51, 57+52)?

(Basically, I am trying to produce the following output in a dplyr summarise way, but I am stuck at this point and want to make sure I understand this concept first.)

n col2 col3
1 57 51
2 57 51
3 57 52

Update: My original question was unclear, so I am adding more information. The explanation is as follows:

For each n, the call evaluates the expression sum(df[, 2:3] < n, na.rm = TRUE) over both columns together, not separately for columns 2 and 3.
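
To make the pooling concrete, here is a minimal sketch on a small hand-made data frame. The names toy, cmp, pooled and per_col are made up for illustration and are not part of the question's data. It shows that comparing several columns at once yields a single logical matrix, that sum() counts every TRUE in it, and that colSums() counts per column.

```r
# Hypothetical toy data (not the df from the question), small enough to check by hand
toy <- data.frame(id = c("a", "b", "c", "d"),
                  x  = c(-1, 0, 5, NA),
                  y  = c(-3, 2, NA, 4))

n <- 1

# Comparing a two-column data frame with a number gives one logical matrix
cmp <- toy[2:3] < n

# sum() counts every TRUE in that matrix, so both columns are pooled
pooled <- sum(cmp, na.rm = TRUE)

# colSums() counts the TRUEs column by column instead
per_col <- colSums(cmp, na.rm = TRUE)

print(pooled)   # 3: two values below 1 in x plus one in y
print(per_col)  # x = 2, y = 1
```

The same mechanism explains 108 = 57 + 51 in the question: the counts for col2 and col3 are added because sum() never sees the column boundary.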


Solution

  • If you pass 2:3 as my_var, range_frac() actually executes

    sum(df[2:3] < n, na.rm = TRUE)
    

    for each n, so you get the total number of elements less than n across the second and third columns combined. One solution is to vectorize the function over my_var, i.e.

    sapply(1:3, Vectorize(range_frac, "my_var"), my_df = df, my_var = 2:3)
    
    #      [,1] [,2] [,3]
    # [1,]   48   48   48
    # [2,]   49   51   51
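
The Vectorize() result above has one row per element of my_var and one column per n, so it is the transpose of the table asked for in the question. As a possible alternative, the sketch below builds a table with one row per n using base R rather than dplyr. The helper count_below and the toy data frame are hypothetical names introduced for this example; the approach assumes that swapping sum() for colSums() is acceptable.

```r
# Hypothetical toy data (not the df from the question), small enough to check by hand
toy <- data.frame(id = c("a", "b", "c", "d"),
                  x  = c(-1, 0, 5, NA),
                  y  = c(-3, 2, NA, 4))

# Hypothetical variant of range_frac(): one count per selected column
count_below <- function(n, my_df, my_var) {
  colSums(my_df[my_var] < n, na.rm = TRUE)
}

# sapply() returns one column per n; t() turns that into one row per n
counts <- t(sapply(1:3, count_below, my_df = toy, my_var = c("x", "y")))

# Attach n as the first column to get the layout from the question
result <- data.frame(n = 1:3, counts)
print(result)
#   n x y
# 1 1 2 1
# 2 2 2 1
# 3 3 2 2
```

Applied to the question's data, the call would be count_below with my_df = df and my_var = c("col2", "col3"). Column names work here as well as indices, because my_df[my_var] accepts either.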