R - calculating mode and percentages by mode and target


I'm trying to calculate the mode for numeric columns. Columns that are not numeric should get NA as a placeholder in the result vector. I also need percentages relative to a target column. Some example data:

c1= c("A", "B", "C", "C", "B", "C", "C") 
c2= factor(c(1, 1, 2, 2,1,2,1), labels = c("Y","N"))
d= as.Date(c("2015-02-01", "2015-02-03","2015-02-01","2015-02-05", "2015-02-03","2015-02-01", "2015-02-03"), format="%Y-%m-%d")
x= c(1,1,2,3,1,2,4) 
y= c(1,2,2,6,2,3,1) 
t= c(1,0,1,1,0,0,1)
df=data.frame(c1, c2, d, x, y,t) 
df

  c1 c2          d x y t
1  A  Y 2015-02-01 1 1 1
2  B  Y 2015-02-03 1 2 0
3  C  N 2015-02-01 2 2 1
4  C  N 2015-02-05 3 6 1
5  B  Y 2015-02-03 1 2 0
6  C  N 2015-02-01 2 3 0
7  C  Y 2015-02-03 4 1 1

I would need the mode for each numeric column:

mode=as.numeric(c("NA","NA", "NA", 1,2,1))
mode
[1] NA NA NA  1  2  1

and a vector with the proportion of rows with t==1 among the rows where the column value == mode:

[1] NA NA NA  0.33  0.33  

and a vector with the proportion of rows with t==1 among the rows where the column value != mode:

[1] NA NA NA  0.75  0.75

How could I calculate such vectors?

The best I have found for mode is:

library(plyr)

mode_fun <- function(x) {
  mode0 <- names(which.max(table(x)))
  if(is.numeric(x)) return(as.numeric(mode0))
  mode0
}
kdf_mode=apply(kdf,2, numcolwise(mode_fun))

But it gives an error if there are any non-numeric columns.


Solution

  • We can use sapply to loop over the columns of 'df' and apply mode_fun to each one to get the output vector ('v1'). An if/else condition returns NA for the non-numeric columns.

     v1 <- unname(sapply(df, function(x) if(!is.numeric(x)) NA else mode_fun(x)))
     v1
     #[1] NA NA NA  1  2  1
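
One property of mode_fun that may matter on other data: which.max returns the first maximum, and table orders the values ascending, so a tie resolves to the smallest value. A minimal sketch of that behaviour follows; mode_fun is copied from the question, while the vector `ties` and the character input are made up for illustration.

```r
# mode_fun as defined in the question
mode_fun <- function(x) {
  mode0 <- names(which.max(table(x)))
  if (is.numeric(x)) return(as.numeric(mode0))
  mode0
}

# Hypothetical input, not part of the original data: 2 and 5 both occur twice
ties <- c(5, 5, 2, 2, 9)

# table() orders the values as 2, 5, 9; which.max() picks the first maximum
tie_mode <- mode_fun(ties)

# For non-numeric input the mode comes back as a character string
chr_mode <- mode_fun(c("b", "a", "b"))

print(tie_mode)
print(chr_mode)
```

If a different tie rule is needed (for example, returning all tied values), mode_fun would have to be changed accordingly.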
    

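    For the second case (I assume the 6th column, 't', can be left out), we again loop over the columns of 'df' with sapply and an if/else condition. In the else branch, we build a logical vector that is TRUE where the column value equals the mode (mode_fun(x)==x). Combining it with t==1 via & marks the rows that equal the mode and also have t==1. The sum of that index divided by sum(v1), the number of rows at the mode, gives the proportion. Note that v1 inside the function is a local logical vector, not the mode vector from above, and that t is the standalone vector created in the question, so it must exist in the workspace (df$t holds the same values).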

    unname(sapply(df[-6], function(x) if(!is.numeric(x)) {
                NA
                } else {
                    v1 <- mode_fun(x)==x
                    sum(v1 & t==1)/sum(v1) 
      } ))
     #[1]        NA        NA        NA 0.3333333 0.3333333
    

    For the third case, we change the condition so that the logical index is TRUE where the column is not equal to the mode. The rest is the same as in the previous case.

    unname(sapply(df[-6], function(x) if(!is.numeric(x)){
             NA 
             } else {
                  v1 <- mode_fun(x)!=x
                  sum(v1 & t==1)/sum(v1)
       } ))
     #[1]   NA   NA   NA 0.75 0.75
    

    Once 'v1' is calculated, this can also be done without looping with sapply. We create a logical index ('indx') that is TRUE where the column is numeric and the column name is not 't'.

    indx <- sapply(df, is.numeric) &  names(df)!='t'
    

    We subset 'df' and 'v1' based on 'indx' (df[indx], v1[indx]) and make the lengths match by replicating the mode vector with col. The col function gives the numeric column index for every cell of df[indx], so v1[indx][col(df[indx])] repeats each column's mode once per row. Comparing the subset dataset with that vector gives a logical matrix.

    indx1 <- df[indx]==v1[indx][col(df[indx])] 
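
To make the col step more concrete, here is a small sketch that spells out the intermediate objects. It rebuilds only the two numeric columns from the question; the names `num_df`, `modes`, `col_idx`, `expanded` and `is_mode` are chosen for illustration and do not appear in the original code.

```r
# The two numeric columns from the question's data and their modes
num_df <- data.frame(x = c(1, 1, 2, 3, 1, 2, 4),
                     y = c(1, 2, 2, 6, 2, 3, 1))
modes  <- c(1, 2)

# 7 x 2 integer matrix: every cell holds its own column number
col_idx <- col(num_df)

# Indexing the mode vector by that matrix repeats each mode once per row,
# giving seven 1s followed by seven 2s
expanded <- modes[col_idx]

# Comparing the data frame with the expanded vector works column by column
# and returns a logical matrix
is_mode <- num_df == expanded

print(dim(col_idx))
print(expanded)
print(colSums(is_mode))   # number of rows equal to the mode in each column
```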
    

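    As in the previous code, we use & to check whether the TRUE values in 'indx1' also correspond to t==1. We take the colSums, divide by the colSums of 'indx1', and concatenate (c) with the NA elements of 'v1'. Putting the NA elements first matches the column order here because the non-numeric columns come first in 'df'.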

    unname(c(v1[is.na(v1)], colSums(indx1& t==1)/colSums(indx1)))
    #[1]        NA        NA        NA 0.3333333 0.3333333
    

    Similarly, we can create 'indx2' by changing the condition and then use colSums as before:

    indx2 <- df[indx]!=v1[indx][col(df[indx])] 
    unname(c(v1[is.na(v1)], colSums(indx2& t==1)/colSums(indx2)))
    #[1]   NA   NA   NA 0.75 0.75
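
The three results can also be collected in one pass. The sketch below is only one possible arrangement, not part of the original answer: `mode_summary` and its `target` argument are hypothetical names, and the target column is passed explicitly instead of relying on a standalone `t` vector in the workspace.

```r
# Example data from the question
df <- data.frame(
  c1 = c("A", "B", "C", "C", "B", "C", "C"),
  c2 = factor(c(1, 1, 2, 2, 1, 2, 1), labels = c("Y", "N")),
  d  = as.Date(c("2015-02-01", "2015-02-03", "2015-02-01", "2015-02-05",
                 "2015-02-03", "2015-02-01", "2015-02-03")),
  x  = c(1, 1, 2, 3, 1, 2, 4),
  y  = c(1, 2, 2, 6, 2, 3, 1),
  t  = c(1, 0, 1, 1, 0, 0, 1)
)

# mode_fun as defined in the question
mode_fun <- function(x) {
  mode0 <- names(which.max(table(x)))
  if (is.numeric(x)) return(as.numeric(mode0))
  mode0
}

# Hypothetical helper: one column of results per column of 'data'
mode_summary <- function(data, target = "t") {
  hit <- data[[target]] == 1            # rows where the target equals 1
  one_col <- function(x) {
    if (!is.numeric(x)) {
      return(c(mode = NA_real_, at_mode = NA_real_, off_mode = NA_real_))
    }
    is_m <- x == mode_fun(x)            # TRUE where the value equals the mode
    c(mode     = mode_fun(x),
      at_mode  = sum(is_m & hit) / sum(is_m),
      off_mode = sum(!is_m & hit) / sum(!is_m))
  }
  res <- vapply(data, one_col, c(mode = 0, at_mode = 0, off_mode = 0))
  # Comparing the target with itself is not meaningful, so blank it out
  res[c("at_mode", "off_mode"), target] <- NA
  res
}

res <- mode_summary(df)
print(res)
```

The result is a numeric matrix with rows 'mode', 'at_mode' and 'off_mode', so a single vector can be pulled out with, for example, unname(res["at_mode", ]).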