I'm trying to calculate the mode for numeric columns. Columns that are not numeric should have NA as a placeholder in the result vector. I also need percentages relative to a target column. Some example data:
c1= c("A", "B", "C", "C", "B", "C", "C")
c2= factor(c(1, 1, 2, 2,1,2,1), labels = c("Y","N"))
d= as.Date(c("2015-02-01", "2015-02-03","2015-02-01","2015-02-05", "2015-02-03","2015-02-01", "2015-02-03"), format="%Y-%m-%d")
x= c(1,1,2,3,1,2,4)
y= c(1,2,2,6,2,3,1)
t= c(1,0,1,1,0,0,1)
df=data.frame(c1, c2, d, x, y,t)
df
c1 c2 d x y t
1 A Y 2015-02-01 1 1 1
2 B Y 2015-02-03 1 2 0
3 C N 2015-02-01 2 2 1
4 C N 2015-02-05 3 6 1
5 B Y 2015-02-03 1 2 0
6 C N 2015-02-01 2 3 0
7 C Y 2015-02-03 4 1 1
I would need the mode for each numeric column:
mode=as.numeric(c("NA","NA", "NA", 1,2,1))
mode
[1] NA NA NA 1 2 1
and a vector of percentages of rows with t==1, when value in column == mode
[1] NA NA NA 0.33 0.33
and a vector of percentages of rows with t==1, when value in column != mode
[1] NA NA NA 0.75 0.75
How could I calculate such vectors?
The best I have found for mode is:
library(plyr)
mode_fun <- function(x) {
mode0 <- names(which.max(table(x)))
if(is.numeric(x)) return(as.numeric(mode0))
mode0
}
kdf_mode=apply(kdf,2, numcolwise(mode_fun))
But it gives an error if there are any non-numeric columns.
We can use sapply to loop over the columns of 'df' and apply mode_fun to get the output vector ('v1'). We use an if/else condition to return NA for non-numeric columns.
v1 <- unname(sapply(df, function(x) if(!is.numeric(x)) NA else mode_fun(x)))
v1
#[1] NA NA NA 1 2 1
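As a variation on the sapply call above, here is a minimal sketch using vapply, which checks that every column yields exactly one numeric value. The reduced data frame is only there to keep the example self-contained; mode_fun is copied from the question.

```r
# Reduced version of the question's data (one non-numeric column is enough)
df <- data.frame(
  c1 = c("A", "B", "C", "C", "B", "C", "C"),
  x  = c(1, 1, 2, 3, 1, 2, 4),
  y  = c(1, 2, 2, 6, 2, 3, 1),
  t  = c(1, 0, 1, 1, 0, 0, 1)
)

# mode_fun as defined in the question
mode_fun <- function(x) {
  mode0 <- names(which.max(table(x)))
  if (is.numeric(x)) return(as.numeric(mode0))
  mode0
}

# NA_real_ keeps the result type numeric for non-numeric columns
v1 <- vapply(
  df,
  function(x) if (is.numeric(x)) mode_fun(x) else NA_real_,
  numeric(1),
  USE.NAMES = FALSE
)
v1
```

Note that table sorts the values and which.max returns the first maximum, so when several values tie for the highest count, mode_fun returns the smallest of them.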
For the second case (I guess we don't need the 6th column, i.e. 't'), we again loop through the columns of 'df' with sapply and use the if/else condition. In the else branch, we check which column values are equal to the mode (mode_fun(x)==x) and store that logical index as 'v1'. We use & to keep only the positions where the value equals the mode and t==1, take the sum, and divide by sum(v1), i.e. the number of values equal to the mode. Here 't' is the vector created in the question's example code.
unname(sapply(df[-6], function(x) if(!is.numeric(x)) {
NA
} else {
v1 <- mode_fun(x)==x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.3333333 0.3333333
For the third, we change the condition to get the logical index where the column is not equal to the mode, then do the same as in the previous case.
unname(sapply(df[-6], function(x) if(!is.numeric(x)){
NA
} else {
v1 <- mode_fun(x)!=x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.75 0.75
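The two sapply calls above differ only in the comparison, so they could be folded into one helper. This is a sketch, not part of the original answer: rate_by_mode is a hypothetical name, and it takes the target from df$t rather than relying on a global 't'.

```r
# Reduced version of the question's data
df <- data.frame(
  c1 = c("A", "B", "C", "C", "B", "C", "C"),
  x  = c(1, 1, 2, 3, 1, 2, 4),
  y  = c(1, 2, 2, 6, 2, 3, 1),
  t  = c(1, 0, 1, 1, 0, 0, 1)
)

# mode_fun as defined in the question
mode_fun <- function(x) {
  mode0 <- names(which.max(table(x)))
  if (is.numeric(x)) return(as.numeric(mode0))
  mode0
}

# Hypothetical helper: share of rows with target == 1 among rows where
# the value equals the mode, and among rows where it does not
rate_by_mode <- function(x, target) {
  if (!is.numeric(x)) return(c(equal = NA_real_, not_equal = NA_real_))
  is_mode <- x == mode_fun(x)
  c(equal     = mean(target[is_mode] == 1),
    not_equal = mean(target[!is_mode] == 1))
}

# One column of results per non-target column of df
res <- sapply(df[names(df) != "t"], rate_by_mode, target = df$t)
res
```

The result is a matrix with rows "equal" and "not_equal". If every value in a column equals the mode, the "not_equal" entry is NaN because there are no rows to average over.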
After we calculate 'v1', this can also be done without looping with sapply. We create a logical index ('indx') that is TRUE where the column is numeric and the column name is not 't'.
indx <- sapply(df, is.numeric) & names(df)!='t'
We subset 'df' and 'v1' based on 'indx' (df[indx], v1[indx]) and make the lengths equal by replicating the vector using col. col gives the numeric column index of each element in df[indx]. Then we check whether the subset dataset is equal to the replicated vector, which gives a logical matrix.
indx1 <- df[indx]==v1[indx][col(df[indx])]
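To see what the col indexing does, here is a small sketch on a plain matrix (the names m, modes, and eq are made up for this illustration). sweep is shown only as an alternative that should give the same comparison.

```r
# Small numeric matrix and one "mode" per column
m <- cbind(x = c(1, 1, 2), y = c(2, 6, 2))
modes <- c(1, 2)

# col(m) holds the column number of every cell
col(m)

# Indexing by col(m) repeats each mode once per row, in column-major order
modes[col(m)]

# Element-wise comparison of each column with its own mode
eq <- m == modes[col(m)]
eq

# Alternative: sweep applies "==" along the columns (MARGIN = 2)
eq2 <- sweep(m, 2, modes, FUN = "==")
```

Because R stores matrices column by column, the replicated vector lines up with the cells of the matrix, so each column is compared only against its own mode.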
As in the previous code, we use & to check whether the TRUE values in 'indx1' also correspond to t==1. We take the colSums, divide by the colSums of 'indx1', and concatenate (c) with the NA elements of 'v1'.
unname(c(v1[is.na(v1)], colSums(indx1& t==1)/colSums(indx1)))
#[1] NA NA NA 0.3333333 0.3333333
Similarly, we can create 'indx2' by changing the condition and then do colSums as before.
indx2 <- df[indx]!=v1[indx][col(df[indx])]
unname(c(v1[is.na(v1)], colSums(indx2& t==1)/colSums(indx2)))
#[1] NA NA NA 0.75 0.75