r machine-learning t-test feature-selection

t-stat for feature selection

I want to calculate a t-statistic for feature selection in R using a for loop. The data has 155 columns, and the dependent variable is binary (mutagen vs. nonmutagen). I would like to assign a t-statistic to every column, but I couldn't figure out how to write it.

Here is the formula I'm trying to implement in R:
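
A sketch of the statistic in symbols, assuming the usual unequal-variance (Welch) two-sample form that the code below is aiming for. The subscripts 1 and 2 are labels chosen here for the mutagen and nonmutagen groups:

```latex
t = \frac{\left| \bar{x}_1 - \bar{x}_2 \right|}
         {\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

Here \bar{x}_k, s_k^2 and n_k are the sample mean, sample variance and number of observations of the column within group k.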

I also wrote some code, but I'm not sure it is correct, and it only handles the first column. I need to put it in a for loop over all columns.

x <- train_df[, 1]                         # first column (NEG_01_NEG)
is_mut <- train_df$Activity == "mutagen"   # TRUE for the mutagen group
# |difference in group means| / sqrt(s1^2/n1 + s2^2/n2); note sd(x)^2, not sd(x^2)
abs(mean(x[is_mut]) - mean(x[!is_mut])) /
  sqrt(sd(x[is_mut])^2 / sum(is_mut) + sd(x[!is_mut])^2 / sum(!is_mut))

Thanks in advance!

Solution

If you don't want to worry about speed (and with 155 columns you probably don't need to), you can use the t.test function and apply it to every column.

Simulate some data first:

set.seed(1)
DF <- data.frame(y=rep(1:2, 50), x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
head(DF)

  y         x1          x2         x3
1 1 -0.6264538 -0.62036668  0.4094018
2 2  0.1836433  0.04211587  1.6888733
3 1 -0.8356286 -0.91092165  1.5865884
4 2  1.5952808  0.15802877 -0.3309078
5 1  0.3295078 -0.65458464 -2.2852355
6 2 -0.8204684  1.76728727  2.4976616

Then we can apply the t.test function to all but the first column using the formula argument.

group <- DF$y
lapply(DF[,-1], function(x) { t.test(x ~ group)$statistic })

which returns a list with the (signed) test statistic for each column. Wrap the result in abs() if only the magnitude matters.
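
For feature selection it is often handier to have a plain named vector that can be sorted. The following is a minimal sketch on the same simulated data; the names t_stats and ranked are just illustrative, and ranking by absolute value is an assumption about how the scores will be used:

```r
set.seed(1)
DF <- data.frame(y = rep(1:2, 50), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
group <- DF$y

# vapply returns a named numeric vector (one t-statistic per column) instead of a list
t_stats <- vapply(DF[, -1],
                  function(x) unname(t.test(x ~ group)$statistic),
                  numeric(1))

# Rank features by the magnitude of the statistic, largest first
ranked <- sort(abs(t_stats), decreasing = TRUE)
ranked
```

Note that t.test uses the Welch (unequal-variance) test by default, so the standard error is sqrt(s1^2/n1 + s2^2/n2).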

t.test computes a lot of extra information that you don't need, so you could speed this up substantially by doing the computations directly, but that really isn't necessary here.
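
If you do want the direct computation, here is one possible sketch. welch_t is a hypothetical helper written for this example, not a base R function; it assumes numeric columns, exactly two groups, and no missing values:

```r
# Hypothetical helper: Welch t-statistic for every column of X at once
welch_t <- function(X, group) {
  lev <- sort(unique(group))              # same group order as t.test(x ~ group)
  stopifnot(length(lev) == 2)
  a <- X[group == lev[1], , drop = FALSE] # rows belonging to the first group
  b <- X[group == lev[2], , drop = FALSE] # rows belonging to the second group
  # Standard error with unequal variances: sqrt(s1^2/n1 + s2^2/n2)
  se <- sqrt(sapply(a, var) / nrow(a) + sapply(b, var) / nrow(b))
  (colMeans(a) - colMeans(b)) / se
}

set.seed(1)
DF <- data.frame(y = rep(1:2, 50), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
t_direct <- welch_t(DF[, -1], DF$y)
t_direct
```

The result is a named numeric vector that should agree with the statistic from t.test for each column; take abs() of it if you only need the magnitude.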