I want to calculate a t-statistic for feature selection in R using a for loop. The data has 155 columns and the dependent variable is binary (mutagen / nonmutagen). I would like to assign a t-statistic to every column. The problem is that I can't figure out how to write it.
Here is the formula I'm trying to implement in R:
I also wrote some code, but I'm not sure it is correct, and it only handles the first column. I need to put it in a for loop that covers all columns.
abs(diff(tapply(train_df[, 1], train_df$Activity, mean))) /
  sqrt(sd(train_df$NEG_01_NEG[train_df$Activity == "mutagen"])^2 / length(train_df$NEG_01_NEG[train_df$Activity == "mutagen"]) +
       sd(train_df$NEG_01_NEG[train_df$Activity != "mutagen"])^2 / length(train_df$NEG_01_NEG[train_df$Activity != "mutagen"]))
Thanks in advance!
If you don't want to worry about speed (and with 155 columns you probably don't need to), you can use the t.test function and apply it to every column.
First, simulate some data:
set.seed(1)
DF <- data.frame(y=rep(1:2, 50), x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
head(DF)
  y         x1          x2         x3
1 1 -0.6264538 -0.62036668  0.4094018
2 2  0.1836433  0.04211587  1.6888733
3 1 -0.8356286 -0.91092165  1.5865884
4 2  1.5952808  0.15802877 -0.3309078
5 1  0.3295078 -0.65458464 -2.2852355
6 2 -0.8204684  1.76728727  2.4976616
Then we can apply the t.test function to all but the first column using the formula interface:
group <- DF$y
lapply(DF[,-1], function(x) { t.test(x ~ group)$statistic })
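which returns a list with the test statistic for each column. By default t.test uses var.equal = FALSE, so this is the Welch (unequal variances) statistic. It is signed, so wrap it in abs() if you only care about the magnitude.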
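For feature selection a named numeric vector is often easier to work with than a list. A minimal sketch, assuming the simulated DF from above (recreated here so the block runs on its own); the names t_abs and ranked are only illustrative:

```r
set.seed(1)
DF <- data.frame(y = rep(1:2, 50), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
group <- DF$y

# Absolute Welch t-statistic per feature column.
# unname() drops the "t" label so the result is named by column only.
t_abs <- sapply(DF[, -1], function(x) abs(unname(t.test(x ~ group)$statistic)))

# Features ordered from largest to smallest |t|
ranked <- sort(t_abs, decreasing = TRUE)
ranked
```

With the real data, the same pattern should work after replacing group with train_df$Activity and DF[, -1] with the predictor columns of train_df, assuming Activity has exactly two levels.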
t.test computes a lot of extra information that you don't need, so you can speed this up substantially by doing the computations directly, but it really isn't necessary here.
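For reference, here is a rough sketch of what doing the computations directly could look like. It assumes the statistic wanted is the absolute Welch form, |mean1 - mean2| / sqrt(s1^2/n1 + s2^2/n2), which is what the code in the question appears to aim for. It also assumes all predictors are numeric with no missing values. The variable names (X, is_g1, t_direct) are illustrative.

```r
set.seed(1)
DF <- data.frame(y = rep(1:2, 50), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
group <- DF$y
X <- as.matrix(DF[, -1])   # predictors only

is_g1 <- group == 1        # first group; everything else is the second group
n1 <- sum(is_g1)
n2 <- sum(!is_g1)

# Per-column means and variances within each group
m1 <- colMeans(X[is_g1, , drop = FALSE])
m2 <- colMeans(X[!is_g1, , drop = FALSE])
v1 <- apply(X[is_g1, , drop = FALSE], 2, var)
v2 <- apply(X[!is_g1, , drop = FALSE], 2, var)

# Absolute Welch t-statistic for every column at once
t_direct <- abs(m1 - m2) / sqrt(v1 / n1 + v2 / n2)
t_direct
```

The result should agree with abs() of the statistics returned by t.test for the same columns, which is an easy sanity check before using it on the full data.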