In R, I want to select all the columns of a dataset in which the most frequent value accounts for less than 40% of the rows in that column. I am applying sapply, but I am not getting the correct output. Note: all the column values are numeric.
train = train[, sapply(train, function(col) length(unique(col))) < 0.4*nrow(train)]
Please suggest how to proceed.
By playing around with a toy dataset, I found that this code works:
train[, sapply(train, function(x) {(sort(table(x), decreasing = TRUE)/nrow(train))[[1]] < 0.4})]
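Basically, I create a table of relative frequencies (sorted in decreasing order) for each numeric column in train, and then I check whether the most frequent value in each column occurs in less than 40% of the rows. If it does, the column is selected; otherwise it is discarded. Your original attempt compares length(unique(col)), the number of distinct values in a column, against the threshold, which is not the same thing as the share of the most frequent value.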
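For reference, here is a minimal sketch of the same idea on a small made-up dataset, in case it helps to verify the behaviour. The data frame contents and the helper name top_share are my own assumptions for illustration, not part of your data. It uses max(table(x)) instead of sorting the whole table, and drop = FALSE so the result stays a data frame even if only one column passes the filter.

```r
# Toy data (hypothetical): 10 rows, three numeric columns.
train <- data.frame(
  a = c(1, 1, 1, 1, 1, 2, 3, 4, 5, 6),  # most frequent value covers 50% of rows
  b = 1:10,                              # every value occurs once (10% each)
  c = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5)    # most frequent value covers 30% of rows
)

# Share of rows taken by the most frequent value in a column.
# top_share is an illustrative helper name, not a base R function.
top_share <- function(x) max(table(x)) / length(x)

# TRUE for columns whose most frequent value covers less than 40% of rows.
keep <- sapply(train, top_share) < 0.4

# drop = FALSE keeps a data frame even when a single column is selected.
selected <- train[, keep, drop = FALSE]
names(selected)
```

On this toy data, column a is discarded and columns b and c are kept. Note that table() ignores NA values by default, so if your columns contain missing values you may want to decide explicitly how they should count.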