r, statistics, r-caret, variance

What does nearZeroVar do in R?


I have a rather large dataset from which I would like to exclude columns with very low variance, which is why I would like to use the nearZeroVar function. However, I have trouble understanding what freqCut and uniqueCut do and how they influence each other. I have already read the R documentation, but it does not really help me with this. If anyone could explain it to me, I would be very thankful!


Solution

  • If a variable shows very little variation, it behaves like a constant and is not useful for prediction. Such a variable has close to zero variance, hence the name of the function.

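    The two parameters do not influence each other; each covers a common scenario that gives rise to near-zero-variance variables. freqCut is the cutoff for the ratio of the most common value's frequency to the second most common value's frequency, and uniqueCut is the cutoff for the percentage of distinct values out of the total number of samples. A column is excluded only if it fails both criteria: its frequency ratio is above freqCut and its percentage of unique values does not exceed uniqueCut. Constant columns are always flagged.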

    Let's use an example:

    library(caret)  # nearZeroVar() comes from the caret package
    mat = cbind(1,rep(c(1,2),c(8,1)),rep(1:3,3),1:9)
    mat
          [,1] [,2] [,3] [,4]
     [1,]    1    1    1    1
     [2,]    1    1    2    2
     [3,]    1    1    3    3
     [4,]    1    1    1    4
     [5,]    1    1    2    5
     [6,]    1    1    3    6
     [7,]    1    1    1    7
     [8,]    1    1    2    8
     [9,]    1    2    3    9
    

    With the defaults, freqCut = 95/5 (a frequency ratio of 19) and uniqueCut = 10 (percent unique values), only the 1st column, which is constant, is taken out:

    nearZeroVar(mat)
    [1] 1
    

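    Now look at the 2nd column: the ratio of the most common value to the second most common is 8/1 = 8, and it has 2 unique values out of 9 rows, so 2/9 = 22.2% unique. For this column to be filtered out, you need to change both settings, lowering freqCut below 8 and raising uniqueCut above 22.2: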

    nearZeroVar(mat,freqCut=7/1,uniqueCut=30)
    [1] 1 2
    

    Lastly, columns 3 and 4 are something you most likely should not filter out. Column 3 (frequency ratio 3/3 = 1, 33.3% unique) is only flagged when we set nonsensical thresholds, and column 4 (100% unique) still passes:

    nearZeroVar(mat,freqCut=0.1,uniqueCut=50)
    [1] 1 2 3
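
    To see where these numbers come from, the two statistics can be computed by hand in base R. This is only a minimal sketch of the rule described above, not caret's actual implementation: the helper name nzv_metrics is made up for this illustration, it ignores missing values, and it assumes a constant column gets a frequency ratio of 0 and that the comparison against uniqueCut is "less than or equal".

```r
# Same example matrix as above
mat <- cbind(1, rep(c(1, 2), c(8, 1)), rep(1:3, 3), 1:9)

# Hypothetical helper (not part of caret): per-column frequency ratio and
# percent unique, plus the resulting flag under the given cutoffs.
nzv_metrics <- function(x, freqCut = 95/5, uniqueCut = 10) {
  cols <- lapply(seq_len(ncol(x)), function(j) x[, j])
  nUnique <- sapply(cols, function(col) length(unique(col)))
  freqRatio <- sapply(cols, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    # most common count / second most common count;
    # assumption: a constant column has no second value, so use 0
    if (length(counts) < 2) 0 else counts[[1]] / counts[[2]]
  })
  percentUnique <- 100 * nUnique / nrow(x)
  zeroVar <- nUnique == 1
  # flagged if constant, or if BOTH criteria are failed
  flagged <- zeroVar | (freqRatio > freqCut & percentUnique <= uniqueCut)
  data.frame(freqRatio, percentUnique, zeroVar, flagged)
}

which(nzv_metrics(mat)$flagged)                                # 1
which(nzv_metrics(mat, freqCut = 7, uniqueCut = 30)$flagged)   # 1 2
which(nzv_metrics(mat, freqCut = 0.1, uniqueCut = 50)$flagged) # 1 2 3
```

    Printing nzv_metrics(mat) shows the frequency ratio and percent unique for every column, which makes it easier to pick cutoffs than reading the flagged indices alone.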