
How to calculate z-scores using the scale() function with NA values


I have a data frame with 98790 observations of 143 variables. It contains both numbers and NAs. I would like to compute z-scores for each row. I tried the following:

>df
sample1 sample2 sample3 sample4 sample5 sampl6 sample7 sample8
1:     6.96123  3.021311          NA        NA  7.464205   7.902878  -1.194076   7.771018
2:          NA        NA          NA        NA        NA         NA         NA         NA
3:          NA        NA          NA        NA        NA         NA   2.784635         NA
4:          NA        NA    8.342075        NA  8.464205         NA   6.462707   7.118941
5:          NA  7.243703   10.149430        NA        NA   8.317915         NA         NA

And:

>res <- t(scale(t(df)))

Will the above function ignore all NAs and calculate the z-scores? If not, how can I calculate z-scores without considering the NAs?


Solution

  • You might want to convert to a matrix before transposing/scaling/re-transposing (data frame -> matrix -> transpose -> scale -> transpose -> data frame)

    Otherwise, your approach seems to work fine: scale() omits NA values when it computes each column's mean and standard deviation. Here's an example with some NA values included:

    set.seed(101)
    m <- matrix(rnorm(25),5,5)
    m[sample(1:25,size=8)] <- NA
    m
    ##            [,1] [,2]       [,3]       [,4]       [,5]
    ## [1,] -0.3260365   NA  0.5264481 -0.1933380         NA
    ## [2,]  0.5524619   NA -0.7948444 -0.8497547  0.7085221
    ## [3,] -0.6749438   NA  1.4277555  0.0584655 -0.2679805
    ## [4,]  0.2143595   NA -1.4668197 -0.8176704 -1.4639218
    ## [5,]         NA   NA -0.2366834         NA  0.7444358
    scale(m)
    ##            [,1] [,2]       [,3]       [,4]       [,5]
    ## [1,] -0.4885685   NA  0.5628440  0.5661203         NA
    ## [2,]  1.1159619   NA -0.6077977 -0.8785073  0.7475404
    ## [3,] -1.1258292   NA  1.3613864  1.1202838 -0.1904198
    ## [4,]  0.4984359   NA -1.2031558 -0.8078967 -1.3391573
    ## [5,]         NA   NA -0.1132769         NA  0.7820366
    ## attr(,"scaled:center")
    ## [1] -0.05853976         NaN -0.10882877 -0.45057439 -0.06973609
    ## attr(,"scaled:scale")
    ## [1] 0.5475112 0.0000000 1.1286908 0.4543848 1.0410918
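
For the row-wise case in the question, the conversion chain suggested at the top of this answer (data frame -> matrix -> transpose -> scale -> transpose -> data frame) could look roughly like the sketch below. The small data frame `df` and its values are made up for illustration; they only stand in for the much larger one in the question.

```r
# Hypothetical toy data frame standing in for the one in the question
df <- data.frame(
  sample1 = c(6.96, NA, NA),
  sample2 = c(3.02, NA, 2.78),
  sample3 = c(7.46, NA, NA),
  sample4 = c(7.90, NA, 8.34)
)

m   <- as.matrix(df)        # data frame -> numeric matrix
z   <- t(scale(t(m)))       # transpose so rows become columns, scale, transpose back
res <- as.data.frame(z)     # matrix -> data frame, same column names as df

res
```

Each row is standardized using only its non-missing values. Note that a row with fewer than two non-missing values (such as the all-NA second row here) cannot be standardized and comes back as NA or NaN.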
    

    The documentation (?scale) is also very explicit about how NA values are handled:

    ... centering is done by subtracting the column means (omitting ‘NA’s) of ‘x’ from their corresponding columns ...

    ... the root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values ...

    (emphasis added)
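
As a quick sanity check of the documented behaviour, one could compare scale() against a z-score computed by hand with na.rm = TRUE. This is only a sketch: the vector `x` reuses the first row of the question's data, and the names `manual` and `scaled` are illustrative.

```r
# First row of the data shown in the question, including its two NAs
x <- c(6.96123, 3.021311, NA, NA, 7.464205, 7.902878, -1.194076, 7.771018)

# Manual z-score using only the non-missing values
manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() returns a one-column matrix; as.numeric() drops the extra attributes
scaled <- as.numeric(scale(x))

# Compare the two results (NAs stay NA in both)
all.equal(manual, scaled)
```

With the default center = TRUE, the root-mean-square of the centered values is the same quantity as sd(x, na.rm = TRUE), so the two results should agree, with NAs left in their original positions.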