r, statistics, normal-distribution

Per row Shapiro-Wilk test


I am trying to test the values in each row of a data frame for normality. Ideally, I want to run one Shapiro-Wilk test per row (as many tests as there are rows in the data frame).

The real dataset is big, but for this purpose I am using an example.

dput(example)
structure(c(103L, 122L, 40L, 107L, 124L, 108L, 89L, 102L, 40L, 
70L, 78L, 78L, 78L, 78L, 64L, 64L, 64L, 50L, 50L, 50L, 133L, 
64L, 55L, 64L, 108L, 124L, 108L, 146L, 13L, 40L, 122L, 124L, 
107L, 122L, 133L, 122L, 107L, 121L, 70L, 113L, NA, 108L, NA, 
40L, 122L, 89L, 36L, 113L, 26L, 26L, NA, 103L, NA, 55L, 153L, 
146L, 36L, NA, NA, 77L, NA, 133L, NA, 36L, 167L, 92L, 65L, NA, 
NA, 40L, NA, 107L, NA, 89L, 146L, NA, 92L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), .Dim = 10:9, .Dimnames = list(
    c("7", "10", "51", "62", "4", "5", "79", "16", "17", "243"
    ), c("centroid", "n_1", "n_2", "n_3", "n_4", "n_5", "n_6", 
    "n_7", "n_8")))

As said, I would like to test normality for each row. I expect some rows will "pass", while for others the test cannot be computed because there are not enough values or the values are all identical. I am actually very interested in those rows, since I am trying to show that testing per row is a bad idea. I would like the results written into a new column, with an error marker (something like ERROR/FALSE) wherever the test cannot be computed.


I can calculate Shapiro for any row like this:

shapiro.test(example[1,])
    Shapiro-Wilk normality test

data:  example[1, ]
W = 0.9631, p-value = 0.7984

And I should be able to calculate per row Shapiro like this (not working):

> apply(example, example[1:10,], shapiro.test) 
Error in d[-MARGIN] : only 0's may be mixed with negative subscripts

I hope someone can point me in the right direction. Thanks!


Solution

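  • apply() expects the margin as its second argument (1 for rows, 2 for columns), not a subset of the data, which is why your call fails. You could write a function to get your desired result: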

    df <- structure(c(103L, 122L, 40L, 107L, 124L, 108L, 89L, 102L, 40L, 
                      70L, 78L, 78L, 78L, 78L, 64L, 64L, 64L, 50L, 50L, 50L, 133L, 
                      64L, 55L, 64L, 108L, 124L, 108L, 146L, 13L, 40L, 122L, 124L, 
                      107L, 122L, 133L, 122L, 107L, 121L, 70L, 113L, NA, 108L, NA, 
                      40L, 122L, 89L, 36L, 113L, 26L, 26L, NA, 103L, NA, 55L, 153L, 
                      146L, 36L, NA, NA, 77L, NA, 133L, NA, 36L, 167L, 92L, 65L, NA, 
                      NA, 40L, NA, 107L, NA, 89L, 146L, NA, 92L, NA, NA, NA, NA, NA, 
                      NA, NA, NA, NA, NA, NA, NA, NA), .Dim = 10:9, .Dimnames = list(
                        c("7", "10", "51", "62", "4", "5", "79", "16", "17", "243"
                        ), c("centroid", "n_1", "n_2", "n_3", "n_4", "n_5", "n_6", 
                             "n_7", "n_8")))
    
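    # Return the Shapiro-Wilk W statistic for x, or 'ERROR' when x has fewer
    # than n_diff_numbers distinct non-NA values.
    f.shapiro.stat <- function(x, n_diff_numbers = 3) {
      if (sum(!is.na(unique(x))) < n_diff_numbers) {
        return('ERROR')
      }
      unname(shapiro.test(x)$statistic)
    }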
    
    res <- apply(df, 1, f.shapiro.stat, n_diff_numbers = 3)
    
    df2 <- as.data.frame(df)
    df2$shapiro <- res
    df2
        centroid n_1 n_2 n_3 n_4 n_5 n_6 n_7 n_8   shapiro
    7        103  78 133 122  NA  NA  NA  NA  NA 0.9630974
    10       122  78  64 124 108 103 133 107  NA 0.9225951
    51        40  78  55 107  NA  NA  NA  NA  NA 0.9723459
    62       107  78  64 122  40  55  36  89  NA 0.9552869
    4        124  64 108 133 122 153 167 146  NA 0.9385053
    5        108  64 124 122  89 146  92  NA  NA 0.9809580
    79        89  64 108 107  36  36  65  92  NA 0.8915689
    16       102  50 146 121 113  NA  NA  NA  NA 0.9307804
    17        40  50  13  70  26  NA  NA  NA  NA 0.9911093
    243       70  50  40 113  26  77  40  NA  NA 0.9238762
    

    The function also returns 'ERROR' when there are too few distinct values, for example when all values are identical:

    > f.shapiro.stat(x = rep(1, 5))
    [1] "ERROR"
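
One caveat with returning the string 'ERROR': as soon as a single row fails, the whole result column is coerced to character. If you would rather keep the column numeric and also record the p-value, a possible variant is to let shapiro.test() raise its own errors (too few values, all values identical) and catch them with tryCatch(). This is only a sketch: the helper name row_shapiro, the column names W, p_value and testable, and the small matrix m are made up for illustration. Row "a" reuses the non-missing values of row "7" from the question.

```r
# Hypothetical helper (not part of the answer above): returns W and the
# p-value, or NA for both when shapiro.test() throws an error for the row.
row_shapiro <- function(x) {
  tryCatch({
    test <- shapiro.test(x)
    c(W = unname(test$statistic), p_value = test$p.value)
  }, error = function(e) c(W = NA_real_, p_value = NA_real_))
}

# Made-up example: "a" is testable, "b" has too few non-missing values,
# "c" contains only identical values.
m <- rbind(a = c(103, 78, 133, 122),
           b = c(5, 7, NA, NA),
           c = c(4, 4, 4, 4))

# apply() returns one column per input row, so transpose the result.
res <- t(apply(m, 1, row_shapiro))

out <- cbind(as.data.frame(m), res)
out$testable <- !is.na(out$p_value)   # FALSE marks rows that could not be tested
out
```

Rows where testable is FALSE are the ones for which the test could not be computed, so they can be counted or filtered directly, for example with sum(!out$testable).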