I am trying to test the values in each row of a data frame for normality. Ideally, I want to run one Shapiro-Wilk test per row (as many tests as there are rows in the data frame).
The real dataset is big, so I am using a small example here.
dput(example)
structure(c(103L, 122L, 40L, 107L, 124L, 108L, 89L, 102L, 40L,
70L, 78L, 78L, 78L, 78L, 64L, 64L, 64L, 50L, 50L, 50L, 133L,
64L, 55L, 64L, 108L, 124L, 108L, 146L, 13L, 40L, 122L, 124L,
107L, 122L, 133L, 122L, 107L, 121L, 70L, 113L, NA, 108L, NA,
40L, 122L, 89L, 36L, 113L, 26L, 26L, NA, 103L, NA, 55L, 153L,
146L, 36L, NA, NA, 77L, NA, 133L, NA, 36L, 167L, 92L, 65L, NA,
NA, 40L, NA, 107L, NA, 89L, 146L, NA, 92L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), .Dim = 10:9, .Dimnames = list(
c("7", "10", "51", "62", "4", "5", "79", "16", "17", "243"
), c("centroid", "n_1", "n_2", "n_3", "n_4", "n_5", "n_6",
"n_7", "n_8")))
As said, I would like to test normality for each row. I expect some rows to "pass", while for others the test cannot be computed because there are not enough values or the values are all identical. I am actually very interested in these cases, since I am trying to show that this is a bad idea. I would like the results written to a new column, with a marker (something like ERROR/FALSE) wherever the test cannot be computed.
I can calculate Shapiro for any row like this:
shapiro.test(example[1,])
Shapiro-Wilk normality test
data: example[1, ]
W = 0.9631, p-value = 0.7984
I thought I could run the test for every row like this, but it does not work:
> apply(example, example[1:10,], shapiro.test)
Error in d[-MARGIN] : only 0's may be mixed with negative subscripts
I hope someone can point me towards the right direction. Thanks!
You could write a function to get the desired result:
df <- structure(c(103L, 122L, 40L, 107L, 124L, 108L, 89L, 102L, 40L,
70L, 78L, 78L, 78L, 78L, 64L, 64L, 64L, 50L, 50L, 50L, 133L,
64L, 55L, 64L, 108L, 124L, 108L, 146L, 13L, 40L, 122L, 124L,
107L, 122L, 133L, 122L, 107L, 121L, 70L, 113L, NA, 108L, NA,
40L, 122L, 89L, 36L, 113L, 26L, 26L, NA, 103L, NA, 55L, 153L,
146L, 36L, NA, NA, 77L, NA, 133L, NA, 36L, 167L, 92L, 65L, NA,
NA, 40L, NA, 107L, NA, 89L, 146L, NA, 92L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), .Dim = 10:9, .Dimnames = list(
c("7", "10", "51", "62", "4", "5", "79", "16", "17", "243"
), c("centroid", "n_1", "n_2", "n_3", "n_4", "n_5", "n_6",
"n_7", "n_8")))
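# Return the Shapiro-Wilk W statistic for x, or 'ERROR' when x has fewer than
# n_diff_numbers distinct non-missing values (too little data or variation).
f.shapiro.stat <- function(x, n_diff_numbers = 3) {
  if (sum(!is.na(unique(x))) < n_diff_numbers) {
    return('ERROR')
  }
  unname(shapiro.test(x)$statistic)
}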
res <- apply(df, 1, f.shapiro.stat, n_diff_numbers = 3)
df2 <- as.data.frame(df)
df2$shapiro <- res
> df2
    centroid n_1 n_2 n_3 n_4 n_5 n_6 n_7 n_8   shapiro
7        103  78 133 122  NA  NA  NA  NA  NA 0.9630974
10       122  78  64 124 108 103 133 107  NA 0.9225951
51        40  78  55 107  NA  NA  NA  NA  NA 0.9723459
62       107  78  64 122  40  55  36  89  NA 0.9552869
4        124  64 108 133 122 153 167 146  NA 0.9385053
5        108  64 124 122  89 146  92  NA  NA 0.9809580
79        89  64 108 107  36  36  65  92  NA 0.8915689
16       102  50 146 121 113  NA  NA  NA  NA 0.9307804
17        40  50  13  70  26  NA  NA  NA  NA 0.9911093
243       70  50  40 113  26  77  40  NA  NA 0.9238762
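For reference, the call in the question failed because the second argument of apply() is MARGIN, which has to be 1 (rows) or 2 (columns) rather than a slice of the data. A minimal sketch on a small matrix, where the values in `m` are made up purely for illustration:

```r
# Small illustrative matrix; the values are made up for this sketch.
m <- matrix(c(103, 78, 133, 122,
               40, 50,  13,  70,
              124, 64, 108, 133),
            nrow = 3, byrow = TRUE)

# MARGIN = 1 applies the function to each row; MARGIN = 2 would use columns.
w_by_row <- apply(m, 1, function(x) unname(shapiro.test(x)$statistic))

# The same statistics, computed by calling shapiro.test() on each row by hand.
w_manual <- unname(c(shapiro.test(m[1, ])$statistic,
                     shapiro.test(m[2, ])$statistic,
                     shapiro.test(m[3, ])$statistic))
```

Both vectors should hold the same three W statistics, one per row.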
The function also checks whether there is enough variation in your data: it returns 'ERROR' when the input has fewer than n_diff_numbers distinct non-missing values. Example:
> f.shapiro.stat(x = rep(1,1,1))
[1] "ERROR"
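One caveat: because 'ERROR' is a character string, a single failing row turns the whole shapiro column into character. If that is a problem, a possible variant is sketched below. It is only a sketch under assumptions: the helper name `shapiro_row`, the `testable` column, and the data in `m` are hypothetical and not part of the answer above. It relies on shapiro.test() raising an error for fewer than 3 values or for identical values, catches that error, and returns NA instead, along with the p-value.

```r
# Hypothetical helper (not part of the original answer): returns W and the
# p-value for one row, or NA for both when shapiro.test() raises an error.
shapiro_row <- function(x) {
  x <- x[!is.na(x)]
  tryCatch({
    st <- shapiro.test(x)
    c(W = unname(st$statistic), p.value = st$p.value)
  }, error = function(e) c(W = NA_real_, p.value = NA_real_))
}

# Illustrative data: row "b" has too few values, row "c" has identical values.
m <- rbind(a = c(103, 78, 133, 122, NA),
           b = c(40, 50, NA, NA, NA),
           c = c(64, 64, 64, 64, 64))

res <- t(apply(m, 1, shapiro_row))   # one row of results per input row
out <- cbind(as.data.frame(m), res)  # result columns stay numeric
out$testable <- !is.na(out$W)        # FALSE marks rows that could not be tested
```

Here the rows that cannot be tested are flagged by `testable` being FALSE, which keeps the W and p.value columns numeric for later filtering or plotting.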