
Perform a Shapiro-Wilk Normality Test


I want to perform a Shapiro-Wilk normality test. My data is in CSV format. It looks like this:

 heisenberg
    HWWIchg
1    -15.60
2    -21.60
3    -19.50
4    -19.10
5    -20.90
6    -20.70
7    -19.30
8    -18.30
9    -15.10

However, when I perform the test, I get:

 shapiro.test(heisenberg)

Error in `[.data.frame`(x, complete.cases(x)) : undefined columns selected

Why isn't R selecting the right column, and how do I make it do that?


Solution

  • What does shapiro.test do?

    shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".

    How to perform shapiro.test in R?

    The R help page for ?shapiro.test gives,

    x - a numeric vector of data values. Missing values are allowed, 
        but the number of non-missing values must be between 3 and 5000.
    

    That is, shapiro.test expects a numeric vector as its input, corresponding to the sample you would like to test, and it is the only input required. Since you have a data.frame, you have to pass the desired column to the function as follows:

    > shapiro.test(heisenberg$HWWIchg)
    #   Shapiro-Wilk normality test
    
    # data:  heisenberg$HWWIchg 
    # W = 0.9001, p-value = 0.2528
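
    As a minimal sketch of the same fix, the data frame below is rebuilt by hand from the values shown in the question (in practice it would come from reading the CSV file). It shows that passing the whole data frame fails, and that `$`, `[[ ]]` and `[, 1]` all hand shapiro.test a plain numeric vector:

```r
# Rebuild the data frame from the question (values copied from the post).
heisenberg <- data.frame(
  HWWIchg = c(-15.60, -21.60, -19.50, -19.10, -20.90,
              -20.70, -19.30, -18.30, -15.10)
)

# Passing the whole data frame fails: shapiro.test wants a numeric vector.
whole_df <- try(shapiro.test(heisenberg), silent = TRUE)
inherits(whole_df, "try-error")

# Each of these extracts the column as a numeric vector.
by_dollar <- shapiro.test(heisenberg$HWWIchg)
by_name   <- shapiro.test(heisenberg[["HWWIchg"]])
by_index  <- shapiro.test(heisenberg[, 1])

# Single brackets without a comma still return a data frame, so
# shapiro.test(heisenberg["HWWIchg"]) would fail in the same way.
is.data.frame(heisenberg["HWWIchg"])

by_dollar
```

    The three extraction styles differ only in the `data:` label printed in the result; the W statistic and p-value are the same.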
    

    Interpreting results from shapiro.test:

    First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.

    As shown above, shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, you are testing "against the assumption of Normality". In other words (correct me if I am wrong), it would be much better if one could test the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because failing to reject a NULL hypothesis is not the same as showing that it is true.

    In the case of shapiro.test, a p-value <= 0.05 rejects the null hypothesis that the samples come from a normal distribution. To put it loosely, data this far from normal would rarely turn up if the samples really came from a normal distribution. The side effect of setting the test up this way is that it can fail to reject even for samples that are not normal at all. To illustrate, take for example:

    set.seed(450)
    x <- runif(50, min=2, max=4)
    shapiro.test(x)
    #   Shapiro-Wilk normality test
    # data:  x
    # W = 0.9601, p-value = 0.08995
    

    So, according to this test, there is no evidence against normality for this (particular) sample, even though it was drawn from a uniform distribution. What I am trying to say is that there are many cases in which the "extreme" requirement (p < 0.05) is not satisfied, which leads to "acceptance" of the NULL hypothesis, and that might be misleading.
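
    One way to see how often this happens is a small simulation. The sketch below is only illustrative: the helper name rejection_rate, the seed, the number of repetitions and the sample sizes are all assumptions of mine. It estimates how often shapiro.test rejects normality (p < 0.05) for uniform samples of different sizes:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Fraction of uniform samples of size n for which shapiro.test gives p < 0.05.
# rejection_rate is a hypothetical helper, not part of base R.
rejection_rate <- function(n, reps = 500) {
  p_values <- replicate(reps, shapiro.test(runif(n, min = 2, max = 4))$p.value)
  mean(p_values < 0.05)
}

sample_sizes <- c(10, 20, 50, 100)
rates <- sapply(sample_sizes, rejection_rate)

# One row per sample size, with the estimated rejection rate.
data.frame(n = sample_sizes, rejection_rate = rates)
```

    The exact rates depend on the seed, so none are quoted here. The point is that, for small samples in particular, a clearly non-normal distribution can pass the test a good share of the time.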

    Another issue I'd like to quote here, from @PaulHiemstra in the comments, concerns the effect of large sample sizes:

    An additional issue with the Shapiro-Wilk test is that when you feed it more data, the chances of the null hypothesis being rejected become larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough.

    He also points out, though, that R's data size limit protects against this a bit:

    Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
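
    A rough sketch of both points follows. The choice of a t distribution with 10 degrees of freedom as "mildly non-normal" data is my own assumption, as are the seed and the sample sizes. The code compares the p-value for a small and a large sample from the same distribution, then shows that more than 5000 values are refused outright:

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Mildly heavy-tailed data: Student's t with 10 degrees of freedom
# (an assumed example of a small deviation from normality).
small <- rt(50, df = 10)
large <- rt(5000, df = 10)

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value
c(n50 = p_small, n5000 = p_large)

# More than 5000 values are not accepted at all, even if truly normal.
too_many <- try(shapiro.test(rnorm(5001)), silent = TRUE)
inherits(too_many, "try-error")
```

    The two p-values vary with the seed, so they are best compared over several runs rather than read off a single one.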

    If the NULL hypothesis were the opposite, meaning, the samples do not come from a normal distribution, and you get a p-value < 0.05, then you conclude that it is very rare that these samples do not come from a normal distribution (reject the NULL hypothesis). That loosely translates to: It is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've gotten something wrong!

    @PaulHiemstra also comments on practical situations (for example, regression) in which one comes across this problem of testing for normality:

    In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analyse your data correctly.
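
    As a sketch of what that looks like, the code below fits a linear model to the built-in cars data set (my choice of example, not something from the comment) and draws the standard diagnostic plots in a 2 x 2 grid:

```r
# cars ships with R; the model is only an illustration.
fit <- lm(dist ~ speed, data = cars)

# When run as a script, draw to a null device so no plot file is written.
if (!interactive()) pdf(NULL)

old_par <- par(mfrow = c(2, 2))  # four panels on one page
plot(fit)                        # residuals vs fitted, normal Q-Q,
                                 # scale-location, residuals vs leverage
par(old_par)                     # restore the previous layout
```

    The normal Q-Q panel plays the role that shapiro.test would otherwise play, but it is applied to the residuals of the model rather than to the raw response.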

    Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:

    For linear regression,

    1. Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.

    2. Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.

    3. Outliers. A Cook's distance of > 1 is reasonable cause for concern.

    Those are my thoughts (FWIW).
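
    Points 2 and 3 can be checked with base R alone. The sketch below again uses the built-in cars data set as an assumed example. It draws the scale-location plot and lists any observations whose Cook's distance exceeds 1. The HCCM-based tests mentioned in point 2 are not shown here.

```r
# cars ships with R; the model is only an illustration.
fit <- lm(dist ~ speed, data = cars)

# Point 3: Cook's distance for every observation, flagging values above 1.
cooks <- cooks.distance(fit)
influential <- which(cooks > 1)
influential  # empty if no observation exceeds the threshold

# Point 2: scale-location plot, a rough visual check for unequal variances.
if (!interactive()) pdf(NULL)  # null device when run as a script
plot(fit, which = 3)
```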

    Hope this clears things up a bit.