Tags: python, quantitative-finance, trading, algorithmic-trading

How to apply the Shapiro-Wilk Test on a specific data column in Python


I’d like to apply this test to the percent daily returns of SPY. After getting the historical data for this symbol from Yahoo, I calculate the percent daily returns (as shown in the code below). But when I apply the test, the p-value is always “1.00” and the returned statistic is always “nan”, no matter whether I change the date range or the symbol (for example, QQQ instead of SPY).

Below you can see the code that I’m using:

from datetime import date
import pandas_datareader as dr
from scipy.stats import shapiro

df = dr.data.get_data_yahoo('spy',start='2010-01-01',end='2015-01-01')
df['PCT'] = df['Close'].pct_change()

stat, p = shapiro(df['PCT'])
print('Statistics=%.3f, p=%.3f' % (stat, p))

I have tried different things but couldn’t find a solution, and I am stuck. Any idea how to apply the test correctly to the PCT column? Any help is more than welcome. Thanks!


Solution

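  • Step 1: "when I apply the test the P value is always “1.00” and the return of the stats is always “nan”"

    No, sir, it is not, at least not for the returns themselves. Once the leading NaN is kept out of the input (explained further down), the same call reports a finite W and a very small p-value: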

    print( 'Statistics\n(W)= %e,\n p = %e' % ( stat, p ) ) # will produce:
    ...    
    (W)= 9.438160e-01
     p = 1.909053e-21
    

    The core issue is how .pct_change() works:

    >>> df['PCT'] = df['Close'].pct_change() # this computes & stores .pct_change()
    >>> df                                   # read print( df['Close'].pct_change.__doc__ )
    
                      High  ...       Close       Volume   Adj Close       PCT
    Date                                                                                         
    2010-01-04  113.389999  ...  113.330002  118944600.0   93.675278       NaN
    2010-01-05  113.680000  ...  113.629997  111579900.0   93.923241  0.002647
    2010-01-06  113.989998  ...  113.709999  116074400.0   93.989357  0.000704
    2010-01-07  114.330002  ...  114.190002  131091100.0   94.386139  0.004221
    2010-01-08  114.620003  ...  114.570000  126402800.0   94.700218  0.003328
    ...
    

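    Because .pct_change() defaults to periods=1, the first row has no previous value to compare against, so df['PCT'].iloc[0] is and must be NaN.

    So call W_stat, p_value = shapiro( df['PCT'][1:] ) instead, which leaves out the one value that has no meaning for shapiro(). Here df['PCT'].dropna() selects the same rows.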

    print( shapiro.__doc__ ) # for more details
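
A minimal sketch of the fix that runs without a network call. The synthetic random-walk series `close` is a hypothetical stand-in for df['Close'], and the seed, length and volatility are arbitrary assumptions chosen only for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Hypothetical stand-in for df['Close']: a synthetic random-walk price series.
rng = np.random.default_rng(0)
close = pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=500))))

pct = close.pct_change()       # first element is NaN: there is no previous price
print(pct.isna().sum())        # number of NaN values in the returns

clean = pct.dropna()           # same rows as pct[1:] in this case
w_stat, p_value = shapiro(clean)
print('W = %e, p = %e' % (w_stat, p_value))
```

Using .dropna() rather than [1:] also covers NaN values that appear elsewhere in the column, for example from gaps in the price data.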
    

    A NaN in the sample cannot be compared against the normal reference distribution, so W cannot be computed and comes back as nan. The p == 1 reported alongside it is a by-product of that NaN input, not a test result: it says nothing either way about the null hypothesis of normality. By contrast, the very small p-values obtained once the NaN is excluded are strong evidence against normally distributed daily returns.


    The same holds for other symbols ( SPY, QQQ, AAPL, AMZN, ... ):

    >>> shapiro( dr.data.get_data_yahoo( 'SPY',
                                          start = '2010-01-01',
                                          end   = '2015-01-01'
                                          )['Close'].pct_change()[1:]
                 )
    (0.943816065788269, 1.9090532861060437e-21)
    
    >>> shapiro( dr.data.get_data_yahoo( 'QQQ',
                                          start = '2010-01-01',
                                          end   = '2015-01-01'
                                          )['Close'].pct_change()[1:]
                 )
    (0.9631340503692627, 2.548133516564297e-17)
    
    >>> shapiro( dr.data.get_data_yahoo( 'AAPL',
                                          start = '2010-01-01',
                                          end   = '2015-01-01'
                                          )['Close'].pct_change()[1:]
                 )
    (0.9560988545417786, 5.674560331738808e-19)
    
    >>> shapiro( dr.data.get_data_yahoo( 'AMZN',
                                          start = '2010-01-01',
                                          end   = '2015-01-01'
                                          )['Close'].pct_change()[1:]
                 )
    (0.9394155740737915, 3.106424182886848e-22)
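
The repeated calls above could be wrapped in a small helper. This is only a sketch: `shapiro_on_returns` is a hypothetical name, and the synthetic `prices` dictionary (with made-up labels) stands in for the downloaded 'Close' columns so the example runs offline:

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

def shapiro_on_returns(close):
    """Shapiro-Wilk test on the percent returns of a price Series (hypothetical helper)."""
    returns = close.pct_change().dropna()   # drop the leading NaN before testing
    w_stat, p_value = shapiro(returns)
    return float(w_stat), float(p_value)

# Synthetic placeholder prices; in practice these would be the 'Close' columns
# returned by the data reader for each symbol.
rng = np.random.default_rng(1)
prices = {
    name: pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=300))))
    for name in ('SERIES_A', 'SERIES_B')
}

results = {name: shapiro_on_returns(close) for name, close in prices.items()}
for name, (w_stat, p_value) in results.items():
    print('%s: W = %.4f, p = %.3e' % (name, w_stat, p_value))
```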