Search code examples
python-3.xpandasstatisticsoutlierstrading

How to correctly check if there's at least one outlier in a OHLC price series on Python? Pandas related


Say you have the following df called df_trading_pair which contains price series registered in the 30m timeframe:

    Start Date              Open Price  High Price  Low Price   Close Price  End Date
0   2022-07-09 22:30:00.000 0.04805     0.04806     0.0476      0.04775      2022-07-09 22:59:59.999
1   2022-07-09 23:00:00.000 0.04773     0.04776     0.0476      0.04764      2022-07-09 23:29:59.999
2   2022-07-09 23:30:00.000 0.04762     0.04776     0.04752     0.04769      2022-07-09 23:59:59.999
3   2022-07-10 00:00:00.000 0.04771     0.04789     0.04687     0.047        2022-07-10 00:29:59.999
4   2022-07-10 00:30:00.000 0.047       0.04708     0.04663     0.04671      2022-07-10 00:59:59.999
5   2022-07-10 01:00:00.000 0.0467      0.04681     0.04648     0.04671      2022-07-10 01:29:59.999
6   2022-07-10 01:30:00.000 0.04671     0.04698     0.04656     0.0467       2022-07-10 01:59:59.999
7   2022-07-10 02:00:00.000 0.04657     0.04671     0.04623     0.04625      2022-07-10 02:29:59.999
8   2022-07-10 02:30:00.000 0.04625     0.04629     0.04603     0.04624      2022-07-10 02:59:59.999
9   2022-07-10 03:00:00.000 0.04622     0.04623     0.04584     0.04591      2022-07-10 03:29:59.999
10  2022-07-10 03:30:00.000 0.04591     0.04677     0.04576     0.04586      2022-07-10 03:59:59.999
11  2022-07-10 04:00:00.000 0.04581     0.0461      0.04549     0.04592      2022-07-10 04:29:59.999
12  2022-07-10 04:30:00.000 0.04595     0.04614     0.04575     0.04596      2022-07-10 04:59:59.999
13  2022-07-10 05:00:00.000 0.04593     0.0461      0.04574     0.04596      2022-07-10 05:29:59.999
14  2022-07-10 05:30:00.000 0.04601     0.04614     0.04582     0.0459       2022-07-10 05:59:59.999
15  2022-07-10 06:00:00.000 0.04589     0.04591     0.04553     0.04571      2022-07-10 06:29:59.999
16  2022-07-10 06:30:00.000 0.04563     0.04586     0.04532     0.04552      2022-07-10 06:59:59.999
17  2022-07-10 07:00:00.000 0.04546     0.04571     0.04543     0.04563      2022-07-10 07:29:59.999
18  2022-07-10 07:30:00.000 0.04559     0.04592     0.04559     0.04579      2022-07-10 07:59:59.999
19  2022-07-10 08:00:00.000 0.04579     0.04611     0.04579     0.04598      2022-07-10 08:29:59.999
20  2022-07-10 08:30:00.000 0.04601     0.0461      0.04572     0.04576      2022-07-10 08:59:59.999
21  2022-07-10 09:00:00.000 0.04576     0.04587     0.04562     0.04572      2022-07-10 09:29:59.999
22  2022-07-10 09:30:00.000 0.04573     0.04589     0.043       0.0456       2022-07-10 09:59:59.999
23  2022-07-10 10:00:00.000 0.04559     0.04584     0.04555     0.04562      2022-07-10 10:29:59.999
24  2022-07-10 10:30:00.000 0.04567     0.04581     0.04552     0.04555      2022-07-10 10:59:59.999
25  2022-07-10 11:00:00.000 0.04561     0.04588     0.04548     0.04549      2022-07-10 11:29:59.999
26  2022-07-10 11:30:00.000 0.0455      0.04564     0.04532     0.04553      2022-07-10 11:59:59.999
27  2022-07-10 12:00:00.000 0.04558     0.04568     0.04549     0.04563      2022-07-10 12:29:59.999
28  2022-07-10 12:30:00.000 0.04563     0.04565     0.04555     0.04557      2022-07-10 12:59:59.999
29  2022-07-10 13:00:00.000 0.04568     0.04578     0.04534     0.04542      2022-07-10 13:29:59.999
30  2022-07-10 13:30:00.000 0.04543     0.04669     0.0447      0.04483      2022-07-10 13:59:59.999
31  2022-07-10 14:00:00.000 0.04481     0.04481     0.04434     0.04448      2022-07-10 14:29:59.999
32  2022-07-10 14:30:00.000 0.04449     0.04453     0.04421     0.04443      2022-07-10 14:59:59.999
33  2022-07-10 15:00:00.000 0.04442     0.0448      0.04436     0.04447      2022-07-10 15:29:59.999
34  2022-07-10 15:30:00.000 0.04441     0.04465     0.04438     0.0446       2022-07-10 15:59:59.999
35  2022-07-10 16:00:00.000 0.04463     0.04464     0.04419     0.04432      2022-07-10 16:29:59.999
36  2022-07-10 16:30:00.000 0.04434     0.04474     0.0443      0.04452      2022-07-10 16:59:59.999
37  2022-07-10 17:00:00.000 0.04452     0.04482     0.04451     0.0448       2022-07-10 17:29:59.999
38  2022-07-10 17:30:00.000 0.0448      0.04499     0.04475     0.04497      2022-07-10 17:59:59.999
39  2022-07-10 18:00:00.000 0.04488     0.045       0.04472     0.04499      2022-07-10 18:29:59.999
40  2022-07-10 18:30:00.000 0.04502     0.04539     0.04483     0.04535      2022-07-10 18:59:59.999
41  2022-07-10 19:00:00.000 0.04532     0.04559     0.04497     0.04509      2022-07-10 19:29:59.999
42  2022-07-10 19:30:00.000 0.0451      0.04531     0.04501     0.04514      2022-07-10 19:59:59.999
43  2022-07-10 20:00:00.000 0.04514     0.04528     0.04508     0.04525      2022-07-10 20:29:59.999
44  2022-07-10 20:30:00.000 0.04515     0.0452      0.04509     0.04517      2022-07-10 20:59:59.999
45  2022-07-10 21:00:00.000 0.04522     0.04546     0.04513     0.04514      2022-07-10 21:29:59.999
46  2022-07-10 21:30:00.000 0.04519     0.04534     0.04513     0.04534      2022-07-10 21:59:59.999
47  2022-07-10 22:00:00.000 0.04534     0.04534     0.04474     0.0449       2022-07-10 22:29:59.999
48  2022-07-10 22:30:00.000 0.0448      0.04507     0.04475     0.04482      2022-07-10 22:59:59.999
49  2022-07-10 23:00:00.000 0.04481     0.04481     0.04443     0.04455      2022-07-10 23:29:59.999
50  2022-07-10 23:30:00.000 0.04451     0.04451     0.04428     0.04442      2022-07-10 23:59:59.999
51  2022-07-11 00:00:00.000 0.04442     0.04442     0.04374     0.04382      2022-07-11 00:29:59.999
52  2022-07-11 00:30:00.000 0.0438      0.04413     0.04368     0.04399      2022-07-11 00:59:59.999
53  2022-07-11 01:00:00.000 0.04395     0.04402     0.04319     0.04342      2022-07-11 01:29:59.999
54  2022-07-11 01:30:00.000 0.04343     0.04348     0.04323     0.04329      2022-07-11 01:59:59.999
55  2022-07-11 02:00:00.000 0.04333     0.04338     0.04314     0.04331      2022-07-11 02:29:59.999
56  2022-07-11 02:30:00.000 0.04332     0.04332     0.04312     0.04318      2022-07-11 02:59:59.999
57  2022-07-11 03:00:00.000 0.04317     0.04337     0.04309     0.04322      2022-07-11 03:29:59.999
58  2022-07-11 03:30:00.000 0.04322     0.04327     0.04299     0.04307      2022-07-11 03:59:59.999
59  2022-07-11 04:00:00.000 0.04305     0.04313     0.04273     0.0428       2022-07-11 04:29:59.999

After you plot the price series above using the following lines:

import finplot as fplt
fplt.candlestick_ochl(df_trading_pair[['Start Date', 'Open Price', 'Close Price', 'High Price', 'Low Price']]).colors.update(dict(bull_frame='#26a69a', bull_body='#26a69a', bull_shadow='#26a69a'))
fplt.show()

You get its corresponding trading chart:

trading_chart

Then, you realize that this trading series clearly has an outlier, which is 0.043 and can be called by running df_trading_pair['Low Price'].iat[22] in this case.

The thing is, how could a few lines of code identify that?

I was thinking of considering using the following formula:

In [2]: (df_trading_pair["Close Price"]-df_trading_pair["Low Price"])/df_trading_pair["Low Price"]*100

Which returns the following data:

0     0.315126
1     0.084034
2     0.357744
3     0.277363
4     0.171563
5     0.494836
6     0.300687
7     0.043262
8     0.456224
9     0.152705
10    0.218531
11    0.945263
12    0.459016
13    0.480979
14    0.174596
15    0.395344
16    0.441306
17    0.440238
18    0.438693
19    0.414938
20    0.087489
21    0.219202
22    6.046512
23    0.153677
24    0.065905
25    0.021988
26    0.463372
27    0.307760
28    0.043908
29    0.176445
30    0.290828
31    0.315742
32    0.497625
33    0.247971
34    0.495719
35    0.294184
36    0.496614
37    0.651539
38    0.491620
39    0.603757
40    1.159938
41    0.266845
42    0.288825
43    0.377107
44    0.177423
45    0.022158
46    0.465322
47    0.357622
48    0.156425
49    0.270088
50    0.316170
51    0.182899
52    0.709707
53    0.532531
54    0.138793
55    0.394066
56    0.139147
57    0.301694
58    0.186090
59    0.163819
dtype: float64

I understand that the values above should move between from 0.021987 to 1.159938 and that 6.046512 is the outlier, but I'm not an expert in statistics so I don't know how exactly should I play with this, may I get some help here?


Solution

  • I finally managed to figure out a solution adjusted to my case, first I create a function called outlier:

    def outlier(df, col_name):
        q1 = np.percentile(np.array(df[col_name].tolist()), 25)
        q3 = np.percentile(np.array(df[col_name].tolist()), 75)
        IQR = q3 - q1
                          
        Q3 = q1+(3*IQR)
        Q1 = q3-(3*IQR)
        outlier_num = 0
                          
        for value in df[col_name].values.tolist():
            if (value < Q1) | (value > Q3):
                print(f'the outlier value is {value}')
                outlier_num +=1
        return Q1, Q3, outlier_num
    

    I create a sub-dataframe called df_open_low_price_range which will store the series for further analysis with the aim of finding outlier values (in my particular case, the equation above is the right one to use in order to find the outlier value):

    df_open_low_price_range = pd.DataFrame((df_trading_pair["Open Price"]-df_trading_pair["Low Price"])/df_trading_pair["Low Price"]*100, columns= ["Open Low Price Range"])
    

    Then the outlier function is executed which ends up returning a tuple that is stored in a variable called the_outlier:

    the_outlier = outlier(df_open_low_price_range, 'Open Low Price Range')
    

    Finally, 2 conditions are set to know if there were outlier values or not in the series provided:

    if the_outlier[2] == 0:
       print("This series DOES NOT HAVE outliers")
    
    elif the_outlier[2] >= 1:
       print("This series DOES HAVE outliers")