Say you have the following df called df_trading_pair
which contains price series registered in the 30m timeframe:
Start Date Open Price High Price Low Price Close Price End Date
0 2022-07-09 22:30:00.000 0.04805 0.04806 0.0476 0.04775 2022-07-09 22:59:59.999
1 2022-07-09 23:00:00.000 0.04773 0.04776 0.0476 0.04764 2022-07-09 23:29:59.999
2 2022-07-09 23:30:00.000 0.04762 0.04776 0.04752 0.04769 2022-07-09 23:59:59.999
3 2022-07-10 00:00:00.000 0.04771 0.04789 0.04687 0.047 2022-07-10 00:29:59.999
4 2022-07-10 00:30:00.000 0.047 0.04708 0.04663 0.04671 2022-07-10 00:59:59.999
5 2022-07-10 01:00:00.000 0.0467 0.04681 0.04648 0.04671 2022-07-10 01:29:59.999
6 2022-07-10 01:30:00.000 0.04671 0.04698 0.04656 0.0467 2022-07-10 01:59:59.999
7 2022-07-10 02:00:00.000 0.04657 0.04671 0.04623 0.04625 2022-07-10 02:29:59.999
8 2022-07-10 02:30:00.000 0.04625 0.04629 0.04603 0.04624 2022-07-10 02:59:59.999
9 2022-07-10 03:00:00.000 0.04622 0.04623 0.04584 0.04591 2022-07-10 03:29:59.999
10 2022-07-10 03:30:00.000 0.04591 0.04677 0.04576 0.04586 2022-07-10 03:59:59.999
11 2022-07-10 04:00:00.000 0.04581 0.0461 0.04549 0.04592 2022-07-10 04:29:59.999
12 2022-07-10 04:30:00.000 0.04595 0.04614 0.04575 0.04596 2022-07-10 04:59:59.999
13 2022-07-10 05:00:00.000 0.04593 0.0461 0.04574 0.04596 2022-07-10 05:29:59.999
14 2022-07-10 05:30:00.000 0.04601 0.04614 0.04582 0.0459 2022-07-10 05:59:59.999
15 2022-07-10 06:00:00.000 0.04589 0.04591 0.04553 0.04571 2022-07-10 06:29:59.999
16 2022-07-10 06:30:00.000 0.04563 0.04586 0.04532 0.04552 2022-07-10 06:59:59.999
17 2022-07-10 07:00:00.000 0.04546 0.04571 0.04543 0.04563 2022-07-10 07:29:59.999
18 2022-07-10 07:30:00.000 0.04559 0.04592 0.04559 0.04579 2022-07-10 07:59:59.999
19 2022-07-10 08:00:00.000 0.04579 0.04611 0.04579 0.04598 2022-07-10 08:29:59.999
20 2022-07-10 08:30:00.000 0.04601 0.0461 0.04572 0.04576 2022-07-10 08:59:59.999
21 2022-07-10 09:00:00.000 0.04576 0.04587 0.04562 0.04572 2022-07-10 09:29:59.999
22 2022-07-10 09:30:00.000 0.04573 0.04589 0.043 0.0456 2022-07-10 09:59:59.999
23 2022-07-10 10:00:00.000 0.04559 0.04584 0.04555 0.04562 2022-07-10 10:29:59.999
24 2022-07-10 10:30:00.000 0.04567 0.04581 0.04552 0.04555 2022-07-10 10:59:59.999
25 2022-07-10 11:00:00.000 0.04561 0.04588 0.04548 0.04549 2022-07-10 11:29:59.999
26 2022-07-10 11:30:00.000 0.0455 0.04564 0.04532 0.04553 2022-07-10 11:59:59.999
27 2022-07-10 12:00:00.000 0.04558 0.04568 0.04549 0.04563 2022-07-10 12:29:59.999
28 2022-07-10 12:30:00.000 0.04563 0.04565 0.04555 0.04557 2022-07-10 12:59:59.999
29 2022-07-10 13:00:00.000 0.04568 0.04578 0.04534 0.04542 2022-07-10 13:29:59.999
30 2022-07-10 13:30:00.000 0.04543 0.04669 0.0447 0.04483 2022-07-10 13:59:59.999
31 2022-07-10 14:00:00.000 0.04481 0.04481 0.04434 0.04448 2022-07-10 14:29:59.999
32 2022-07-10 14:30:00.000 0.04449 0.04453 0.04421 0.04443 2022-07-10 14:59:59.999
33 2022-07-10 15:00:00.000 0.04442 0.0448 0.04436 0.04447 2022-07-10 15:29:59.999
34 2022-07-10 15:30:00.000 0.04441 0.04465 0.04438 0.0446 2022-07-10 15:59:59.999
35 2022-07-10 16:00:00.000 0.04463 0.04464 0.04419 0.04432 2022-07-10 16:29:59.999
36 2022-07-10 16:30:00.000 0.04434 0.04474 0.0443 0.04452 2022-07-10 16:59:59.999
37 2022-07-10 17:00:00.000 0.04452 0.04482 0.04451 0.0448 2022-07-10 17:29:59.999
38 2022-07-10 17:30:00.000 0.0448 0.04499 0.04475 0.04497 2022-07-10 17:59:59.999
39 2022-07-10 18:00:00.000 0.04488 0.045 0.04472 0.04499 2022-07-10 18:29:59.999
40 2022-07-10 18:30:00.000 0.04502 0.04539 0.04483 0.04535 2022-07-10 18:59:59.999
41 2022-07-10 19:00:00.000 0.04532 0.04559 0.04497 0.04509 2022-07-10 19:29:59.999
42 2022-07-10 19:30:00.000 0.0451 0.04531 0.04501 0.04514 2022-07-10 19:59:59.999
43 2022-07-10 20:00:00.000 0.04514 0.04528 0.04508 0.04525 2022-07-10 20:29:59.999
44 2022-07-10 20:30:00.000 0.04515 0.0452 0.04509 0.04517 2022-07-10 20:59:59.999
45 2022-07-10 21:00:00.000 0.04522 0.04546 0.04513 0.04514 2022-07-10 21:29:59.999
46 2022-07-10 21:30:00.000 0.04519 0.04534 0.04513 0.04534 2022-07-10 21:59:59.999
47 2022-07-10 22:00:00.000 0.04534 0.04534 0.04474 0.0449 2022-07-10 22:29:59.999
48 2022-07-10 22:30:00.000 0.0448 0.04507 0.04475 0.04482 2022-07-10 22:59:59.999
49 2022-07-10 23:00:00.000 0.04481 0.04481 0.04443 0.04455 2022-07-10 23:29:59.999
50 2022-07-10 23:30:00.000 0.04451 0.04451 0.04428 0.04442 2022-07-10 23:59:59.999
51 2022-07-11 00:00:00.000 0.04442 0.04442 0.04374 0.04382 2022-07-11 00:29:59.999
52 2022-07-11 00:30:00.000 0.0438 0.04413 0.04368 0.04399 2022-07-11 00:59:59.999
53 2022-07-11 01:00:00.000 0.04395 0.04402 0.04319 0.04342 2022-07-11 01:29:59.999
54 2022-07-11 01:30:00.000 0.04343 0.04348 0.04323 0.04329 2022-07-11 01:59:59.999
55 2022-07-11 02:00:00.000 0.04333 0.04338 0.04314 0.04331 2022-07-11 02:29:59.999
56 2022-07-11 02:30:00.000 0.04332 0.04332 0.04312 0.04318 2022-07-11 02:59:59.999
57 2022-07-11 03:00:00.000 0.04317 0.04337 0.04309 0.04322 2022-07-11 03:29:59.999
58 2022-07-11 03:30:00.000 0.04322 0.04327 0.04299 0.04307 2022-07-11 03:59:59.999
59 2022-07-11 04:00:00.000 0.04305 0.04313 0.04273 0.0428 2022-07-11 04:29:59.999
After you plot the price series above using the following lines:
import finplot as fplt
fplt.candlestick_ochl(df_trading_pair[['Start Date', 'Open Price', 'Close Price', 'High Price', 'Low Price']]).colors.update(dict(bull_frame='#26a69a', bull_body='#26a69a', bull_shadow='#26a69a'))
fplt.show()
You get its corresponding trading chart:
Then, you realize that this trading series clearly has an outlier, which is 0.043
and can be called by running df_trading_pair['Low Price'].iat[22]
in this case.
The thing is, how could a few lines of code identify that?
I was thinking of considering using the following formula:
In [2]:
(df_trading_pair["Close Price"]-df_trading_pair["Low Price"])/df_trading_pair["Low Price"]*100
Which returns the following data:
0 0.315126
1 0.084034
2 0.357744
3 0.277363
4 0.171563
5 0.494836
6 0.300687
7 0.043262
8 0.456224
9 0.152705
10 0.218531
11 0.945263
12 0.459016
13 0.480979
14 0.174596
15 0.395344
16 0.441306
17 0.440238
18 0.438693
19 0.414938
20 0.087489
21 0.219202
22 6.046512
23 0.153677
24 0.065905
25 0.021988
26 0.463372
27 0.307760
28 0.043908
29 0.176445
30 0.290828
31 0.315742
32 0.497625
33 0.247971
34 0.495719
35 0.294184
36 0.496614
37 0.651539
38 0.491620
39 0.603757
40 1.159938
41 0.266845
42 0.288825
43 0.377107
44 0.177423
45 0.022158
46 0.465322
47 0.357622
48 0.156425
49 0.270088
50 0.316170
51 0.182899
52 0.709707
53 0.532531
54 0.138793
55 0.394066
56 0.139147
57 0.301694
58 0.186090
59 0.163819
dtype: float64
I understand that the values above should move between from 0.021987
to 1.159938
and that 6.046512
is the outlier, but I'm not an expert in statistics so I don't know how exactly should I play with this, may I get some help here?
I finally managed to figure out a solution adjusted to my case, first I create a function called outlier
:
def outlier(df, col_name):
q1 = np.percentile(np.array(df[col_name].tolist()), 25)
q3 = np.percentile(np.array(df[col_name].tolist()), 75)
IQR = q3 - q1
Q3 = q1+(3*IQR)
Q1 = q3-(3*IQR)
outlier_num = 0
for value in df[col_name].values.tolist():
if (value < Q1) | (value > Q3):
print(f'the outlier value is {value}')
outlier_num +=1
return Q1, Q3, outlier_num
I create a sub-dataframe called df_open_low_price_range
which will store the series for further analysis with the aim of finding outlier values (in my particular case, the equation above is the right one to use in order to find the outlier value):
df_open_low_price_range = pd.DataFrame((df_trading_pair["Open Price"]-df_trading_pair["Low Price"])/df_trading_pair["Low Price"]*100, columns= ["Open Low Price Range"])
Then the outlier
function is executed which ends up returning a tuple that is stored in a variable called the_outlier
:
the_outlier = outlier(df_open_low_price_range, 'Open Low Price Range')
Finally, 2 conditions are set to know if there were outlier values or not in the series provided:
if the_outlier[2] == 0:
print("This series DOES NOT HAVE outliers")
elif the_outlier[2] >= 1:
print("This series DOES HAVE outliers")