I have a data frame consisting of PERMNO (ID), date and RET (predicted stock return), where I want to separate the top and bottom n% of observations based on RET's value for each date. Afterwards I want to rank them and merge the ranks back into the original data frame.
If we focus on the date 1994-05-23, the data frame looks like the following:
print(EQ_port.loc[EQ_port['date'] == '1994-05-23'])
PERMNO date RET
1360 10001.0 1994-05-23 0.000000
8557 10002.0 1994-05-23 0.120000
14628 10003.0 1994-05-23 -0.037050
17002 10009.0 1994-05-23 -0.038180
19983 10010.0 1994-05-23 0.000000
21049 10011.0 1994-05-23 0.011360
23332 10016.0 1994-05-23 0.000000
26431 10018.0 1994-05-23 -0.166600
28149 10019.0 1994-05-23 -0.040530
33427 10025.0 1994-05-23 -0.044770
40472 10026.0 1994-05-23 0.000000
49501 10032.0 1994-05-23 -0.051730
58517 10035.0 1994-05-23 -0.060000
62790 10037.0 1994-05-23 0.013885
66649 10039.0 1994-05-23 0.088260
69775 10042.0 1994-05-23 0.000000
74588 10043.0 1994-05-23 0.000000
77202 10046.0 1994-05-23 -0.081050
78740 10047.0 1994-05-23 -0.166600
I separate and rank the observations, grouped by date, as follows:
a = 0.2
# Separating:
EQ_top20 = (EQ_port.groupby('date',group_keys=False).apply(lambda x: x.nlargest(int(len(x) * a), 'RET')))
EQ_bottom20 = (EQ_port.groupby('date',group_keys=False).apply(lambda x: x.nsmallest(int(len(x) * a), 'RET')))
# Ranking:
EQ_top20['rank'] = EQ_top20.groupby("date", group_keys=True)['RET'].rank(method='average', ascending=True)
EQ_bottom20['rank'] = -EQ_bottom20.groupby("date", group_keys=True)['RET'].rank(method='average', ascending=False)
Which yields the following data frames:
print(EQ_top20.loc[EQ_top20['date'] == '1994-05-23'].sort_values('rank'))
PERMNO date RET rank
62790 10037.0 1994-05-23 0.013885 1.0
66649 10039.0 1994-05-23 0.088260 2.0
8557 10002.0 1994-05-23 0.120000 3.0
print(EQ_bottom20.loc[EQ_bottom20['date'] == '1994-05-23'].sort_values('rank'))
PERMNO date RET rank
26431 10018.0 1994-05-23 -0.16660 -2.5
78740 10047.0 1994-05-23 -0.16660 -2.5
77202 10046.0 1994-05-23 -0.08105 -1.0
I want to merge these ranks back into the original data frame, but I can't figure out how. Does anyone have an idea of how to do this? The result I want is:
PERMNO date RET rank
1360 10001.0 1994-05-23 0.000000 0
8557 10002.0 1994-05-23 0.120000 3
14628 10003.0 1994-05-23 -0.037050 0
17002 10009.0 1994-05-23 -0.038180 0
19983 10010.0 1994-05-23 0.000000 0
21049 10011.0 1994-05-23 0.011360 0
23332 10016.0 1994-05-23 0.000000 0
26431 10018.0 1994-05-23 -0.166600 -2.5
28149 10019.0 1994-05-23 -0.040530 0
33427 10025.0 1994-05-23 -0.044770 0
40472 10026.0 1994-05-23 0.000000 0
49501 10032.0 1994-05-23 -0.051730 0
58517 10035.0 1994-05-23 -0.060000 0
62790 10037.0 1994-05-23 0.013885 1
66649 10039.0 1994-05-23 0.088260 2
69775 10042.0 1994-05-23 0.000000 0
74588 10043.0 1994-05-23 0.000000 0
77202 10046.0 1994-05-23 -0.081050 -1.0
78740 10047.0 1994-05-23 -0.166600 -2.5
Also, if you have any ideas on how to separate the top and bottom n% of observations grouped by date in a faster way, it would be greatly appreciated. I find that lambda functions run quite slowly on large datasets.
Thank you in advance!
You can just concat the ranks from the two data frames onto the original one like so:
ranks = pd.concat([EQ_top20, EQ_bottom20])["rank"]
EQ_port = pd.concat([EQ_port, ranks], axis=1).fillna(0)
Here, I concatenated the top 20% and bottom 20% ranks using pd.concat([EQ_top20, EQ_bottom20])["rank"]. The resulting ranks variable looks like this:
1 3.0
14 2.0
13 1.0
7 -2.5
18 -2.5
17 -1.0
Name: rank, dtype: float64
Since the indexes of these values are preserved through all the operations, we can simply concatenate this Series to the original data frame along the columns, and each value will land at the correct index. So here, index 1 gets rank 3, index 14 gets 2, and so on. In your case the indexes are slightly different, but it works the same way. Finally, I filled the empty values with zeros.
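As a minimal, self-contained illustration of that index alignment (the values here are made up):

```python
import pandas as pd

# Original frame with default integer index 0..4 (hypothetical values)
df = pd.DataFrame({'RET': [0.1, -0.2, 0.05, -0.3, 0.2]})

# A ranks Series covering only a subset of rows, keeping their original index
ranks = pd.Series({4: 1.0, 3: -2.0, 1: -1.0}, name='rank')

# concat along columns aligns on the index; rows without a rank become NaN,
# which fillna(0) then turns into the neutral rank 0
out = pd.concat([df, ranks], axis=1).fillna(0)
```

Rows 0 and 2, which appear in neither subset, end up with rank 0, exactly as in the desired output.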
Final output looks like this:
PERMNO date RET rank
0 10001.0 1994-05-23 0.000000 0.0
1 10002.0 1994-05-23 0.120000 3.0
2 10003.0 1994-05-23 -0.037050 0.0
3 10009.0 1994-05-23 -0.038180 0.0
4 10010.0 1994-05-23 0.000000 0.0
5 10011.0 1994-05-23 0.011360 0.0
6 10016.0 1994-05-23 0.000000 0.0
7 10018.0 1994-05-23 -0.166600 -2.5
8 10019.0 1994-05-23 -0.040530 0.0
9 10025.0 1994-05-23 -0.044770 0.0
10 10026.0 1994-05-23 0.000000 0.0
11 10032.0 1994-05-23 -0.051730 0.0
12 10035.0 1994-05-23 -0.060000 0.0
13 10037.0 1994-05-23 0.013885 1.0
14 10039.0 1994-05-23 0.088260 2.0
15 10042.0 1994-05-23 0.000000 0.0
16 10043.0 1994-05-23 0.000000 0.0
17 10046.0 1994-05-23 -0.081050 -1.0
18 10047.0 1994-05-23 -0.166600 -2.5
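As for the speed question: the apply/lambda pattern can be replaced entirely with vectorized groupby.rank calls. Below is a sketch under the assumption that ties at the cutoff should be broken by row order, which is what nlargest/nsmallest do by default (the frame and RET values are made up for illustration):

```python
import pandas as pd

# Toy frame in the same shape as EQ_port (values are made up)
df = pd.DataFrame({
    'PERMNO': range(1, 11),
    'date': ['1994-05-23'] * 10,
    'RET': [0.12, -0.17, 0.09, -0.17, 0.01, 0.0, -0.08, 0.05, -0.04, 0.02],
})
a = 0.2

# Size of each date group and the number of rows in each top/bottom slice
n = df.groupby('date')['RET'].transform('size')
k = (n * a).astype(int)

# method='first' breaks ties by row order, so exactly k rows are selected
# per date, mimicking nlargest/nsmallest
is_top = df.groupby('date')['RET'].rank(method='first', ascending=False) <= k
is_bot = df.groupby('date')['RET'].rank(method='first', ascending=True) <= k

# Re-rank only the selected rows (method='average', as in the question);
# .loc assigns by index, so each rank lands back in the right row
df['rank'] = 0.0
df.loc[is_top, 'rank'] = df[is_top].groupby('date')['RET'].rank(method='average')
df.loc[is_bot, 'rank'] = -df[is_bot].groupby('date')['RET'].rank(method='average', ascending=False)
```

Since this keeps every row of the original frame, no merge step is needed afterwards, and everything is a single-pass groupby operation rather than a Python-level lambda per group.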