Search code examples
pandasdataframemergeranking

Separating top and bottom n% of data frame and merging them again using Python


I have a data frame, consisting of PERMNO (ID), date and RET (predicted stock return), where I want to separate the top and bottom n% of observations based on RET's value for each date. Hereafter I want to rank them and them and merge the ranks with the original data frame.

If we focus on the date 1994-05-23, the data frame looks like the following:

print(EQ_port.loc[EQ_port['date'] == '1994-05-23'])

        PERMNO        date       RET
1360   10001.0  1994-05-23  0.000000
8557   10002.0  1994-05-23  0.120000
14628  10003.0  1994-05-23 -0.037050
17002  10009.0  1994-05-23 -0.038180
19983  10010.0  1994-05-23  0.000000
21049  10011.0  1994-05-23  0.011360
23332  10016.0  1994-05-23  0.000000
26431  10018.0  1994-05-23 -0.166600
28149  10019.0  1994-05-23 -0.040530
33427  10025.0  1994-05-23 -0.044770
40472  10026.0  1994-05-23  0.000000
49501  10032.0  1994-05-23 -0.051730
58517  10035.0  1994-05-23 -0.060000
62790  10037.0  1994-05-23  0.013885
66649  10039.0  1994-05-23  0.088260
69775  10042.0  1994-05-23  0.000000
74588  10043.0  1994-05-23  0.000000
77202  10046.0  1994-05-23 -0.081050
78740  10047.0  1994-05-23 -0.166600

I separate and rank each observation grouped by dates by the following:

a = 0.2

# Separating: 
EQ_top20 = (EQ_port.groupby('date',group_keys=False).apply(lambda x: x.nlargest(int(len(x) * a), 'RET')))
EQ_bottom20 = (EQ_port.groupby('date',group_keys=False).apply(lambda x: x.nsmallest(int(len(x) * a), 'RET')))

# Ranking:
EQ_top20['rank'] = EQ_top20.groupby("date", group_keys=True)['RET'].rank(method='average', ascending=True)
EQ_bottom20['rank'] = -EQ_bottom20.groupby("date", group_keys=True)['RET'].rank(method='average', ascending=False)

Which yields the following data frames:

print(EQ_top20.loc[EQ_top20['date'] == '1994-05-23'].sort_values('rank')):
        PERMNO        date       RET  rank
62790  10037.0  1994-05-23  0.013885   1.0
66649  10039.0  1994-05-23  0.088260   2.0
8557   10002.0  1994-05-23  0.120000   3.0

print(EQ_bottom20.loc[EQ_bottom20['date'] == '1994-05-23'].sort_values('rank')):
        PERMNO        date      RET  rank
26431  10018.0  1994-05-23 -0.16660  -2.5
78740  10047.0  1994-05-23 -0.16660  -2.5
77202  10046.0  1994-05-23 -0.08105  -1.0

I want to merge these ranks with the original data frame, but I can't figure out how. Do any of you have an idea of how to do this? What I want to get to is:

        PERMNO        date      RET  rank
1360   10001.0  1994-05-23  0.000000 0
8557   10002.0  1994-05-23  0.120000 3
14628  10003.0  1994-05-23 -0.037050 0
17002  10009.0  1994-05-23 -0.038180 0
19983  10010.0  1994-05-23  0.000000 0
21049  10011.0  1994-05-23  0.011360 0
23332  10016.0  1994-05-23  0.000000 0
26431  10018.0  1994-05-23 -0.166600 -2.5
28149  10019.0  1994-05-23 -0.040530 0
33427  10025.0  1994-05-23 -0.044770 0
40472  10026.0  1994-05-23  0.000000 0
49501  10032.0  1994-05-23 -0.051730 0
58517  10035.0  1994-05-23 -0.060000 0
62790  10037.0  1994-05-23  0.013885 1
66649  10039.0  1994-05-23  0.088260 2
69775  10042.0  1994-05-23  0.000000 0
74588  10043.0  1994-05-23  0.000000 0
77202  10046.0  1994-05-23 -0.081050 -1.0
78740  10047.0  1994-05-23 -0.166600 -2.5

Also, if you have any ideas of how to separate the top and bottom n% of observations grouped by date in a faster way, it would be greatly appreciated. I find that lambda functions are running quite slowly on large datasets.

Thank you in advance!


Solution

  • You can just concat the two to the original dataframe like so:

    ranks = pd.concat([EQ_top20, EQ_bottom20])["rank"]
    EQ_port = pd.concat([EQ_port, ranks], axis=1).fillna(0)
    

    Here, I concatenated the top 20 and the bottom 20 using:

    pd.concat([EQ_top20, EQ_bottom20])["rank"]
    

    ranks varaible looks like this:

    1     3.0
    14    2.0
    13    1.0
    7    -2.5
    18   -2.5
    17   -1.0
    Name: rank, dtype: float64
    

    Since the indexes of these values are still maintained after all the operations we can just concatenate this to the original dataframe and the values will concatenated at the correct index. So here, index 1 will get rank 3, index 14 will get 2, and so on. In your case the indexes are slightly different but it should work. Finally, I filled the empty values with zeros.

    Final output looks like this:

         PERMNO        date       RET  rank
    0   10001.0  1994-05-23  0.000000   0.0
    1   10002.0  1994-05-23  0.120000   3.0
    2   10003.0  1994-05-23 -0.037050   0.0
    3   10009.0  1994-05-23 -0.038180   0.0
    4   10010.0  1994-05-23  0.000000   0.0
    5   10011.0  1994-05-23  0.011360   0.0
    6   10016.0  1994-05-23  0.000000   0.0
    7   10018.0  1994-05-23 -0.166600  -2.5
    8   10019.0  1994-05-23 -0.040530   0.0
    9   10025.0  1994-05-23 -0.044770   0.0
    10  10026.0  1994-05-23  0.000000   0.0
    11  10032.0  1994-05-23 -0.051730   0.0
    12  10035.0  1994-05-23 -0.060000   0.0
    13  10037.0  1994-05-23  0.013885   1.0
    14  10039.0  1994-05-23  0.088260   2.0
    15  10042.0  1994-05-23  0.000000   0.0
    16  10043.0  1994-05-23  0.000000   0.0
    17  10046.0  1994-05-23 -0.081050  -1.0
    18  10047.0  1994-05-23 -0.166600  -2.5