python · pandas · sorting · group-by · ranking

Sorting and ranking by dates, on a group in a pandas df


Given the following sort of dataframe, I would like to be able to both sort and rank the id field on date:

import pandas as pd

df = pd.DataFrame({
'id':[1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 7],
'value':[.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
'date':['10/01/2017 15:45:00','05/01/2017 15:56:00',
        '11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00',
        '05/01/2017 09:55:00','05/01/2017 10:08:00','03/02/2017 08:55:00',
        '03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
        '19/01/2017 16:34:00']})

that is, to effectively rank or index, per id, based on date.

I've used

df.groupby('id')['date'].min()

which lets me extract the first date (although I don't know how to use this to filter out the rows; see the sketch below). But I won't always need the first date; sometimes it will be the second or third, so I need to generate a new column with an index for the date. The result would look like:

[expected output: the dataframe with an added date_rank column indexing each id's dates]
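
As an aside on the filtering point above, here is a minimal sketch of keeping only the rows at each id's earliest date (assuming the day-first date strings are parsed to datetimes first):

import pandas as pd

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# keep only the rows whose date equals the per-id minimum
first_rows = df[df['date'] == df.groupby('id')['date'].transform('min')]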

Any ideas on this sorting/ranking/labelling?

EDIT

My original model ignored a very prevalent issue.

Some ids feasibly have multiple tests performed on them in parallel, so they appear in multiple rows of the database with matching dates (the date corresponds to when they were logged). These should be counted as the same date and should not increment the date_rank. I've generated a model with an updated date_rank to demonstrate how this would look:

df = pd.DataFrame({
'id':[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7],
'value':[.01, .4, .5, .7, .77, .1, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
'date':['10/01/2017 15:45:00','10/01/2017 15:45:00','05/01/2017 15:56:00',
        '11/01/2017 15:22:00','11/01/2017 15:22:00','06/01/2017 11:02:00',
        '05/01/2017 09:37:00','05/01/2017 09:37:00','05/01/2017 09:55:00',
        '05/01/2017 09:55:00','05/01/2017 10:08:00','05/01/2017 10:09:00',
        '03/02/2017 08:55:00','03/02/2017 09:15:00','03/02/2017 09:31:00',
        '09/01/2017 15:42:00','19/01/2017 16:34:00']})

The date_rank counter would then look like this:

[expected output: the updated model, where rows of an id with matching dates share the same date_rank]


Solution

  • You can try sorting the date values in descending order and then numbering them within each 'id' group.

    @praveen's logic is simpler; extending it, you can use astype('category') to convert the values to categories and retrieve the codes (the categories' keys), though the result will be a little different from your expected output:

    df1 = df.sort_values(['id', 'date'], ascending=[True, False])
    # category codes number each id's dates in sorted order; +1 makes them 1-based
    df1['date_rank'] = df1.groupby(['id']).apply(lambda x: x['date'].astype('category').cat.codes + 1).values
    

    Out:

                     date   id  value   date_rank
    0   10/01/2017 15:45:00 1   0.01    2
    1   10/01/2017 15:45:00 1   0.40    2
    2   05/01/2017 15:56:00 1   0.50    1
    3   11/01/2017 15:22:00 2   0.70    1
    4   11/01/2017 15:22:00 2   0.77    1
    5   06/01/2017 11:02:00 3   0.10    2
    6   05/01/2017 09:37:00 3   0.20    1
    7   05/01/2017 09:37:00 3   0.30    1
    8   05/01/2017 09:55:00 4   0.11    1
    9   05/01/2017 09:55:00 4   0.21    1
    11  05/01/2017 10:09:00 5   0.01    2
    10  05/01/2017 10:08:00 5   0.40    1
    14  03/02/2017 09:31:00 6   0.80    3
    13  03/02/2017 09:15:00 6   0.50    2
    12  03/02/2017 08:55:00 6   3.00    1
    16  19/01/2017 16:34:00 7   0.10    2
    15  09/01/2017 15:42:00 7   0.90    1
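
    The codes come from the lexicographically sorted string categories, which is why the numbering here ascends with the date rather than matching your expected ranks. A tiny standalone check (with hypothetical values) makes the mechanism visible:

    import pandas as pd

    s = pd.Series(['b', 'a', 'b', 'c'])
    # categories are sorted ('a', 'b', 'c'), so the codes follow that order
    print(s.astype('category').cat.codes.tolist())  # [1, 0, 1, 2]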
    

    but to get your exact output, you can instead map each date to its 1-based position among the group's unique dates; after the descending sort, the newest date is seen first and therefore gets rank 1:

    df1 = df.sort_values(['id', 'date'], ascending=[True, False])
    # map each date to its 1-based position among the group's unique dates
    df1['date_rank'] = df1.groupby('id')['date'].transform(
        lambda x: x.map({d: i + 1 for i, d in enumerate(x.unique())})
    )
    

    Out:

                    date    id  value   date_rank
    0   10/01/2017 15:45:00 1   0.01    1
    1   10/01/2017 15:45:00 1   0.40    1
    2   05/01/2017 15:56:00 1   0.50    2
    3   11/01/2017 15:22:00 2   0.70    1
    4   11/01/2017 15:22:00 2   0.77    1
    5   06/01/2017 11:02:00 3   0.10    1
    6   05/01/2017 09:37:00 3   0.20    2
    7   05/01/2017 09:37:00 3   0.30    2
    8   05/01/2017 09:55:00 4   0.11    1
    9   05/01/2017 09:55:00 4   0.21    1
    11  05/01/2017 10:09:00 5   0.01    1
    10  05/01/2017 10:08:00 5   0.40    2
    14  03/02/2017 09:31:00 6   0.80    1
    13  03/02/2017 09:15:00 6   0.50    2
    12  03/02/2017 08:55:00 6   3.00    3
    16  19/01/2017 16:34:00 7   0.10    1
    15  09/01/2017 15:42:00 7   0.90    2
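
    Alternatively, as a minimal sketch (assuming the day-first date strings are parsed to real datetimes first), a dense groupby rank gives the same tie behaviour in one step:

    import pandas as pd

    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
    # dense rank within each id: the newest date gets 1, and identical
    # timestamps share a rank instead of incrementing it
    df['date_rank'] = (df.groupby('id')['date']
                         .rank(method='dense', ascending=False)
                         .astype(int))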