Tags: python, pandas, pandas-groupby, unique

Compare each element in groupby() group to the unique values in that group and get the location of equality


I have a dataframe with subjects and dates for a certain measurement. For each subject, I want to determine whether the date in each row of the group is the first (1), second (2), third (3), ... unique date value for that subject.

To clarify, this is what I am looking for:

    |subject | date | order|
    |A | 01.01.2020 | 1|
    |A | 01.01.2020 | 1|
    |A | 02.01.2020 | 2|
    |B | 01.01.2020 | 1|
    |B | 02.01.2020 | 2|
    |B | 02.01.2020 | 2|

I thought about something like the code below, but a for loop is not allowed inside the lambda passed to apply:

    df['order']=df.groupby(['subject']).apply(lambda x: i if x['date']=value for i, value in enumerate(x['date'].unique()))

Is there a straightforward way to do this?


Solution

  • Use pd.factorize inside GroupBy.transform:

    df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
    print (df)
      subject        date  order  order1
    0       A  01.01.2020      1       1
    1       A  01.01.2020      1       1
    2       A  02.01.2020      2       2
    3       B  01.01.2020      1       1
    4       B  02.01.2020      2       2
    5       B  02.01.2020      2       2
    

    Alternatively, use GroupBy.rank with method='dense', but the date column must first be converted to datetimes:

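    # rank compares values, so parse the date strings as datetimes first
    df['date'] = pd.to_datetime(df['date'])
    df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
    print (df)
      subject       date  order  order2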
    0       A 2020-01-01      1     1.0
    1       A 2020-01-01      1     1.0
    2       A 2020-02-01      2     2.0
    3       B 2020-01-01      1     1.0
    4       B 2020-02-01      2     2.0
    5       B 2020-02-01      2     2.0
    

    The two solutions differ when the dates within a group are not in chronological order:

    print (df)
      subject       date  order (disregarding temporal order of date)
    0       A 2020-01-01      1
    1       A 2020-03-01      2 <- changed datetime for sample
    2       A 2020-02-01      3
    3       B 2020-01-01      1
    4       B 2020-02-01      2
    5       B 2020-02-01      2
    
    df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
    df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
    print (df)
      subject       date  order  order1  order2
    0       A 2020-01-01      1       1     1.0
    1       A 2020-03-01      2       2     3.0
    2       A 2020-02-01      3       3     2.0
    3       B 2020-01-01      1       1     1.0
    4       B 2020-02-01      2       2     2.0
    5       B 2020-02-01      2       2     2.0
    

    In summary: use the first method if the numbering should follow the order in which each date first appears within the group, regardless of chronology; use the second method if the numbering should follow the chronological order of the dates.
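
For reference, the comparison above can be reproduced end to end with a small self-contained sketch. It rebuilds the sample data with subject A's dates out of chronological order. The explicit `format="%d.%m.%Y"` is an assumption about how the date strings should be read (day.month.year); adjust it if your data is month-first. The cast to int on the rank result is an optional convenience, not part of the original answer.

```python
import pandas as pd

# Sample data: subject A's dates are deliberately out of chronological order.
df = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B", "B"],
    "date": ["01.01.2020", "03.01.2020", "02.01.2020",
             "01.01.2020", "02.01.2020", "02.01.2020"],
})

# Assumption: the strings are day.month.year; change the format if they are not.
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

# Method 1: number each date by its order of first appearance within the subject.
df["order1"] = (
    df.groupby("subject")["date"].transform(lambda x: pd.factorize(x)[0]) + 1
)

# Method 2: dense rank by chronological order within the subject.
# rank returns floats; the cast to int is safe here because there are no missing dates.
df["order2"] = df.groupby("subject")["date"].rank(method="dense").astype(int)

print(df)
```

For subject A, order1 comes out as 1, 2, 3 (appearance order) while order2 comes out as 1, 3, 2 (chronological order). For subject B both columns are 1, 2, 2.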