Search code examples
pythonpandasqliksense

How to Link the trip based on condition and assign unique key?


I have a trip data where I have start date, end date, start location and end location. I need to generate a key for this data set.

First, sort the data based on start date and then for every start and end date pair if the end location of previous row and start location of current row are same, assign same ID else assign new id. If date are different by default assign new ID.

This is the output I require. [For input just delete the Key column]

Start        End     Start Location   End Location  Key
06-11-2020   17-11-2020     A           B           1
17-11-2020   17-11-2020     B           C           2
17-11-2020   17-11-2020     C           D           2
19-11-2020   19-11-2020     E           F           3
19-11-2020   19-11-2020     G           A           4
19-11-2020   19-11-2020     A           H           4
19-11-2020   19-11-2020     H           G           4
19-11-2020   19-11-2020     G           A           4
20-11-2020   20-11-2020     E           H           5
20-11-2020   20-11-2020     H           I           5
20-11-2020   20-11-2020     J           A           6
21-11-2020   21-11-2020     E           G           7
21-11-2020   21-11-2020     K           A           8
21-11-2020   21-11-2020     E           A           9
21-11-2020   21-11-2020     E           G           10
21-11-2020   21-11-2020     G           A           10 

Solution

  • You need another key to group with Start and End Start columns according your requirement (check if the current Location is the same as previous End Location). After that, use ngroup to get the group number:

    df['Key'] = (df.sort_values('Start')
                   .assign(Loc=df['Location'].ne(df['End Location'].shift()).cumsum())
                   .groupby(['Start', 'End Start', 'Loc'])
                   .ngroup().add(1).drop(columns='Loc'))
    

    Output:

    >>> df
             Start   End Start Location End Location  Key
    0   06-11-2020  17-11-2020        A            B    1
    1   17-11-2020  17-11-2020        B            C    2
    2   17-11-2020  17-11-2020        C            D    2
    3   19-11-2020  19-11-2020        E            F    3
    4   19-11-2020  19-11-2020        G            A    4
    5   19-11-2020  19-11-2020        A            H    4
    6   19-11-2020  19-11-2020        H            G    4
    7   19-11-2020  19-11-2020        G            A    4
    8   20-11-2020  20-11-2020        E            H    5
    9   20-11-2020  20-11-2020        H            I    5
    10  20-11-2020  20-11-2020        J            A    6
    11  21-11-2020  21-11-2020        E            G    7
    12  21-11-2020  21-11-2020        K            A    8
    13  21-11-2020  21-11-2020        E            A    9
    14  21-11-2020  21-11-2020        E            G   10
    15  21-11-2020  21-11-2020        G            A   10
    

    Details about Loc:

    # Loc = df['Location'].ne(df['End Location'].shift()).cumsum().rename('Loc')
    >>> pd.concat([df[['Location', 'End Location']], Loc], axis=1)
       Location End Location  Loc
    0         A            B    1
    1         B            C    1
    2         C            D    1
    3         E            F    2
    4         G            A    3
    5         A            H    3
    6         H            G    3
    7         G            A    3
    8         E            H    4
    9         H            I    4
    10        J            A    5
    11        E            G    6
    12        K            A    7
    13        E            A    8
    14        E            G    9
    15        G            A    9