Search code examples
pythonpandasnumpyschedule

Editing the order and content of schedule file in Pandas, Python based on condition


I'm trying to edit a schedule file in pandas, python 3 and am very stuck at the moment.

Basically, I have a schedule file like this:

id    trip_id    origin    destination    courier_status     package_origin   package_destination
1        1         A           B              False                nan               nan
1        2         B           C               True                 X                 Y
2        1         F           G              False                nan               nan
2        2         G           H               True                 Q                 R
2        3         H           I              False                nan               nan

If the courier_status is true, I want them (the person in id) to make a detour to the package_origin and package_destination before continuing to destination, thus changing their schedule file. Ideally, the new schedule file is supposed to be like this, newSchedule:

id    trip_id    origin    destination   status
1        1         A           B         normal
1        2         B           X         courier
1        3         X           Y         courier
1        4         Y           C         normal      
2        1         F           G         normal
2        2         G           Q         courier
2        3         Q           R         courier
2        4         R           H         normal
2        5         H           I         normal

My idea was to make a new df, consisting of only additional trips, then append them to the existing schedule, then remove the duplicates and keep='last', afterwards apply the sort_values on id. However, I haven't been able to make the newSchedule DataFrame. Can anybody give me a hand or give me a direction on what kind of algorithm should I use? I was thinking of using a loop or using np.where?

The real data have many more columns and row, I just want to know how can I work with this. I am a rookie on working with python so I'm very lost at the moment.

Please help!


Solution

  • Here's one option. First, you can split up the DataFrame based on the courier_status column. Many ways to do this, here I use a groupby:

    (_, df_n), (_, df_c) = df.groupby('courier_status')
    

    The normal DataFrame is easy to deal with, just drop some columns and assign the status:

    df_n['status'] = 'normal'
    df_n = df_n.drop(columns=['courier_status', 'package_origin', 'package_destination', 'trip_id'])
    

    The courier DataFrame takes a lot more work. Here we'll need to form the chain from ['origin', 'package_origin', 'package_destination', 'destination'], which can be accomplished by specifying that order, stacking and joining together a shifted version. There's some cleaning up of things I threw in the Index that we needed to keep. Finally assign everything but the last 'package_origin' -> 'package_destination' part a status of 'courier'.

    s = (df_c.set_index(['id'], append=True)
            [['origin', 'package_origin', 'package_destination', 'destination']].stack()
        )
    
    df_c = (pd.concat([s.rename('origin'), s.groupby(level=0).shift(-1).rename('destination')], axis=1)
              .dropna()
              .reset_index(['id'])
              .reset_index(-1, drop=True)
              .assign(status='courier'))
    
    df_c.loc[~df_c.index.duplicated(keep='last'), 'status'] = 'normal'
    

    Finally, because we kept the original index all of this time, we can concat the two together and then sort_index to put the rows in the order they should appear, and define the 'trip_id' using a groupby + cumcount:

    result = pd.concat([df_n, df_c]).sort_index()
    result['trip_id'] = result.groupby('id').cumcount()+1
    
    #   id origin destination   status  trip_id
    #0   1      A           B   normal        1
    #1   1      B           X  courier        2
    #1   1      X           Y  courier        3
    #1   1      Y           C   normal        4
    #2   2      F           G   normal        1
    #3   2      G           Q  courier        2
    #3   2      Q           R  courier        3
    #3   2      R           H   normal        4
    #4   2      H           I   normal        5