I'm trying to edit a schedule file in pandas, python 3 and am very stuck at the moment.
Basically, I have a schedule
file like this:
id trip_id origin destination courier_status package_origin package_destination
1 1 A B False nan nan
1 2 B C True X Y
2 1 F G False nan nan
2 2 G H True Q R
2 3 H I False nan nan
If the courier_status
is true, I want them (the person in id
) to make a detour to the package_origin
and package_destination
before continuing to destination
, thus changing their schedule file. Ideally, the new schedule file is supposed to be like this, newSchedule
:
id trip_id origin destination status
1 1 A B normal
1 2 B X courier
1 3 X Y courier
1 4 Y C normal
2 1 F G normal
2 2 G Q courier
2 3 Q R courier
2 4 R H normal
2 5 H I normal
My idea was to make a new df
, consisting of only additional trips, then append them to the existing schedule
, then remove the duplicates and keep='last'
, afterwards apply the sort_values
on id
. However, I haven't been able to make the newSchedule
DataFrame. Can anybody give me a hand or give me a direction on what kind of algorithm should I use? I was thinking of using a loop or using np.where
?
The real data have many more columns and row, I just want to know how can I work with this. I am a rookie on working with python so I'm very lost at the moment.
Please help!
Here's one option. First, you can split up the DataFrame based on the courier_status column. Many ways to do this, here I use a groupby:
(_, df_n), (_, df_c) = df.groupby('courier_status')
The normal DataFrame is easy to deal with, just drop some columns and assign the status:
df_n['status'] = 'normal'
df_n = df_n.drop(columns=['courier_status', 'package_origin', 'package_destination', 'trip_id'])
The courier DataFrame takes a lot more work. Here we'll need to form the chain from ['origin', 'package_origin', 'package_destination', 'destination']
, which can be accomplished by specifying that order, stacking and joining together a shifted version. There's some cleaning up of things I threw in the Index that we needed to keep. Finally assign everything but the last 'package_origin' -> 'package_destination' part a status of 'courier'.
s = (df_c.set_index(['id'], append=True)
[['origin', 'package_origin', 'package_destination', 'destination']].stack()
)
df_c = (pd.concat([s.rename('origin'), s.groupby(level=0).shift(-1).rename('destination')], axis=1)
.dropna()
.reset_index(['id'])
.reset_index(-1, drop=True)
.assign(status='courier'))
df_c.loc[~df_c.index.duplicated(keep='last'), 'status'] = 'normal'
Finally, because we kept the original index all of this time, we can concat
the two together and then sort_index
to put the rows in the order they should appear, and define the 'trip_id' using a groupby
+ cumcount
:
result = pd.concat([df_n, df_c]).sort_index()
result['trip_id'] = result.groupby('id').cumcount()+1
# id origin destination status trip_id
#0 1 A B normal 1
#1 1 B X courier 2
#1 1 X Y courier 3
#1 1 Y C normal 4
#2 2 F G normal 1
#3 2 G Q courier 2
#3 2 Q R courier 3
#3 2 R H normal 4
#4 2 H I normal 5