After using groupby on certain columns of a dataframe, and subsequently using apply to test whether a string exists in another column, pandas only return those columns that were grouped by and the last column created with the apply. Is it possible to return all columns associated with the groupby by and test? For instance grouping by unique identifiers for a conversation thread, and testing whether a string exists in another column, but then include some other columns that exist in the dataframe but belong to the specific group?
I've tried using a groupby, followed by using apply for an anonymous function.
df.head()
shipment_id shipper_id courier_id Question sender
0 14 9962 228898 Let's get your furbabys home Apple pet transpo... courier
1 91919 190872 196838 Hi I'm kevin thims and I'm happy to do the job... courier
2 92187 191128 196838 Hi I'm kevin thims and I'm happy to do the job... shipper
unique_thread_indentifier = ['shipment_id', 'shipper_id', 'courier_id']
required_variables = ['shipment_id', 'shipper_id', 'courier_id', 'Question', 'sender']
df_new = (
df
.groupby(unique_thread_indentifier)[required_variables]
.apply(lambda group: 'shipper' in group['sender'].unique())
.to_frame(name='shipper_replied')
.reset_index()
)
df_new.head()
shipment_id shipper_id courier_id shipper_replied
0 14 9962 228898 False
1 91919 190872 196838 False
2 92187 191128 196838 True
What I'm aiming to do is to include the column Question
and sender
back in the final dataframe. Expected output would look like:
shipment_id shipper_id courier_id Question sender shipper_replied
0 14 9962 228898 Let's get your furbabys home Apple pet transpo... courier False
1 91919 190872 196838 Hi I'm kevin thims and I'm happy to do the job... courier False
2 92187 191128 196838 Hi I'm kevin thims and I'm happy to do the job... shipper True
I believe you need GroupBy.transform
:
df['shipper_replied'] = (df.groupby(unique_thread_indentifier)['sender']
.transform(lambda group: 'shipper' in group.unique()))
print (df)
shipment_id shipper_id courier_id \
0 14 9962 228898
1 91919 190872 196838
2 92187 191128 196838
Question sender shipper_replied
0 Let's get your furbabys home Apple pet transpo. courier False
1 Hi I'm kevin thims and I'm happy to do the job courier False
2 Hi I'm kevin thims and I'm happy to do the job shipper True
Another solution:
df['shipper_replied'] = (df.assign(new = df['sender'].eq('shipper'))
.groupby(unique_thread_indentifier)['new']
.transform('any'))
print (df)
shipment_id shipper_id courier_id \
0 14 9962 228898
1 91919 190872 196838
2 92187 191128 196838
Question sender shipper_replied
0 Let's get your furbabys home Apple pet transpo. courier False
1 Hi I'm kevin thims and I'm happy to do the job courier False
2 Hi I'm kevin thims and I'm happy to do the job shipper True