Search code examples
pandaspandas-groupbypandas-apply

How to force Pandas apply to return all columns of parent dataframe?


After using groupby on certain columns of a dataframe, and subsequently using apply to test whether a string exists in another column, pandas only return those columns that were grouped by and the last column created with the apply. Is it possible to return all columns associated with the groupby by and test? For instance grouping by unique identifiers for a conversation thread, and testing whether a string exists in another column, but then include some other columns that exist in the dataframe but belong to the specific group?

I've tried using a groupby, followed by using apply for an anonymous function.

df.head()

 shipment_id shipper_id courier_id  Question                                sender
0   14      9962    228898  Let's get your furbabys home Apple pet transpo...   courier
1   91919   190872  196838  Hi I'm kevin thims and I'm happy to do the job...   courier
2   92187   191128  196838  Hi I'm kevin thims and I'm happy to do the job...   shipper

unique_thread_indentifier = ['shipment_id', 'shipper_id', 'courier_id']
required_variables = ['shipment_id', 'shipper_id', 'courier_id', 'Question', 'sender']

df_new = (
    df
    .groupby(unique_thread_indentifier)[required_variables]
    .apply(lambda group: 'shipper' in group['sender'].unique())
    .to_frame(name='shipper_replied')
    .reset_index()
)

df_new.head()
    shipment_id shipper_id  courier_id  shipper_replied
0   14      9962            228898          False
1   91919   190872          196838          False
2   92187   191128          196838          True

What I'm aiming to do is to include the column Question and sender back in the final dataframe. Expected output would look like:

 shipment_id shipper_id courier_id  Question                                sender        shipper_replied
0   14      9962    228898  Let's get your furbabys home Apple pet transpo...   courier       False
1   91919   190872  196838  Hi I'm kevin thims and I'm happy to do the job...   courier       False
2   92187   191128  196838  Hi I'm kevin thims and I'm happy to do the job...   shipper       True

Solution

  • I believe you need GroupBy.transform:

    df['shipper_replied'] = (df.groupby(unique_thread_indentifier)['sender']
                               .transform(lambda group: 'shipper' in group.unique()))
    
    print (df)
       shipment_id  shipper_id  courier_id  \
    0           14        9962      228898   
    1        91919      190872      196838   
    2        92187      191128      196838   
    
                                              Question   sender  shipper_replied  
    0  Let's get your furbabys home Apple pet transpo.  courier            False  
    1   Hi I'm kevin thims and I'm happy to do the job  courier            False  
    2   Hi I'm kevin thims and I'm happy to do the job  shipper             True  
    

    Another solution:

    df['shipper_replied'] = (df.assign(new = df['sender'].eq('shipper'))
                               .groupby(unique_thread_indentifier)['new']
                               .transform('any'))
    
    print (df)
       shipment_id  shipper_id  courier_id  \
    0           14        9962      228898   
    1        91919      190872      196838   
    2        92187      191128      196838   
    
                                              Question   sender  shipper_replied  
    0  Let's get your furbabys home Apple pet transpo.  courier            False  
    1   Hi I'm kevin thims and I'm happy to do the job  courier            False  
    2   Hi I'm kevin thims and I'm happy to do the job  shipper             True