Tags: python, pandas, group-by, concatenation

Concatenate a DataFrame to another DataFrame inside a Python function


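    import pandas as pd

    def fun(output_data):
        # Aggregate every column with "first" by default; the group keys are excluded.
        dic_ = dict.fromkeys(output_data.columns, "first")
        dic_.pop("col1")
        dic_.pop("col2")
        dic_.update({
            "col9": "sum",
            "col10": "sum",
            "col11": "sum",
            "col12": "sum",
        })

        # Collapse the B2C rows per (col1, col2) group, keeping the original column order.
        tmp = (
            output_data[output_data["col100"].eq("B2C")]
            .groupby(["col1", "col2"], sort=False, as_index=False)
            .agg(dic_)[list(output_data.columns)]
            .reset_index(drop=True)
        )

        # Note: this only rebinds the local name inside the function.
        output_data = pd.concat(
            [tmp, output_data[output_data["col100"].ne("B2C")]])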

I have a DataFrame that I need to filter, then group, and then aggregate on certain columns. After concatenating the pieces, I want the change to apply to the DataFrame that was passed into the function as an argument. I tried it the way shown above, but I am not getting the desired result.

pd.concat() has no inplace=True option.
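The missing `inplace` flag matters because an assignment such as `output_data = pd.concat(...)` inside a function only rebinds the local name, while an in-place method changes the object the caller holds. A minimal sketch of the difference follows; the helpers `rebind` and `mutate` and the two-column frame are illustrative assumptions, not part of the question's code.

```python
import pandas as pd

def rebind(df):
    # Rebinding the parameter: the concatenated frame exists only inside the function.
    df = pd.concat([df, df])

def mutate(df):
    # In-place operation: the object held by the caller is modified.
    df.drop(df.index[df["col100"].eq("B2C")], inplace=True)

frame = pd.DataFrame({"col100": ["B2C", "B2B", "B2C"], "col3": [1, 2, 3]})

rebind(frame)
rows_after_rebind = len(frame)   # still 3, the caller's frame is untouched

mutate(frame)
rows_after_mutate = len(frame)   # 1, only the B2B row is left
```

Returning the new frame and reassigning it at the call site is the other common way to get the result back to the caller.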

Example:

Input DataFrame

col1    col2    col3  col4  col5  ......  col100
fixval  fixval  12    'a'   'b'   ......  B2C
fixval  fixval  12    'a'   'c'   ......  B2C
fixval  fixval  12    'a'   'b'   ......  B2C
fixval  fixval  12    'a'   'b'   ......  B2C
fixval  fixval  12    'b'   'a'   ......  B2B
fixval  fixval  12    'b'   'a'   ......  B2B

Output DataFrame

col1    col2    col3  col4  col5  ......  col100
fixval  fixval  36    'a'   'b'   ......  B2C
fixval  fixval  12    'a'   'c'   ......  B2C
fixval  fixval  12    'b'   'a'   ......  B2B
fixval  fixval  12    'b'   'a'   ......  B2B

In this example the grouping is on col4 and col5, col3 is summed within each group, and only rows where col100 equals B2C are aggregated.
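To make the example concrete, the transformation can be sketched on a cut-down frame. The assumptions here are that only the six columns shown in the tables are modelled, that `col3` is the one summed column, and that the names `group_keys`, `agg_map`, `collapsed`, and `result` are illustrative.

```python
import pandas as pd

# Stand-in for the input table: only the columns that are actually shown.
df = pd.DataFrame({
    "col1": ["fixval"] * 6,
    "col2": ["fixval"] * 6,
    "col3": [12] * 6,
    "col4": ["a", "a", "a", "a", "b", "b"],
    "col5": ["b", "c", "b", "b", "a", "a"],
    "col100": ["B2C", "B2C", "B2C", "B2C", "B2B", "B2B"],
})

group_keys = ["col4", "col5"]

# "first" for every column by default, "sum" for col3; group keys are not aggregated.
agg_map = dict.fromkeys(df.columns, "first")
for key in group_keys:
    agg_map.pop(key)
agg_map["col3"] = "sum"

is_b2c = df["col100"].eq("B2C")
collapsed = (
    df[is_b2c]
    .groupby(group_keys, sort=False, as_index=False)
    .agg(agg_map)[list(df.columns)]   # restore the original column order
)

# Aggregated B2C groups first, untouched B2B rows after, with a fresh 0..n-1 index.
result = pd.concat([collapsed, df[~is_b2c]], ignore_index=True)
```

With this input, `result` has four rows and its `col3` values are 36, 12, 12, 12, which matches the output table.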

The result then needs to be assigned back to the original DataFrame, the one that is passed to the function as an argument.


Solution

  • One solution that I have found so far.

    It may not be the best solution, but it does meet the need to update the DataFrame inside the function.

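    # Remove the B2C rows from the caller's DataFrame in place.
    output_data.drop(output_data.loc[output_data['col100'].eq('B2C')].index, inplace=True)
    # Renumber the remaining rows so that len(output_data) is always an unused label.
    output_data.reset_index(drop=True, inplace=True)

    # Append the aggregated rows one at a time (tmp is the grouped frame from the question).
    for idx, row in tmp.iterrows():
        output_data.loc[len(output_data)] = row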
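Appending with `loc[len(output_data)]` relies on the row labels running from 0 to len - 1. After a `drop`, the surviving labels are not renumbered, so that assumption is worth checking. The sketch below uses a made-up four-row frame and the hypothetical names `unsafe` and `safe` to show what happens with and without renumbering.

```python
import pandas as pd

df = pd.DataFrame({
    "col100": ["B2C", "B2B", "B2B", "B2B"],
    "col3": [10, 20, 30, 40],
})

# Dropping the B2C row leaves labels 1, 2, 3 while len(df) becomes 3.
df.drop(df.index[df["col100"].eq("B2C")], inplace=True)
labels_after_drop = list(df.index)

# Without renumbering, label 3 already exists, so the assignment overwrites that row.
unsafe = df.copy()
unsafe.loc[len(unsafe)] = ["B2C", 99]

# After renumbering to 0, 1, 2, label 3 is new, so the row is appended.
safe = df.copy()
safe.reset_index(drop=True, inplace=True)
safe.loc[len(safe)] = ["B2C", 99]
```

Here `unsafe` still has three rows and has lost the row with `col3 == 40`, while `safe` ends up with four rows.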