
Pandas - How to merge rows based on a criteria


I have a Pandas dataframe with thousands of rows that I want to combine into a smaller number of rows.

In the new dataframe, I want:

  • Certain columns to be the sum of the grouped rows,
  • Some columns to be the average of the grouped rows, and
  • Some other columns to be the weighted average of the same column in the original dataframe, with the weight factor being another column.

My search suggests I might use the agg function to achieve the above (not too sure). But the hardest part is writing the criterion in Pandas that selects the rows to be merged. Here is an example:

import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 4, 2],
                   'col2': [10, 20, 30, 40, 50, 60, 70],
                   'col3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                   'col4': [2, 1, 4, 5, 8, 10, 3]})

My criterion for the number of rows to be merged is that the sum of their column 1 values must exceed a certain threshold. Assume it's 3 here:

  • rows 1, 2 and 3 are grouped (sum is 4 > 3),
  • rows 4 and 5 are grouped together (sum is 5 > 3),
  • row 6 is not grouped (its value alone is 4 > 3) and stays the same in the new dataframe,
  • row 7 is not grouped since no other rows remain, and stays the same in the new dataframe.

Final dataframe would look like this:

Columns 1 and 2 are the sum of the grouped rows. Column 3 is the average of the grouped rows. Column 4 is the weighted average of the grouped rows, with the weight factor being column 1.

newdf = pd.DataFrame({'col1': [4, 5, 4, 2],
                      'col2': [60, 90, 60, 70],
                      'col3': [0.2, 0.45, 0.6, 0.7],
                      'col4': [2.75, 6.8, 10, 3]})
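
As a sanity check on the expected col4 values, the weighted averages can be recomputed directly with NumPy (the group boundaries here are taken from the criteria above):

```python
import numpy as np

# Group of rows 1-3: col4 values [2, 1, 4] weighted by col1 values [1, 1, 2]
g1 = np.average([2, 1, 4], weights=[1, 1, 2])  # (1*2 + 1*1 + 2*4) / 4 = 2.75

# Group of rows 4-5: col4 values [5, 8] weighted by col1 values [2, 3]
g2 = np.average([5, 8], weights=[2, 3])        # (2*5 + 3*8) / 5 = 6.8

print(g1, g2)  # 2.75 6.8
```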

Can someone help me?

I looked into something like this, but two issues remain: the grouping criterion doesn't work, and I'm not sure how to compute the weighted average:

agg_funcs = {'col1': 'sum', 'col2': 'sum','col3': 'mean', 'col4': lambda x: (x['col1'] * x['col4']).sum() / x['col1'].sum()}

grouped_df = df.groupby('col1').agg(agg_funcs)

The code failed and complained about the weighted average piece, but the grouping criterion isn't correct either!


Solution

  • You need to change two things:

    • the way you group the values: you can't group by cumulated value up to a threshold in a vectorized way; you need a loop.
    • the way you compute the weighted average: this can't work with agg, since agg operates on one column/Series at a time without being aware of the other columns. You can, however, pre-multiply by the weights, sum, then divide by the sum of the weights.
    def threshold_grouper(s, thresh=3):
        # Assign a group label to each value, starting a new
        # group once the running total reaches `thresh`.
        group = []
        i = 0
        total = 0
        for val in s:
            total += val
            group.append(i)
            if total >= thresh:  # threshold reached: close the group
                i += 1
                total = 0
        return group
    
    (df.eval('col4 = col4*col1')
       .groupby(threshold_grouper(df['col1']))
       .agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean', 'col4': 'sum'})
       .eval('col4 = col4/col1')
    )
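
    On the sample data, threshold_grouper yields one label per row, and rows that share a label get merged (a quick self-contained check, repeating the function from above):

```python
def threshold_grouper(s, thresh=3):
    # A new group starts once the running total reaches the threshold.
    group = []
    i = 0
    total = 0
    for val in s:
        total += val
        group.append(i)
        if total >= thresh:
            i += 1
            total = 0
    return group

labels = threshold_grouper([1, 1, 2, 2, 3, 4, 2])
print(labels)  # [0, 0, 0, 1, 1, 2, 3] -> four groups, matching the expected newdf
```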
    

    You can also cheat and use a side effect in agg, accessing "col1" externally:

    import numpy as np

    (df.groupby(threshold_grouper(df['col1']))
       .agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean',
             'col4': lambda g: np.average(g, weights=df['col1'].reindex_like(g))})
    )
    

    Output:

       col1  col2  col3   col4
    0     4    60  0.20   2.75
    1     5    90  0.45   6.80
    2     4    60  0.60  10.00
    3     2    70  0.70   3.00