Search code examples
pythonpython-3.xpandasdataframesum

Sum rows based on columns inside pandas dataframe


I am quite new to pandas, but I use python at a good level.

I have a pandas dataframe which is organized as follows

idrun    idbasin    time    q
-192540      1        0     0
-192540      1        1     0.5
...
-192540      2        0     0
-192540      2        1     1
...
-192540      3        0     0
-192540      3        1     1
...
-192541      1        0     0
-192541      1        1     0.5
...
-192541      2        0     0
-192541      2        1     1
...
-192541      3        0     0
-192541      3        1     1
...

It is a fairly large dataframe (7 columns and ~600k rows).

What I would like to do is: given a tuple containing values referring to the idbasin column (e.g. (1,2)), if the idrun value is the same

  1. sum the q column of the referred idbasin values, i.e. for the example it would be (1,2)
  2. remove the rows corresponding to that idrun value and the tuple-specified idbasin values
  3. insert the summed values with idbasin equal to the first number of the tuple.

Referring to my example df, the results would be

idrun    idbasin    time    q
-192540      1        0     0
-192540      1        1     1.5
...
-192540      3        0     0
-192540      3        1     1
...
-192541      1        0     0
-192541      1        1     1.5
...
-192541      3        0     0
-192541      3        1     1
...

My solution would to use groupby to turn the df to a dict and then do the operation with one or two for loops, but I understand that iterating in pandas is not the optimal solution, so I believe there could be a "pandas" solution using the df.


Solution

  • You can replace values of tuple by first value of tuple in Series.mask and then aggregate sum:

    tup = (1, 2)
    
    df['idbasin'] = df['idbasin'].mask(df['idbasin'].isin(tup), tup[0])
    #alternative
    #df['idbasin'] = np.where(df['idbasin'].isin(tup), tup[0], df['idbasin'])
    df = df.groupby(['idrun', 'idbasin','time'], as_index=False)['q'].sum()
    print (df)
        idrun  idbasin  time    q
    0 -192541        1     0  0.0
    1 -192541        1     1  1.5
    2 -192541        3     0  0.0
    3 -192541        3     1  1.0
    4 -192540        1     0  0.0
    5 -192540        1     1  1.5
    6 -192540        3     0  0.0
    7 -192540        3     1  1.0