Search code examples
pythonpandasdataframegroup-bypipe

How to use numpy.where in a pipe function for pandas dataframe groupby?


Here is a script to simulate the issue I am facing:

import pandas as pd
import numpy as np

data = {
        'a':[1,2,1,1,2,1,1],
        'b':[10,40,20,10,40,10,20],
        'c':[0.3, 0.2, 0.6, 0.4, 0.5, 0.2, 0.8],
        'd':[3, 1, 5, 1, 7, 2., 2.],
        }

df = pd.DataFrame.from_dict(data)

# I apply some custom function to populate column 'e'.
# For demonstration, I am using a very simple function here.
df['e']=df.apply(lambda x: x['c']<=0.3, axis=1)

# This is the column I need to obtain using groupby and pipe/transform
df['f']=[2., 1., 0., 2., 1., 2., 0.]

print(df)

Output:

   a   b    c    d      e    f
0  1  10  0.3  3.0   True  2.0
1  2  40  0.2  1.0   True  1.0
2  1  20  0.6  5.0  False  0.0
3  1  10  0.4  1.0  False  2.0
4  2  40  0.5  7.0  False  1.0
5  1  10  0.2  2.0   True  2.0
6  1  20  0.8  2.0  False  0.0

The logic to be used to find column f is as follows:

For each group of df.groupby(['a', 'b']):

  • select entries with True value for e.
  • if there are any item in the selected array:
    • find entry with minimum d and return d (in real application, d needs to be manipulated in conjunction with other columns, and the result would be returned)
  • else:
    • return 0

What I have tried:

def func(x):
    print(type(x))
    print(x)
    print('-'*50)

    ind=np.where(x['e']) #<--- How can I implement this?

    if len(ind)>0:
        ind_min=np.argmin(x.iloc[ind]['d'])
        return x.iloc[ind[ind_min]]['d']
    else:
        return 0
    
df['g']=df.groupby(['a', 'b']).pipe(func)

Output:

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B348735550>
--------------------------------------------------
...
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.

I get the mentioned error on line: ind=np.where(x['e']) #<--- How can I implement this?

So, how can apply np.where on a pandas.core.groupby.generic.DataFrameGroupBy object?


Solution

  • You're presenting the XY problem. Here's one approach:

    cond = df['c'] <= 0.3
    
    df['f'] = (
        df.assign(filtered_d=df['d'].where(cond))
        .groupby(['a', 'b'])['filtered_d']
        .transform('min')
        .fillna(0)
        )
    

    Output:

       a   b    c    d      e    f
    0  1  10  0.3  3.0   True  2.0
    1  2  40  0.2  1.0   True  1.0
    2  1  20  0.6  5.0  False  0.0
    3  1  10  0.4  1.0  False  2.0
    4  2  40  0.5  7.0  False  1.0
    5  1  10  0.2  2.0   True  2.0
    6  1  20  0.8  2.0  False  0.0
    

    Explanation / Intermediate

    • First, apply Series.where to column 'd' based on the boolean series to keep only the values that we want to consider for min:
    # df['d'].where(cond)
    
    0    3.0
    1    1.0
    2    NaN
    3    NaN
    4    NaN
    5    2.0
    6    NaN
    Name: d, dtype: float64
    

    An alternative way to do this could be:

    cond = df['c'] <= 0.3
    
    df['f'] = (
        df.merge(
            df[cond]
            .groupby(['a', 'b'], as_index=False)
            .agg(f=('d', 'min')),
            on=['a', 'b'],
            how='left'
        ).assign(f=lambda x: x['f'].fillna(0))
    )
    

    Explanation