Tags: python, pandas, if-statement, max

Pandas: how to take max, min, mean from rows by condition


I am facing the following problem: I can do conditional column calculations, but I can't use the functions min, max, and mean for every row by condition.

import pandas as pd
data = {'Name': ['Tom', 'nick', 'krish', 'jack','bob'],
        'level_1': [2, -3, 7, -4,-2],
        'level_2': [6, 5, -3, -9,2],
        'level_3': [-2, -1, 4, 6,-4],
        }

df = pd.DataFrame(data)

print(df)


def f(row):
    if (row['level_1'] < 0) & (row['level_2'] < 0):
        
        val = row['level_1'] + row['level_2']
    elif (row['level_1'] > 0) & (row['level_2'] > 0):

        val = row['level_1'] - row['level_2']

    else:
        val = row[['level_1','level_2','level_3']].max#(axis=1)

    return val

df['level_final'] = df.apply(f, axis=1)

print(40*'#')

print(df)

Can anyone help me, please?


Solution

    # Cleaner & faster solution

    Here is a cleaner (and faster) way to do this using numpy, rather than passing a function to the apply method (which is not vectorized over rows) -

    import numpy as np
    
    df = df.set_index('Name')       #to work with numeric features
    
    #Your 2 conditions
    c1 = (df.level_1 < 0) & (df.level_2 < 0)
    c2 = (df.level_1 > 0) & (df.level_2 > 0)
    
    #Your 3 values
    v1 = df.level_1 + df.level_2    #for c1
    v2 = df.level_1 - df.level_2    #for c2
    v3 = df.max(axis=1)             #if c1 and c2 fail
    
    #apply everything, everywhere all at once
    df['val'] = np.select([c1, c2], [v1, v2], default=v3)
    df = df.reset_index()
    
    print(df)
    
        Name  level_1  level_2  level_3  val
    0    Tom        2        6       -2   -4
    1   nick       -3        5       -1    5
    2  krish        7       -3        4    7
    3   jack       -4       -9        6  -13
    4    bob       -2        2       -4    2
    

    Read more about numpy.select in the NumPy documentation.
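
    If numpy.select is new to you, here is a minimal, standalone sketch of its semantics (the array and values are illustrative, not from the question) -

    import numpy as np

    x = np.array([-2, 0, 3, 7])

    #conditions are checked in order; the first match wins per element
    conditions = [x < 0, x > 5]
    choices = [x * -1, x * 10]      #value used where the matching condition is True

    #default fills positions where no condition matched
    print(np.select(conditions, choices, default=x))    #[ 2  0  3 70]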


    # Notes on your current code

    "i need max from every row (axis=1) by condition" (from your comment)

    The reason your code fails is a misunderstanding of how the apply method works. When you apply a function over rows (axis=1), each row is pulled out as a pd.Series object and your function f is called on it, one row at a time.

    When you call .max(axis=1) on any of these row Series objects, it throws the following error, as expected -

    ValueError: No axis named 1 for object type Series
    

    A Series has only a single axis (axis=0), so you can't pass axis=1 to .max(). And writing .max without parentheses, as in your code, doesn't call the method at all; it just returns the bound method object. Instead, call pd.Series.max() with parentheses. See the pd.Series.max documentation for details.
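
    To see this concretely, here is a tiny sketch with an illustrative row -

    import pandas as pd

    row = pd.Series({'level_1': 2, 'level_2': 6, 'level_3': -2})

    print(row.max())    #6 -- a Series reduces over its single axis
    print(row.max)      #<bound method ...> -- missing parentheses returns the method, not the max
    #row.max(axis=1)    #raises ValueError: No axis named 1 for object type Series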

    #Your code modified
    def f(row):
        if (row['level_1'] < 0) & (row['level_2'] < 0):
            val = row['level_1'] + row['level_2']
        elif (row['level_1'] > 0) & (row['level_2'] > 0):
            val = row['level_1'] - row['level_2']
        else:
            val = row[['level_1','level_2','level_3']].max() #<---
        return val
    
    df['level_final'] = df.apply(f, axis=1)
    print(df)
    
        Name  level_1  level_2  level_3  level_final
    0    Tom        2        6       -2           -4
    1   nick       -3        5       -1            5
    2  krish        7       -3        4            7
    3   jack       -4       -9        6          -13
    4    bob       -2        2       -4            2
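
    If you want to verify the speed difference yourself, here is a rough timing sketch (the frame size and repeat count are arbitrary, and it assumes the corrected f above is in scope; exact numbers will vary by machine) -

    import timeit
    import numpy as np
    import pandas as pd

    #a larger frame so the difference is visible
    big = pd.DataFrame(np.random.randint(-10, 10, size=(10_000, 3)),
                       columns=['level_1', 'level_2', 'level_3'])

    def with_select(df):
        c1 = (df.level_1 < 0) & (df.level_2 < 0)
        c2 = (df.level_1 > 0) & (df.level_2 > 0)
        return np.select([c1, c2],
                         [df.level_1 + df.level_2, df.level_1 - df.level_2],
                         default=df.max(axis=1))

    print(timeit.timeit(lambda: with_select(big), number=10))       #vectorized
    print(timeit.timeit(lambda: big.apply(f, axis=1), number=10))   #row-wise apply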