Search code examples
pythonpandasconditional-statementscriteria

filling a new column based on multiple criteria


Suppose that I have a dataset with three categorical columns: df.type1 df.type2 df.type3 and I want to create a new column [df.new] that it takes:

df.new = df.type1 if df.type1 is true and the remaining are false
df.new = df.type2 if df.type2 is true and the remaining are false
df.new = df.type3 if df.type3 is true and the remaining are false

What is the best approach? I am quite confused by np.where() - too long and too script intense

Example:

City    dt.t1   dt.t2   dt.t3
NY       US Non    EU   Non Asia
Rome     Non US    EU   Non Asia
SF       US Non    EU   Non Asia
HK       Non US    Non EU   Asia

my final result would be:

City  dt.new
NY    US
Rome  EU
SF    US
HK    Asia

Solution

  • Use:

    df = df.set_index('City')
    df['dt.new'] = df.mask(df.apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1).iloc[:, -1]
    

    Alternative solution with select columns for check values:

    cols = df.filter(regex='^dt\.').columns
    #or use list of columns names
    #cols = ['dt.t1','dt.t2','dt.t3']
    df['dt.new'] = df[cols].mask(df[cols].apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1).iloc[:, -1]
    

    print (df)
           dt.t1   dt.t2     dt.t3 dt.new
    City                                 
    NY        US  Non EU  Non Asia     US
    Rome  Non US      EU  Non Asia     EU
    SF        US  Non EU  Non Asia     US
    HK    Non US  Non EU      Asia   Asia
    

    Detail:

    First set_index by City column and then check for Non string with one or more whitespaces:

    df = df.set_index('City')
    
    print (df.apply(lambda x: x.str.contains('Non\s+')))
          dt.t1  dt.t2  dt.t3
    City                     
    NY    False   True   True
    Rome   True  False   True
    SF    False   True   True
    HK     True   True  False
    

    Then replace matching values to NaNs by mask:

    print (df.mask(df.apply(lambda x: x.str.contains('Non\s+'))))
         dt.t1 dt.t2 dt.t3
    City                  
    NY      US   NaN   NaN
    Rome   NaN    EU   NaN
    SF      US   NaN   NaN
    HK     NaN   NaN  Asia
    

    Forward fill non missing values per rows:

    print (df.mask(df.apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1))
         dt.t1 dt.t2 dt.t3
    City                  
    NY      US    US    US
    Rome   NaN    EU    EU
    SF      US    US    US
    HK     NaN   NaN  Asia
    

    And last select last column:

    print (df.mask(df.apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1).iloc[:, -1])
    City
    NY        US
    Rome      EU
    SF        US
    HK      Asia
    Name: dt.t3, dtype: object
    

    EDIT:

    m1 = df['dt.t1'] == 'US'
    m2 = df['dt.t2'] == 'EU'
    m3 = df['dt.t3'] == 'Asia'
    
    df['dt.new'] = np.select([m1, m2, m3], ['US','EU','Asia'], default=None)
    

    Or:

    df['dt.new'] = np.where(m1, 'US',
                   np.where(m2, 'EU',
                   np.where(m3, 'Asia', None)))
    

    print (df)
       City   dt.t1   dt.t2     dt.t3 dt.new
    0    NY      US  Non EU  Non Asia     US
    1  Rome  Non US      EU  Non Asia     EU
    2    SF      US  Non EU  Non Asia     US
    3    HK  Non US  Non EU      Asia   Asia