Tags: python, pandas, numpy, sklearn-pandas

Append a new column based on existing columns


Pandas newbie here.

I'm trying to create a new column in my data frame that will serve as a training label when I feed this into a classifier.

The value of the label column is 1.0 if a given Id has (Value1 > 0) or (Value2 > 0) for Apples or Pears, and 0.0 otherwise.

My dataframe is row indexed by Id and looks like this:

Out[30]: 
                Value1                                               Value2  \
    ProductName    7Up     Apple Cheetos     Onion      Pear PopTart    7Up   
    ProductType Drinks Groceries  Snacks Groceries Groceries  Snacks Drinks   
Id                                                                        
100                0.0       1.0     2.0       4.0       0.0     0.0    0.0   
101                3.0       0.0     0.0       0.0       3.0     0.0    4.0   
102                0.0       0.0     0.0       0.0       0.0     2.0    0.0   


    ProductName     Apple Cheetos     Onion      Pear PopTart  
    ProductType Groceries  Snacks Groceries Groceries  Snacks  
Id                                                         
100                   1.0     3.0       3.0       0.0     0.0  
101                   0.0     0.0       0.0       2.0     0.0  
102                   0.0     0.0       0.0       0.0     1.0  

If the pandas wizards could give me a hand with the syntax for this operation - my mind is struggling to put it all together.

Thanks!


Solution

  • The answer provided by @vlad.rad works, but it is inefficient: it loops over every row in Python and so cannot benefit from the speedup of NumPy's vectorized operations. The following vectorized solution should be faster:

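    # Keep only the Apple and Pear columns (under both Value1 and Value2)
    fruit = df.loc[:, df.columns.get_level_values('ProductName').isin(['Apple', 'Pear'])]
    # One boolean per Id: True if any Apple/Pear value is positive
    condition = (fruit > 0).any(axis=1)
    df['label'] = condition.astype(float)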
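
    For anyone who wants to try this end to end, here is a minimal, self-contained sketch. It rebuilds the frame from the question, assuming the columns are a three-level MultiIndex whose second and third levels are named ProductName and ProductType, as the printed output suggests. With multi-level columns, selecting `df['Value1']` returns a sub-frame rather than a single column, so the sketch narrows to the Apple and Pear columns first and then collapses each row to one boolean with `any(axis=1)`. The names `products`, `is_fruit` and `label` are only illustrative.

```python
import pandas as pd

# Rebuild the frame from the question (values copied from the printed output).
products = [
    ("7Up", "Drinks"), ("Apple", "Groceries"), ("Cheetos", "Snacks"),
    ("Onion", "Groceries"), ("Pear", "Groceries"), ("PopTart", "Snacks"),
]
columns = pd.MultiIndex.from_tuples(
    [(value, name, kind) for value in ("Value1", "Value2") for name, kind in products],
    names=[None, "ProductName", "ProductType"],  # assumed level names
)
data = [
    [0, 1, 2, 4, 0, 0, 0, 1, 3, 3, 0, 0],
    [3, 0, 0, 0, 3, 0, 4, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1],
]
df = pd.DataFrame(
    data,
    index=pd.Index([100, 101, 102], name="Id"),
    columns=columns,
    dtype=float,
)

# Boolean mask over the columns: True where ProductName is Apple or Pear.
is_fruit = df.columns.get_level_values("ProductName").isin(["Apple", "Pear"])

# One boolean per Id: any positive Apple/Pear entry under Value1 or Value2.
condition = (df.loc[:, is_fruit] > 0).any(axis=1)

# Build the labels as a separate Series; df itself is left unchanged here.
label = condition.astype(float).rename("label")
print(label)
```

    For the three Ids shown in the question this gives 1.0 for 100 (Apple) and 101 (Pear), and 0.0 for 102, whose only positive values are for PopTart. Keeping the label as a separate Series is a design choice for the sketch: it avoids mixing a flat column into the multi-level feature columns before they are passed to a classifier.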