Pandas newbie here.
I'm trying to create a new column in my data frame that will serve as a training label when I feed this into a classifier.
The value of the label column is 1.0 if a given Id has (Value1 > 0) or (Value2 > 0) for Apples or Pears, and 0.0 otherwise.
My dataframe is row indexed by Id and looks like this:
Out[30]:
Value1 Value2 \
ProductName 7Up Apple Cheetos Onion Pear PopTart 7Up
ProductType Drinks Groceries Snacks Groceries Groceries Snacks Drinks
Id
100 0.0 1.0 2.0 4.0 0.0 0.0 0.0
101 3.0 0.0 0.0 0.0 3.0 0.0 4.0
102 0.0 0.0 0.0 0.0 0.0 2.0 0.0
ProductName Apple Cheetos Onion Pear PopTart
ProductType Groceries Snacks Groceries Groceries Snacks
Id
100 1.0 3.0 3.0 0.0 0.0
101 0.0 0.0 0.0 2.0 0.0
102 0.0 0.0 0.0 0.0 1.0
If the pandas wizards could give me a hand with the syntax for this operation - my mind is struggling to put it all together.
Thanks!
The answer provided by @vlad.rad works, but it is not very efficient since pandas has to manually loop in Python over all rows, not being able to take advantage of numpy vectorized functions speedup. The following vectorized solution should be more efficient:
condition = (df['Value1'] > 0) | (df['Value2'] > 0)
df.loc[condition, 'label'] = 1.
df.loc[~condition, 'label'] = 0.