Search code examples
pythonpandasiteratorambiguous

Understanding row iteration in pandas (python)


I have a dataframe with (in this example) 2 rows and the dataframe looks a bit like this:

PERON   START   END
AB     100      120
CC     110      115

(What I want , but which is not the question I have, is to make a new column with a flag for each row if "START"-'END' is equal to 20) (I also had similar problems in other use-cases)

(I tried

df['New']= df.apply(lambda x: 1 if abs(df.START-df.END)==20 else 0

I also tried:

 for i in df.iterrows():
    print i[0]
    if abs(df.START-df.END)==20:
        print 'Legit to make be a flag'

Problem: Each time I get the message "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

I understand that if you do not go through each row python doesn't know to which value to compare the "if-statement", therefore you use a.any() if you want to give a flag if at least 1 row does satify the condition of the if-statement. a.all() if all rows should match the row-condition...

Now, my question: Why is python not checking the if-statement on row-basis (certainly in the for-loop) and keeps it giving me the "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."-Error

Other formulation of the question: Why is python in the "for i in df.iterrows():" still executing the if-statement on the whole dataframe-level instead on row-by-row basis.

Because in the code below it prints that the for loop is in row '0' so I would assume python would do the if statement on row '0'. But it does not check the if statement for this row because it gives me the "Ambiguous thruth value error"...

for i in df.iterrows():
    print i[0]

I've looked at the "using-if-truth-statements-with-pandas"-documentation and some other stackoverflow questions but I can't figure out where I go wrong (highlikely because I'm still in the python-noob-phase).


Solution

  • This is down to how pandas vectorizes operations where it can.

    abs(df.START-df.END) ==20
    

    itself returns a series. This series is the column you seek:

    >>> df = pd.DataFrame([[100,120],[110,115]],index=['AB','CC'],columns=['START','END'])
    ... df
    ... abs(df.START-df.END)==20
    13: AB     True
    CC    False
    dtype: bool
    

    Observe:

    >>> df['New'] = abs(df.START-df.END)==20
    >>> df
    15:     START  END    New
    AB    100  120   True
    CC    110  115  False
    

    pandas is designed for this kind of operation.

    You should not be disheartened, as both of your attempts were pretty close:

    df['New']= df.apply(lambda x: 1 if abs(df.START-df.END)==20 else 0)
    

    is broken, because abs(df.START-df.END)==20 returns a Series. For each row, that lambda is asking if a Series is True, hence the error.

    If instead, you had:

    df['New']= df.apply(lambda x: 1 if abs(x.START-x.END)==20 else 0, axis =1)
    

    this would produce the desired result, because abs(x.START-x.END)==20 returns a boolean, not a series.