python pandas numpy vectorization

NumPy vectorization (accumulating a variable)


I have a dataframe which contains a few columns. It looks like this:

A  B   C  D
1  10  a  NaN
2  11  b  NaN
3  12  c  NaN

If column C contains 'b', I should compute A+B; in all other cases, A*B. On top of that, I have a variable that accumulates a value (the code below should make this clearer). So I wrote this code:

z = 0
for i, row in df.iterrows():
    a = df['A']
    b = df['B']
    c = df['C']
    if c == 'b':
        d = a + b + z
        z = z + 2
    else:
        d = a*b
    df.at[i, 'D'] = d

But df.iterrows() is an antipattern and I should avoid it in my code, because it will become a problem as my data set grows. I have tried to use vectorization, but I can't figure out how to accumulate. The code looks like this:

z = 0
con = (df['C'] == 'b',
      df['C'] != 'b')
choise = (
    (df['A'] + dfs['B'], z + 2),
    (df['A'] * dfs['B'], )
)

dfs['D'], z = np.select(con, choise)

Can someone help me with that? How can I accumulate the variable z?


Solution

  • I'm puzzled as to what that first code block is supposed to be doing:

    z = 0
    for i, row in df.iterrows():
        a = df['A']
        b = df['B']
        c = df['C']
        if c == 'b':
            d = a + b + z
            z = z + 2
        else:
            d = a*b
        df.at[i, 'D'] = d
    

    i and row are the iteration variables, but you don't use row, and only use i at the end to set something in the original df.

    Do you understand what iterrows does (other than calling it an "antipattern")?

    Look at a small df:

    In [168]: df = pd.DataFrame(np.arange(6).reshape(2,3), columns=['A','B','C'])
    In [169]: df
    Out[169]: 
       A  B  C
    0  0  1  2
    1  3  4  5
    

    and run iterrows with lots of prints:

    In [170]: for i, row in df.iterrows():
         ...:     print('==========')
         ...:     print(i, type(row));print(row)
         ...:     a = df['A']
         ...:     print('a', type(a));print(a)
         ...: 
    ==========
    0 <class 'pandas.core.series.Series'>
    A    0
    B    1
    C    2
    Name: 0, dtype: int64
    a <class 'pandas.core.series.Series'>
    0    0
    1    3
    Name: A, dtype: int64
    ==========
    1 <class 'pandas.core.series.Series'>
    A    3
    B    4
    C    5
    Name: 1, dtype: int64
    a <class 'pandas.core.series.Series'>
    0    0
    1    3
    Name: A, dtype: int64
    

    row is a pandas Series (the same type as one column of a dataframe), holding the data from one row. It's as if the row were turned into a column. df['A'] is also a Series, but it is one of the df columns.

    That whole:

    a = df['A']
    b = df['B']
    c = df['C']
    if c == 'b':
        d = a + b + z
        z = z + 2
    else:
        d = a*b
    

    block of code is working with the columns of the frame - whole columns, not values from one row. There's no point in repeating those calculations again and again in the loop.
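
For comparison, here is a minimal sketch of what the loop was presumably meant to do, assuming the intent was to read scalar values from each row rather than whole columns. The sample frame mirrors the one in the question; the `results` list is a name introduced only for this illustration.

```python
import pandas as pd

# Sample frame mirroring the question (values are illustrative).
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 11, 12], "C": ["a", "b", "c"]})

z = 0
results = []  # hypothetical helper list, collected row by row
for i, row in df.iterrows():
    # row is a Series holding ONE row; indexing it by column name gives scalars.
    a, b, c = row["A"], row["B"], row["C"]
    if c == "b":
        results.append(a + b + z)
        z += 2  # accumulate only after a 'b' row
    else:
        results.append(a * b)

df["D"] = results
print(df)
```

This still loops in Python, so it is slow on large frames, but the `if` now compares a single value and works as intended.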

    c is a Series, so if c == 'b' will raise an error. Using the a from my example:

    The '==' test produces a Series:

    In [172]: a==3
    Out[172]:
    0    False
    1     True
    Name: A, dtype: bool

    Using that Series in an if raises an ambiguity error.

    In [173]: if a==3: print('yes')
    Traceback (most recent call last):
      File "<ipython-input-173-1ccc6f02d1f6>", line 1, in <module>
        if a==3: print('yes')
      File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 1537, in __nonzero__
        raise ValueError(
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
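
As the error message suggests, the boolean Series has to be reduced to a single value before it can drive an `if`. A small sketch, reusing the `a` column from the example above (the variable names are only for illustration):

```python
import pandas as pd

a = pd.Series([0, 3], name="A")
mask = a == 3  # element-wise comparison, gives a boolean Series

any_match = bool(mask.any())       # True if at least one element equals 3
all_match = bool(mask.all())       # True only if every element equals 3
first_is_3 = bool(a.iloc[0] == 3)  # comparing one scalar is unambiguous

if any_match:
    print("at least one value is 3")
```

Which reduction is right depends on what the test is supposed to mean. In the question's loop, the test was most likely meant to apply to one row's value at a time.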
    

    So your use of iterrows is more than an "anti-pattern". The code that uses it is just plain wrong. I've gone into a lot of detail because I think you need more than a "quick" answer. You need to understand what is happening in your code.
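
Once the per-row logic is clear, the accumulation can be expressed without a loop. The sketch below is one possible approach, not the only one. It assumes z starts at 0 and grows by 2 after every 'b' row, as in the question, so the z seen by a given 'b' row is 2 times the number of earlier 'b' rows. The names `is_b` and `z_before` are introduced here for illustration, and the sample data has an extra 'b' row so the accumulation is visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 11, 12, 13],
    "C": ["a", "b", "b", "c"],
})

is_b = df["C"] == "b"  # boolean mask of the 'b' rows

# cumsum counts 'b' rows up to and including the current one, so subtracting 1
# gives the number of EARLIER 'b' rows; times 2 is the z the loop would have used.
z_before = 2 * (is_b.cumsum() - 1)

# Pick A + B + z on 'b' rows and A * B elsewhere (z_before is ignored there).
df["D"] = np.where(is_b, df["A"] + df["B"] + z_before, df["A"] * df["B"])

# Final value of the accumulator after all rows.
z = 2 * int(is_b.sum())
print(df)  # D column: 10, 13, 17, 52
```

This works because z depends only on how many 'b' rows came before, not on the computed values. If the accumulator depended on earlier results in D, a cumulative sum would not be enough and a loop (or a compiled helper) would probably be needed.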