python-3.x pandas numpy math elementwise-operations

Custom algorithm to deal with negative values within a DataFrame


To start, I have a DataFrame that looks like the following:

df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48], 'c': [34, -12, 7, 8, 22],
'd': [10, 8, -2, -4, 12]})

Goal: Eliminate all negative values using a script or function that preserves the effect of each negative value within its column.

I am trying to develop a method that looks through a DataFrame, finds each negative value, and computes its absolute value plus one. The negative value itself is then replaced with a positive value of one.

Next, I want to subtract that computed amount (the absolute value of the negative number, plus one) from the value in the next row of the same column.

In addition: when the value following a negative value is also negative, I want to apply the same operation to both negative values, subtracting the sum of their amounts (absolute value plus one, for each negative number) from the next positive row. If a row value drops below 1 after this subtraction, I want to carry the remainder on to the row after that, until the negative value is gone and no rows following it are less than 1.

The expected output will hopefully help grasp what I intend to do:

expected_output = pd.DataFrame({'a': [25, 22, 1, 13, 10], 'b': [1, 12, 1, 22, 48],
'c': [34, 1, 1, 1, 22],'d': [10, 8, 1, 1, 4]})

I can replace the negative value with the absolute value of the negative value, plus one, using:

df[df < 0] = abs(df[df < 0]) + 1

I also know I can find the location of the negative value using:

neg_loc = df.loc[df['a'] < 0].index

Now I find the value after the negative value using:

row_after_neg = df['a'].iloc[neg_loc + 1]

Lastly, I can subtract the absolute value of the negative value, plus one, from the row after the negative value with:

total = row_after_neg - (abs(df['a'].loc[neg_loc]) + 1).to_numpy()

So, my question is how do I stitch this together so that it goes through the entire DataFrame and does what I specified.

Thank you in advance for the advice/help!


Solution

  • Your question was a bit confusing, so I hope I have addressed all of the requirements; if not, let me know in the comments. I chose a for-loop since you want a row-by-row comparison. If speed is an issue, I would avoid the for-loop and see whether you can stay within pandas' own vectorized operations.

    Setup:

    import pandas as pd
    df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48], 'c': [34, -12, 7, 8, 22],
    'd': [10, 8, -2, -4, 12]})
    

    Create a version of the dataframe that holds abs(negative value) + 1 wherever df is negative, with 0 replacing the NaNs at every other position:

    pos_df = (abs(df[df < 0]) + 1).fillna(0)
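
    To make this step concrete, here is a small sketch that builds the intermediate frame for the sample data in two steps. The name `neg_only` is introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48],
                   'c': [34, -12, 7, 8, 22], 'd': [10, 8, -2, -4, 12]})

neg_only = df[df < 0]                   # same shape as df, NaN wherever df is not negative
pos_df = (abs(neg_only) + 1).fillna(0)  # abs(value) + 1, with 0 in place of each NaN

print(pos_df)
```

    For the sample data, column d of pos_df comes out as 0, 0, 3, 5, 0: the -2 contributes 3 and the -4 contributes 5.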
    

    For loop to iterate through the dataframe starting from the second row:

    for index, row in df.iloc[1:,:].iterrows():
    

    Then you subtract the previous row of pos_df from the current row of the dataframe:

    df.loc[index] = row - pos_df.loc[index - 1]
    

    Then you recalculate the current row of pos_df, since some numbers may have become negative after the subtraction. One caveat: the mask selects everything that is negative in df, not in pos_df. The row and the mask go into a single .loc call so that the assignment writes to pos_df itself rather than to a temporary copy.

    pos_df.loc[index, df.loc[index] < 0] = (abs(df.loc[index][df.loc[index] < 0]) + 1).fillna(0)
    

    Lastly change all negative values to 1:

    df[df < 0] = 1
    

    Here's the entire code:

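    import pandas as pd

    df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48], 'c': [34, -12, 7, 8, 22],
    'd': [10, 8, -2, -4, 12]})
    # abs(negative value) + 1 where df is negative, 0 everywhere else
    pos_df = (abs(df[df < 0]) + 1).fillna(0)
    
    # index - 1 assumes the default 0, 1, 2, ... row index
    for index, row in df.iloc[1:,:].iterrows():
        # subtract what the previous row carries forward
        df.loc[index] = row - pos_df.loc[index - 1]
        # recompute the carry for values that are, or have become, negative;
        # a single .loc call avoids assigning into a temporary copy
        pos_df.loc[index, df.loc[index] < 0] = (abs(df.loc[index][df.loc[index] < 0]) + 1).fillna(0)
    
    # every remaining negative value becomes 1
    df[df < 0] = 1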
    

    With the final output being:

        a       b       c       d
    0   25.0    1.0     34.0    10.0
    1   22.0    12.0    1.0     8.0
    2   1.0     1.0     1.0     1.0
    3   13.0    22.0    1.0     1.0
    4   10.0    48.0    22.0    4.0
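
    If the loop becomes a bottleneck, one option is to keep the row-by-row pass but do the per-row arithmetic on a NumPy array, so that each step is vectorized across columns. The sketch below is one possible reading of the rules in the question, not a drop-in replacement: the function name `carry_negatives` and its `threshold` parameter are hypothetical, and it assumes integer-valued columns.

```python
import numpy as np
import pandas as pd


def carry_negatives(frame, threshold=0):
    """Hypothetical helper: replace values below `threshold` with 1 and
    subtract the shortfall (1 - value) from the next row of the same column."""
    out = frame.to_numpy(dtype="int64", copy=True)   # assumes integer data
    carry = np.zeros(out.shape[1], dtype="int64")    # amount owed, per column
    for i in range(out.shape[0]):
        out[i] -= carry                          # apply the previous row's carry
        below = out[i] < threshold
        carry = np.where(below, 1 - out[i], 0)   # equals abs(v) + 1 when v < 0
        out[i, below] = 1
    return pd.DataFrame(out, index=frame.index, columns=frame.columns)


df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48],
                   'c': [34, -12, 7, 8, 22], 'd': [10, 8, -2, -4, 12]})
result = carry_negatives(df)
print(result)
```

    With the default `threshold=0` only strictly negative values are replaced, which matches the loop above. Passing `threshold=1` would also catch a value that lands exactly on 0, which is closer to the "no rows less than 1" wording in the question.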
    

    Hope that helps!

    Edit:

    So the code:

    pos_df.loc[index, df.loc[index] < 0] = (abs(df.loc[index][df.loc[index] < 0]) + 1).fillna(0)
    

    is a relatively complex pandas expression. A bit of context: in pandas there are Series and DataFrames, and you can think of a Series as a single column or row of a DataFrame. When you perform a conditional selection on a DataFrame, it retains its shape, and the values that did not meet the condition appear as NaN. With a Series, you get only the values that met the condition.

    Here's an example:

    df[df == 1]
    series[series == 1]
    
    
        a      b      c      d
    0   NaN    NaN    NaN    NaN
    1   NaN    NaN    1.0    NaN
    2   NaN    NaN    1.0    1.0
    3   NaN    NaN    1.0    1.0
    4   NaN    NaN    NaN    NaN
    
    c
    1.0
    1.0
    1.0
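
    The shape difference can be checked directly. This is a minimal sketch on a two-column frame; the names `frame`, `column`, `masked_frame` and `masked_column` are made up for the illustration.

```python
import pandas as pd

frame = pd.DataFrame({'c': [34, 1, 1, 1, 22], 'd': [10, 8, 1, 1, 4]})
column = frame['c']

masked_frame = frame[frame == 1]      # keeps the shape, NaN where the condition fails
masked_column = column[column == 1]   # keeps only the matching entries

print(masked_frame.shape)             # (5, 2)
print(masked_column.shape)            # (3,)
print(masked_column.index.tolist())   # [1, 2, 3]
```

    The filtered Series keeps the original row labels, which is what lets pandas line the values up again when they are assigned back.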
    

    So as you can see above the series changes shape from [5,1] to [3,1]. Now back to the code. pos_df.loc[index] selects the row we're interested in from the dataframe with the transformed negative numbers. It might be easier to conceptualize if you view it as:

    pos_s = pos_df.loc[index]
    s = df.loc[index]
    
    pos_s[s < 0] = (abs(s[s < 0]) + 1).fillna(0)
    

    So as you can see above this is just the same expression you used in your question, but performed on the rows that the for-loop is iterating through at the moment rather than the entire dataframe. In the code:

    (abs(s[s < 0]) + 1).fillna(0)
    

    I'm finding all of the values in the row that were, or have become, negative and re-transforming them. I then get a series like:

    c
    13
    12
    4
    

    Because this is a series, its shape is [3,1] and not the [5,1] of the full row. To avoid messing up the dataframe, I must replace only the values at the negative positions, so I use the code:

    pos_s[s < 0]
    

    Here the condition s < 0 marks everything in df's row which is negative, so the mask looks like

    [False, True, True, False, True] 
    

    and then I apply this mask to pos_df's row to select the second, third and fifth values and update them, in case any of the originally positive values have become negative after the subtraction.
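
    As a small sketch of that masked update on a single row, the snippet below uses the values row 3 of the sample data has partway through the loop (13, 22, 1, -7), with a one-row df and pos_df built by hand for the illustration. It passes the row label and the mask to one `.loc` call, since assigning through two chained lookups may write to a temporary copy, depending on the pandas version and its copy-on-write settings.

```python
import pandas as pd

# Row 3 after the subtraction step, and its stale carry row (d still holds 5).
df = pd.DataFrame({'a': [13.0], 'b': [22.0], 'c': [1.0], 'd': [-7.0]}, index=[3])
pos_df = pd.DataFrame({'a': [0.0], 'b': [0.0], 'c': [0.0], 'd': [5.0]}, index=[3])

index = 3
s = df.loc[index]
negative = s < 0                                      # boolean mask over the columns

pos_df.loc[index, negative] = s[negative].abs() + 1   # only column d is updated
print(pos_df.loc[index].tolist())                     # [0.0, 0.0, 0.0, 8.0]
```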