To start, I have a DataFrame that looks like the following:
df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48], 'c': [34, -12, 7, 8, 22],
'd': [10, 8, -2, -4, 12]})
I am trying to develop a method that looks through a DataFrame, finds each negative value, and computes its absolute value plus one. Adding that amount to the negative value in essence replaces each negative value in the DataFrame with a positive value of one.
Next, I want to subtract that calculated amount (the absolute value of the negative number, plus one) from the value in the next row of the same column.
In addition, when the value following a negative value is also negative, I want to do the same operation for both negative values, and subtract the sum of their amounts (the absolute value plus one, for each negative number) from the next positive row. If a row following a corrected negative becomes less than 1 after the subtraction, I want to subtract the shortfall from the row after that, and so on, until the negative values are gone and no rows following them are less than 1.
The expected output will hopefully help grasp what I intend to do:
expected_output = pd.DataFrame({'a': [25, 22, 1, 13, 10], 'b': [1, 12, 1, 22, 48],
'c': [34, 1, 1, 1, 22],'d': [10, 8, 1, 1, 4]})
I can replace the negative value with the absolute value of the negative value, plus one, using:
df[df < 0] = abs(df[df < 0] + 1)
I also know I can find the location of the negative value using:
neg_loc = df.loc[df['a'] < 0].index
Now I find the value after the negative value using:
row_after_neg = df['a'].iloc[neg_loc + 1]
Lastly, I can add the absolute value of the negative value plus one to the row after the negative value with:
total = row_after_neg.add(abs(neg_loc + 1))
So, my question is: how do I stitch this together so that it goes through the entire DataFrame and does what I specified?
Thank you in advance for the advice/help!
Your question was a bit confusing, so I hope I've addressed all of the requirements; if not, let me know in the comments. I chose a for-loop since you want a row-by-row comparison. If speed is an issue, I would avoid the for-loop and see whether you can stay within pandas' vectorized operations.
Setup:
import pandas as pd
df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48], 'c': [34, -12, 7, 8, 22],
'd': [10, 8, -2, -4, 12]})
Creating a version of the dataframe which has abs(negative values) + 1 for the negative values and 0 (replacing the NaNs) for the positive values:
pos_df = (abs(df[df < 0]) + 1).fillna(0)
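To make the role of pos_df concrete, here is a small runnable sketch that builds it from the example data and prints it. It only restates the line above; the per-column values in the comments were worked out by hand from the example numbers.

```python
import pandas as pd

df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48],
                   'c': [34, -12, 7, 8, 22], 'd': [10, 8, -2, -4, 12]})

# df[df < 0] keeps the frame's shape and puts NaN wherever the value is not negative.
# abs(...) + 1 turns each negative value into the amount to carry to the next row,
# and fillna(0) means "nothing to carry" for the non-negative cells.
pos_df = (abs(df[df < 0]) + 1).fillna(0)

print(pos_df)
# Per column (floats, because of the intermediate NaNs):
#   a: 0, 0, 3, 0, 0
#   b: 6, 0, 3, 0, 0
#   c: 0, 13, 0, 0, 0
#   d: 0, 0, 3, 5, 0
```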
For loop to iterate through the dataframe starting from the second row:
for index, row in df.iloc[1:,:].iterrows():
Then, inside the loop, you subtract the previous row of pos_df from the current row of the dataframe (index - 1 relies on the default integer index):
    df.loc[index] = row - pos_df.loc[index - 1]
Then you recalculate the current row of pos_df, since you want to check whether any numbers have become negative. One caveat: the negative values are looked up in the df dataframe, not in pos_df.
    pos_df.loc[index][df.loc[index] < 0] = (abs(df.loc[index][df.loc[index] < 0]) + 1).fillna(0)
Lastly change all negative values to 1:
df[df < 0] = 1
Here's the entire code:
import pandas as pd
df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48], 'c': [34, -12, 7, 8, 22],
'd': [10, 8, -2, -4, 12]})
pos_df = (abs(df[df < 0]) + 1).fillna(0)
for index, row in df.iloc[1:,:].iterrows():
    df.loc[index] = row - pos_df.loc[index - 1]
    pos_df.loc[index][df.loc[index] < 0] = (abs(df.loc[index][df.loc[index] < 0]) + 1).fillna(0)
df[df < 0] = 1
With the final output being:
      a     b     c     d
0  25.0   1.0  34.0  10.0
1  22.0  12.0   1.0   8.0
2   1.0   1.0   1.0   1.0
3  13.0  22.0   1.0   1.0
4  10.0  48.0  22.0   4.0
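If the iterrows loop turns out to be too slow, one option is to keep the single top-to-bottom pass but run it on a NumPy array, so each step handles all columns at once. This is only a sketch under my own assumptions: carry_negatives is a hypothetical helper name, not a pandas function, and it follows the same "< 0" test as the loop above, so a value that lands on exactly 0 is left alone.

```python
import numpy as np
import pandas as pd

def carry_negatives(df):
    """Hypothetical helper: replace negatives with 1 and push the
    correction (abs(value) + 1) down to the next row of each column."""
    values = df.to_numpy(dtype=float, copy=True)   # work on a float copy
    carry = np.zeros(values.shape[1])              # amount owed by the row above
    for i in range(values.shape[0]):
        values[i] -= carry                         # pay off the previous row's correction
        neg = values[i] < 0                        # cells that are (or became) negative
        carry = np.where(neg, np.abs(values[i]) + 1, 0.0)
        values[i, neg] = 1.0                       # corrected negatives become 1
    return pd.DataFrame(values, index=df.index, columns=df.columns)

df = pd.DataFrame({'a': [25, 22, -2, 16, 10], 'b': [-5, 18, -2, 25, 48],
                   'c': [34, -12, 7, 8, 22], 'd': [10, 8, -2, -4, 12]})
result = carry_negatives(df)
print(result)
```

On the example data this should give the same numbers as the table above. It also avoids writing back into pos_df row by row, since the carry lives in a plain array.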
Hope that helps!
Edit:
So the code:
pos_df.loc[index][df.loc[index] < 0] = (abs(df.loc[index][df.loc[index] < 0]) + 1).fillna(0)
is a relatively complex pandas expression. A bit of context: in pandas there are Series and DataFrames, and you can think of a Series as a single column or row of a DataFrame. When you perform a conditional selection on a DataFrame, the DataFrame retains its shape, with the values that didn't meet the criteria appearing as NaN. With a Series, you get just the values which met the condition.
Here's an example, using the final df from above and series = df['c']:
df[df == 1]
     a    b    c    d
0  NaN  1.0  NaN  NaN
1  NaN  NaN  1.0  NaN
2  1.0  1.0  1.0  1.0
3  NaN  NaN  1.0  1.0
4  NaN  NaN  NaN  NaN
series[series == 1]
1    1.0
2    1.0
3    1.0
Name: c, dtype: float64
As you can see above, the DataFrame keeps all 5 rows while the Series shrinks from 5 values to 3. Now back to the code. pos_df.loc[index] selects the row we're interested in from the dataframe with the transformed negative numbers. It might be easier to conceptualize if you view it as:
pos_s = pos_df.loc[index]
s = df.loc[index]
pos_s[s < 0] = (abs(s[s < 0]) + 1).fillna(0)
As you can see, this is just the same expression you used in your question, but performed on the row that the for-loop is iterating through at the moment rather than on the entire dataframe. In the code:
(abs(s[s < 0]) + 1).fillna(0)
I'm finding all of the values in the row that were, or have become, negative and transforming them again. For illustration, take a row of five values in which three are negative. I then get a Series like:
13
12
4
Since this is a Series, you'll notice that it holds 3 values and not the row's 5. So, to avoid messing up the dataframe, I must replace only the negative positions. For that I use:
pos_s[s < 0]
I'm selecting everything in df's row which is negative, so the mask is
[False, True, True, False, True]
and then I apply this condition to pos_df's row to select the second, third, and fifth values and update them, in case any of the originally positive values have become negative once subtracted.
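The masked update described above can also be written as a single .loc assignment instead of the chained pos_df.loc[index][...] form. This matters mainly on newer pandas: with Copy-on-Write enabled (optional in pandas 2.x and, as far as I know, the default from 3.0), a chained assignment writes into a temporary object and pos_df is left unchanged. The sketch below is an assumption-laden illustration, not the original code: the one-row frames are stand-ins for the loop state at index 3 of the example data, after the previous row's amount has been subtracted.

```python
import pandas as pd

# Stand-in for df at index 3 after the subtraction step (values taken from the example).
df = pd.DataFrame({'a': [13.0], 'b': [22.0], 'c': [1.0], 'd': [-7.0]}, index=[3])
# Stand-in for pos_df: nothing to carry yet for this row.
pos_df = pd.DataFrame(0.0, index=df.index, columns=df.columns)

index = 3
neg = df.loc[index] < 0   # boolean Series, one entry per column

# One .loc call selects the row and the masked columns together, so the
# assignment lands in pos_df itself rather than in an intermediate Series.
pos_df.loc[index, neg] = (df.loc[index, neg].abs() + 1).to_numpy()

print(neg.tolist())                 # [False, False, False, True]
print(pos_df.loc[index].tolist())   # [0.0, 0.0, 0.0, 8.0]
```

Inside the loop, the same line would replace the chained expression without changing the rest of the code.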