I have a dataframe that looks like this:
print(df)
Out[1]:
Numbers
0 0
1 1
2 1
3 1
4 1
5 0
6 0
7 1
8 0
9 1
10 1
I want to transform it to this:
print(dfo)
Out[2]:
Numbers
0 0
1 1
2 2
3 3
4 4
5 0
6 0
7 1
8 0
9 1
10 2
The solution to this,I thought, it could be an iloc with 2 ifs: Check if the digit in df is 1, if1 true then check if2 the i-1 is 1, if true then in dfo see the value of i-1 and add 1,elifs just put the value of 0 in dfo.
I've tryed this:
# Import pandas library
import pandas as pd
# initialize list elements
list = [0,1,1,1,1,0,0,1,0,1,1]
# Create the pandas DataFrame with column name is provided explicitly
df = pd.DataFrame(list, columns=['Numbers'])
# print dataframe.
df
data1c = df.copy()
for j in df:
for i in range(len(df)):
if df.loc[i, j] == 1:
if df.loc[i-1, j] == 1:
data1c.loc[i, j] = data1c.loc[i-1, j]+1
elif df.loc[i-1, j] == 0:
data1c.loc[i, j] = 1
elif df.loc[i, j] == 0:
data1c.loc[i, j] = 0
print(data1c)
Numbers
0 0
1 1
2 2
3 3
4 4
5 0
6 0
7 1
8 0
9 1
10 2
and for a dataframe of 1 column it works, but when I've tryed with a dataframe with 2 columns :
input = {'A': [0,1,1,1,1,0,0,1,0,1,1,0,1,1],
'B': [1,1,0,0,1,1,1,0,0,1,0,1,1,0]}
df = pd.DataFrame(input)
# Print the output.
df
data2c = df.copy()
for j in dfo:
for i in range(len(dfo)):
if dfo.loc[i, j] == 1:
if dfo.loc[i-1, j] == 1:
data2c.loc[i, j] = data2c.loc[i-1, j]+1
elif dfo.loc[i-1, j] == 0:
data2c.loc[i, j] = 1
elif dfo.loc[i, j] == 0:
data2c.loc[i, j] = 0
I get :
File "C:\Users\talls\.conda\envs\Spyder\lib\site-packages\pandas\core\indexes\range.py", line 393, in get_loc
raise KeyError(key) from err
KeyError: -1
Why do I get this error and how do I fix it? or Is there another way to get my desired out put?
I know this is not the answer to the question "How to use iloc to get reference from the rows above?", but it is the answer to your proposed question.
df = pd.DataFrame([0,1,1,1,1,0,0,1,0,1,1], columns=['Numbers'])
df['tmp'] = (~(df['Numbers'] == df['Numbers'].shift(1))).cumsum()
df['new'] = df.groupby('tmp')['Numbers'].cumsum()
print(df['new'])
Numbers tmp new
0 0 1 0
1 1 2 1
2 1 2 2
3 1 2 3
4 1 2 4
5 0 3 0
6 0 3 0
7 1 4 1
8 0 5 0
9 1 6 1
10 1 6 2
How does this work? The inner part ~(df['Numbers'] == df['Numbers'].shift(1))
checks whether the previous line is the same as the current line. For the first line, this works perfectly as well, because a number and a NaN
always compare to False
. Then I negate it mark the start of each new sequence with a True
. When I then do a cumulative sum, I "group" all values with the created tmp
column and do a cumulative sum over it to get the required answer in the new
column.
For the two columned version, you'd do exactly the same... for both columns A and B:
df = pd.DataFrame(input)
df['tmp'] = (~(df['A'] == df['A'].shift(1))).cumsum()
df['newA'] = df.groupby('tmp')['A'].cumsum()
#Just reusing column tmp
df['tmp'] = (~(df['B'] == df['B'].shift(1))).cumsum()
df['newB'] = df.groupby('tmp')['B'].cumsum()
print(df)
A B tmp newA newB
0 0 1 1 0 1
1 1 1 1 1 2
2 1 0 2 2 0
3 1 0 2 3 0
4 1 1 3 4 1
5 0 1 3 0 2
6 0 1 3 0 3
7 1 0 4 1 0
8 0 0 4 0 0
9 1 1 5 1 1
10 1 0 6 2 0
11 0 1 7 0 1
12 1 1 7 1 2
13 1 0 8 2 0