Search code examples
pythonpandasreplacefillna

pandas replace only part of a column


Here is my input:

import pandas as pd
import numpy as np

list1 = [10,79,6,38,4,557,12,220,46,22,45,22]
list2 = [4,3,23,6,234,47,312,2,426,42,435,23]

df = pd.DataFrame({'A' : list1, 'B' : list2}, columns = ['A', 'B'])
df['C'] = np.where (df['A'] > df['B'].shift(-2), 1, np.nan)
print (df)

that produces this output:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  NaN
3    38    6  NaN
4     4  234  NaN
5   557   47  1.0
6    12  312  NaN
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

What I need to do is to change column 'C' to be a set of three 1's in a row, non-overlapping. The desired output is:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

So, rows 2, 3, and 6 change from NaN to 1.0. Row 7 already has a 1.0 and it is ignored. Rows 8 and 9 need to stay NaN because row 7 is the last entry of the previous set.

I don't know if there is a better way to build column 'C' that would do this at creation.

I have tried several versions of fillna and ffill, none of them worked for me.

It seems very convoluted but I tried to isolate the row id's for each 1.0 with this line:

print (df.loc[df['C'] == 1])

Which correctly outputs this:

     A   B    C
1   79   3  1.0
5  557  47  1.0
7  220   2  1.0

Even though I know that information, I don't know how to proceed from there.

Thank you so much for your help in advance, David


Solution

  • EDIT:

    Faster version (thanks to b2002):

    ii = df[pd.notnull(df.C)].index
    dd = np.diff(ii)
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
    jj = [ii[0]] + jj
    
    for ci in jj:
        df.C.values[ci:ci+3] = 1.0
    

    First get the indices of all your starting points, i.e. all your points that are 1.0 and have two NaN following, by looking at the differences between the points that are not null in the C column (first index is included by default), then iterate over those indices and use loc to change slices of your C column:

    ii = df[pd.notnull(df.C)].index
    dd = np.diff(ii)
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
    jj = [ii[0]] + jj
    
    for ci in jj:
        df.loc[ci:ci+2,'C'] = 1.0
    

    Result:

          A    B    C
    0    10    4  NaN
    1    79    3  1.0
    2     6   23  1.0
    3    38    6  1.0
    4     4  234  NaN
    5   557   47  1.0
    6    12  312  1.0
    7   220    2  1.0
    8    46  426  NaN
    9    22   42  NaN
    10   45  435  NaN
    11   22   23  NaN