Tags: python, pandas, where-clause, apply, fill

Conditional fill of columns in a pandas df


This question is similar to a few existing questions about conditional filling. I'm trying to conditionally fill the columns Numx and Numy based on the following rules.

If the value in Code starts with A, I want to keep the values as they are.

If the value in Code starts with B, I want to keep the first value and return NaN for the following rows until the next value in Code.

If the value in Code starts with C, I want to keep repeating the first value until the next floats in ['Numx','Numy'].

import pandas as pd
import numpy as np


# '' marks a blank cell
d = {
    'Code': ['A1','A1','','B1','B1','A2','A2','','B2','B2','','A3','A3','A3','','B1','','B4','B4','A2','A2','A1','A1','','B4','B4','C1','C1','','','D1','','B2'],
    'Numx': [30.2,30.5,30.6,35.6,40.2,45.5,46.1,48.1,48.5,42.2,'',30.5,30.6,35.6,40.2,45.5,'',48.1,48.5,42.2,40.1,48.5,42.2,'',48.5,42.2,43.1,44.1,'','','','',45.1],
    'Numy': [1.9,2.3,2.5,2.2,2.5,3.1,3.4,3.6,3.7,5.4,'',2.3,2.5,2.2,2.5,3.1,'',3.6,3.7,5.4,6.5,8.5,2.2,'',8.5,2.2,2.3,2.5,'','','','',3.2],
}

df = pd.DataFrame(data=d)

Output:

   Code  Numx Numy
0    A1  30.2  1.9
1    A1  30.5  2.3
2        30.6  2.5
3    B1  35.6  2.2
4    B1  40.2  2.5
5    A2  45.5  3.1
6    A2  46.1  3.4
7        48.1  3.6
8    B2  48.5  3.7
9    B2  42.2  5.4
10        nan  nan       
11   A3  30.5  2.3
12   A3  30.6  2.5
13   A3  35.6  2.2
14       40.2  2.5
15   B1  45.5  3.1
16        nan  nan     
17   B4  48.1  3.6
18   B4  48.5  3.7
19   A2  42.2  5.4
20   A2  40.1  6.5
21   A1  48.5  8.5
22   A1  42.2  2.2
23        nan  nan      
24   B4  48.5  8.5
25   B4  42.2  2.2
26   C1  43.1  2.3
27   C1  44.1  2.5
28        nan  nan      
29        nan  nan   
30   D1   nan  nan      
31        nan  nan        
32   B2  45.1  3.2

I have used code posted in another question, but it returns too many NaNs:

df['Code_new'] = df['Code'].where(df['Code'].isin(['A1','A2','A3','A4','B1','B2','B4','C1'])).ffill()

df[['Numx','Numy']] = df[['Numx','Numy']].mask(df['Code_new'].duplicated())
mask = df['Code_new'] == 'A1'
df.loc[mask, ['Numx','Numy']] = df.loc[mask, ['Numx','Numy']].ffill()

This produces this output:

   Code  Numx Numy Code_new
0    A1  30.2  1.9       A1
1    A1  30.2  1.9       A1
2        30.2  1.9       A1
3    B1  35.6  2.2       B1
4    B1   NaN  NaN       B1
5    A2  45.5  3.1       A2
6    A2   NaN  NaN       A2
7         NaN  NaN       A2
8    B2  48.5  3.7       B2
9    B2   NaN  NaN       B2
10        NaN  NaN       B2
11   A3  30.5  2.3       A3
12   A3   NaN  NaN       A3
13   A3   NaN  NaN       A3
14        NaN  NaN       A3
15   B1   NaN  NaN       B1
16        NaN  NaN       B1
17   B4  48.1  3.6       B4
18   B4   NaN  NaN       B4
19   A2   NaN  NaN       A2
20   A2   NaN  NaN       A2
21   A1  30.2  1.9       A1
22   A1  30.2  1.9       A1
23       30.2  1.9       A1
24   B4   NaN  NaN       B4
25   B4   NaN  NaN       B4
26   C1  43.1  2.3       C1
27   C1   NaN  NaN       C1
28        NaN  NaN       C1
29        NaN  NaN       C1
30   D1   NaN  NaN       C1
31        NaN  NaN       C1
32   B2   NaN  NaN       B2

My desired output would be:

   Code  Numx Numy
0    A1  30.2  1.9
1    A1  30.5  2.3
2        30.6  2.5
3    B1  35.6  2.2
4    B1   nan  nan
5    A2  45.5  3.1
6    A2  46.1  3.4
7        48.1  3.6
8    B2  48.5  3.7
9    B2   nan  nan
10        nan  nan        
11   A3  30.5  2.3
12   A3  30.6  2.5
13   A3  35.6  2.2
14       40.2  2.5
15   B1  45.5  3.1
16        nan  nan         
17   B4  48.1  3.6
18   B4   nan  nan
19   A2  42.2  5.4
20   A2  40.1  6.5
21   A1  48.5  8.5
22   A1  42.2  2.2
23        nan  nan      
24   B4  48.5  8.5
25   B4   nan  nan
26   C1  43.1  2.3
27   C1  43.1  2.3
28       43.1  2.3   
29       43.1  2.3   
30   D1  43.1  2.3   
31       43.1  2.3         
32   B2  45.1  3.2

I think I need to change the line mask = df['Code_new'] == 'A1'. The code works, but I'm only applying it to the values in Code that are 'A1'. Is it as easy as adding all the other values here, i.e. A1-A4, B1-B4, C1?
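
One thing worth checking before widening the mask: `Series.duplicated()` looks at the whole column, not at consecutive blocks, so a code that shows up again later (for example B1 at rows 3 and 15) is flagged as a duplicate from its second block onward. A minimal sketch on made-up labels (the `labels` series is illustrative, not the data above):

```python
import pandas as pd

# Toy labels: 'A1' and 'B1' each appear in two separate blocks.
labels = pd.Series(['A1', 'A1', 'B1', 'B1', 'A1', 'B1'])

# duplicated() marks every repeat after the first occurrence anywhere
# in the column, including the first row of a later block.
flagged = labels.duplicated()

print(flagged.tolist())  # [False, True, False, True, True, True]
```

Rows 4 and 5 start new blocks but are still flagged, which would explain first rows such as index 15 being masked in the output above.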


Solution

  • I believe you need:

    # codes that start a new block; anything else (blank, D1) inherits the code above
    m2 = df['Code'].isin(['A1','A2','A3','A4','B1','B2','B4','C1'])

    # helper column: forward-fill the codes, then append a run counter
    # so that repeated codes in separate blocks stay distinct
    df['Code_new'] = df['Code'].where(m2).ffill()
    df['Code_new'] = (df['Code_new'] + '_' +
                      df['Code_new'].ne(df['Code_new'].shift()).cumsum().astype(str))

    # within each block, set every row after the first to NaN, except in A blocks
    m1 = df['Code_new'].str.startswith(('A1','A2','A3','A4')).fillna(False)
    df[['Numx','Numy']] = df[['Numx','Numy']].mask(df['Code_new'].duplicated() & ~m1)

    # forward-fill only the blocks starting with C
    mask = df['Code_new'].str.startswith('C').fillna(False)
    df.loc[mask, ['Numx','Numy']] = df.loc[mask, ['Numx','Numy']].ffill()

    print(df)
    

       Code  Numx Numy Code_new
    0    A1  30.2  1.9     A1_1
    1    A1  30.5  2.3     A1_1
    2        30.6  2.5     A1_1
    3    B1  35.6  2.2     B1_2
    4    B1   NaN  NaN     B1_2
    5    A2  45.5  3.1     A2_3
    6    A2  46.1  3.4     A2_3
    7        48.1  3.6     A2_3
    8    B2  48.5  3.7     B2_4
    9    B2   NaN  NaN     B2_4
    10        NaN  NaN     B2_4
    11   A3  30.5  2.3     A3_5
    12   A3  30.6  2.5     A3_5
    13   A3  35.6  2.2     A3_5
    14       40.2  2.5     A3_5
    15   B1  45.5  3.1     B1_6
    16        NaN  NaN     B1_6
    17   B4  48.1  3.6     B4_7
    18   B4   NaN  NaN     B4_7
    19   A2  42.2  5.4     A2_8
    20   A2  40.1  6.5     A2_8
    21   A1  48.5  8.5     A1_9
    22   A1  42.2  2.2     A1_9
    23                     A1_9
    24   B4  48.5  8.5    B4_10
    25   B4   NaN  NaN    B4_10
    26   C1  43.1  2.3    C1_11
    27   C1  43.1  2.3    C1_11
    28       43.1  2.3    C1_11
    29       43.1  2.3    C1_11
    30   D1  43.1  2.3    C1_11
    31       43.1  2.3    C1_11
    32   B2  45.1  3.2    B2_12
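
The helper column works because comparing each label with the previous one and taking a cumulative sum numbers the consecutive blocks, so the first row of every block can be told apart from the rest. Below is a small sketch of the same idea wrapped in a function and run on a toy frame. The function name `conditional_fill`, its parameters, and the toy data are assumptions made for illustration; it also uses a plain run counter instead of the `Code_new` string column.

```python
import pandas as pd

def conditional_fill(df, known, value_cols):
    """Hypothetical helper: A blocks untouched, B blocks keep only the
    first row, C blocks repeat the first row."""
    out = df.copy()
    # Rows with an unknown or blank code inherit the code above them.
    label = out['Code'].where(out['Code'].isin(known)).ffill()
    # Number consecutive blocks: the counter increases whenever the label changes.
    run = label.ne(label.shift()).cumsum()
    first_in_run = ~run.duplicated()
    is_a = label.str.startswith('A', na=False)
    is_c = label.str.startswith('C', na=False)
    for col in value_cols:
        # Blank every non-first row of a block unless the block starts with A.
        out[col] = out[col].mask(~first_in_run & ~is_a)
        # Repeat the first value down the rows that belong to C blocks.
        out.loc[is_c, col] = out.loc[is_c, col].ffill()
    return out

toy = pd.DataFrame({
    'Code': ['A1', 'A1', 'B1', 'B1', '', 'C1', 'C1', '', 'B1'],
    'Num':  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})
result = conditional_fill(toy, known=['A1', 'B1', 'C1'], value_cols=['Num'])
print(result['Num'].tolist())
# [1.0, 2.0, 3.0, nan, nan, 6.0, 6.0, 6.0, 9.0]
```

In the question's data the gaps are empty strings rather than NaN, so it may be worth converting the value columns first, for instance with `replace('', np.nan)`. That step is an assumption and is not part of the answer above.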