python pandas dataframe data-manipulation

How to count the NaNs before the first occurrence of a value in a given column?


Suppose I have a df like this:

column1 | column2    | column3 |
   1    | 2023-02-21 |   NaN   |
   1    | 2023-02-22 |   NaN   |
   1    | 2023-02-23 |    8    |
   1    | 2023-02-24 |   NaN   |
   1    | 2023-02-24 |   NaN   |
   1    | 2023-02-24 |   NaN   |
   1    | 2023-02-24 |   NaN   |
   1    | 2023-02-24 |   10    |
   2    | 2023-02-25 |   NaN   |
   2    | 2023-02-26 |    9    |

Is there a way to get the following result df?

column1 | column2    | column3 | result
   1    | 2023-02-21 |   NaN   |   3
   1    | 2023-02-22 |   NaN   |   3
   1    | 2023-02-23 |    8    |   3
   1    | 2023-02-24 |   NaN   |   3
   1    | 2023-02-24 |   NaN   |   3
   1    | 2023-02-24 |   NaN   |   3
   1    | 2023-02-24 |   NaN   |   3
   1    | 2023-02-24 |   10    |   3
   2    | 2023-02-25 |   NaN   |   2
   2    | 2023-02-26 |    9    |   2

I cannot think of a way to get this output with pandas other than counting the NaNs in column3 for each column1 value. Any help would be greatly appreciated. Thanks.


Solution

  • Not sure of your exact logic, but maybe:

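    # Flag NaNs, then per column1 group keep only the leading run of NaNs
    # (cummin turns False at the first non-NaN and stays False), count it, add 1.
    df['result'] = (df['column3'].isna()
                      .groupby(df['column1'])
                      .transform(lambda x: x.cummin().sum()+1)
                    )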
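
    To try this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same idea. The `is_nan` name and the cast to `int` are my own additions (the cast just keeps the result column plainly integer), and the `+ 1` is only there because the expected output in the question is one more than the number of leading NaNs.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "column1": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    "column2": pd.to_datetime([
        "2023-02-21", "2023-02-22", "2023-02-23", "2023-02-24", "2023-02-24",
        "2023-02-24", "2023-02-24", "2023-02-24", "2023-02-25", "2023-02-26",
    ]),
    "column3": [np.nan, np.nan, 8, np.nan, np.nan, np.nan, np.nan, 10, np.nan, 9],
})

# 1 where column3 is NaN, 0 otherwise (the int cast is an assumption of this sketch).
is_nan = df["column3"].isna().astype(int)

# Within each column1 group, cummin stays 1 only until the first non-NaN row,
# so its sum is the number of leading NaNs; the question's output adds 1 to that.
df["result"] = is_nan.groupby(df["column1"]).transform(
    lambda flags: flags.cummin().sum() + 1
)

print(df)
```

    Group 1 starts with two NaNs and group 2 with one, so `result` comes out as 3 and 2. The later NaNs in group 1 are not counted because the running minimum has already dropped to 0 by then.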