Search code examples
pythonpandassumrow

Pandas: Find the left-most value in a pandas dataframe followed by all 1s


I have the following dataset

data = {'ID': ['A', 'B', 'C', 'D'], 
        '2012': [0, 1, 1, 1], 
        '2013': [0, 0, 1, 1], 
        '2014': [0, 0, 0, 1], 
        '2015': [0, 0, 1, 1], 
        '2016': [0, 0, 1, 0], 
        '2017': [1, 0, 1,1]}

df  = pd.DataFrame(data)

For each row I want to generate a new column - Baseline_Year - which assumes the name of the column with all values to the right that are equal to 1. In case there is not column with all the values equal to 1, I would like the Baseline_Year to be equal to missing.

See the expected results

data = {'ID': ['A', 'B', 'C', 'D', 'E'], 
        '2012': [0, 1, 1, 1, 1], 
        '2013': [0, 0, 1, 1, 1], 
        '2014': [0, 0, 0, 1, 1], 
        '2015': [0, 0, 1, 1, 1], 
        '2016': [0, 0, 1, 0, 1], 
        '2017': [1, 0, 1,1, 1],
        'Baseline_Year': [np.nan, np.nan, '2015','2017', '2012'],
       }

df_results  = pd.DataFrame(data)

df_results

Solution

  • I would use a boolean mask and idxmax:

    # get year columns, identify rightmost 1s
    m = (df.filter(regex=r'\d+')
           .loc[:, ::-1]
           .eq(1).cummin(axis=1)
           .loc[:, ::-1]
         )
    
    df['Baseline_Year'] = m.idxmax(axis=1).where(m.any(axis=1))
    

    Output:

    
      ID  2012  2013  2014  2015  2016  2017 Baseline_Year
    0  A     0     0     0     0     0     1          2017
    1  B     1     0     0     0     0     0           NaN
    2  C     1     1     0     1     1     1          2015
    3  D     1     1     1     1     0     1          2017
    

    If you want a minimum number of 1s on the right:

    N = 2
    
    df['Baseline_Year'] = m.idxmax(axis=1).where(m.sum(axis=1).ge(N))
    

    Output:

    
      ID  2012  2013  2014  2015  2016  2017 Baseline_Year
    0  A     0     0     0     0     0     1           NaN
    1  B     1     0     0     0     0     0           NaN
    2  C     1     1     0     1     1     1          2015
    3  D     1     1     1     1     0     1           NaN
    

    Intermediate m:

    
        2012   2013   2014   2015   2016   2017
    0  False  False  False  False  False   True
    1  False  False  False  False  False  False
    2  False  False  False   True   True   True
    3  False  False  False  False  False   True