I have the following dataset
data = {'ID': ['A', 'B', 'C', 'D'],
'2012': [0, 1, 1, 1],
'2013': [0, 0, 1, 1],
'2014': [0, 0, 0, 1],
'2015': [0, 0, 1, 1],
'2016': [0, 0, 1, 0],
'2017': [1, 0, 1,1]}
df = pd.DataFrame(data)
For each row I want to generate a new column - Baseline_Year
- which assumes the name of the column with all values to the right that are equal to 1. In case there is not column with all the values equal to 1, I would like the Baseline_Year
to be equal to missing.
See the expected results
data = {'ID': ['A', 'B', 'C', 'D', 'E'],
'2012': [0, 1, 1, 1, 1],
'2013': [0, 0, 1, 1, 1],
'2014': [0, 0, 0, 1, 1],
'2015': [0, 0, 1, 1, 1],
'2016': [0, 0, 1, 0, 1],
'2017': [1, 0, 1,1, 1],
'Baseline_Year': [np.nan, np.nan, '2015','2017', '2012'],
}
df_results = pd.DataFrame(data)
df_results
I would use a boolean mask and idxmax
:
# get year columns, identify rightmost 1s
m = (df.filter(regex=r'\d+')
.loc[:, ::-1]
.eq(1).cummin(axis=1)
.loc[:, ::-1]
)
df['Baseline_Year'] = m.idxmax(axis=1).where(m.any(axis=1))
Output:
ID 2012 2013 2014 2015 2016 2017 Baseline_Year
0 A 0 0 0 0 0 1 2017
1 B 1 0 0 0 0 0 NaN
2 C 1 1 0 1 1 1 2015
3 D 1 1 1 1 0 1 2017
If you want a minimum number of 1s on the right:
N = 2
df['Baseline_Year'] = m.idxmax(axis=1).where(m.sum(axis=1).ge(N))
Output:
ID 2012 2013 2014 2015 2016 2017 Baseline_Year
0 A 0 0 0 0 0 1 NaN
1 B 1 0 0 0 0 0 NaN
2 C 1 1 0 1 1 1 2015
3 D 1 1 1 1 0 1 NaN
Intermediate m
:
2012 2013 2014 2015 2016 2017
0 False False False False False True
1 False False False False False False
2 False False False True True True
3 False False False False False True