I have a very large dataframe that is generally structured like this:
df= pd.DataFrame({'org_id': [12023, 34340, 12023, 34953], #organization id
'year': [2010, 2010, 2011, 2012], #year of organization data
'2010_present': [1,1,0,0], #Dummy variable indicating whether the df contains an org's data for 2010
'2011_present': [1,0,1,0], #Dummy variable indicating whether the df contains an org's data for 2011
'2012_present': [0,0,0,1] #Dummy variable indicating whether the df contains an org's data for 2012
})
Now, I am trying to create a column ('following_y') that indicates whether an organization's data exists in the data frame for the following year. For example the row for Org 12023, 2010 would get a value of 1 because Org 12023's data is also present in the dataframe for 2011.
Rows with the latest year of the dataframe would get an input of np.nan. (eg. Organization entries from 2012 would get a NA in the 'following_y' column). Based on the above, it would look like this:
df['following_y']: [1,0,0,np.nan]
I decided the best way to tackle this was with a groupby, transformation combination that looks like the following:
for year in list(df.year.unique())[-1]:
df['following_y']=df.loc[df['year']==year].groupby('org_id')[str(year+1)+'_present'].transform(lambda x: 1 if x==1 else 0)
However, I am getting the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think it has something to do with the x==1 part of my lambda function (maybe my Groupby Series object is somehow not working with this if/else format?) After hours of reading through help pages, I would so appreciate any advice or help you might be able to offer!
On the same lines as Andrej, but simplifying with get
:
df['following_y'] = df.apply(lambda r:r.get(f'{r["year"]+1}_present'), axis=1)
Output:
org_id year 2010_present 2011_present 2012_present following_y
0 12023 2010 1 1 0 1.0
1 34340 2010 1 0 0 0.0
2 12023 2011 0 1 0 0.0
3 34953 2012 0 0 1 NaN
Note, dependent on the dtype
of year
, you may need to use f'{int(r["year"]+1)}_present'
.