Search code examples
pythonpandasgroup-byboolean

Issues using ifelse in groupby transformation: boolean error


I have a very large dataframe that is generally structured like this:

df= pd.DataFrame({'org_id': [12023, 34340, 12023, 34953], #organization id
'year': [2010, 2010, 2011, 2012], #year of organization data
'2010_present': [1,1,0,0], #Dummy variable indicating whether the df contains an org's data for 2010
'2011_present': [1,0,1,0], #Dummy variable indicating whether the df contains an org's data for 2011
'2012_present': [0,0,0,1] #Dummy variable indicating whether the df contains an org's data for 2012
})

Now, I am trying to create a column ('following_y') that indicates whether an organization's data exists in the data frame for the following year. For example the row for Org 12023, 2010 would get a value of 1 because Org 12023's data is also present in the dataframe for 2011.

Rows with the latest year of the dataframe would get an input of np.nan. (eg. Organization entries from 2012 would get a NA in the 'following_y' column). Based on the above, it would look like this:

df['following_y']: [1,0,0,np.nan]

I decided the best way to tackle this was with a groupby, transformation combination that looks like the following:

for year in list(df.year.unique())[-1]: 
    df['following_y']=df.loc[df['year']==year].groupby('org_id')[str(year+1)+'_present'].transform(lambda x: 1 if x==1 else 0)

However, I am getting the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I think it has something to do with the x==1 part of my lambda function (maybe my Groupby Series object is somehow not working with this if/else format?) After hours of reading through help pages, I would so appreciate any advice or help you might be able to offer!


Solution

  • On the same lines as Andrej, but simplifying with get:

    df['following_y'] = df.apply(lambda r:r.get(f'{r["year"]+1}_present'), axis=1)
    

    Output:

       org_id  year  2010_present  2011_present  2012_present  following_y
    0   12023  2010             1             1             0          1.0
    1   34340  2010             1             0             0          0.0
    2   12023  2011             0             1             0          0.0
    3   34953  2012             0             0             1          NaN
    

    Note, dependent on the dtype of year, you may need to use f'{int(r["year"]+1)}_present'.