I come from an R background and I'm trying to replicate the mutate() function from dplyr in pandas.
I have a dataframe that looks like this:
import numpy as np
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])
I am now trying to create a new column called age_bracket using the assign method as follows:
(df.
assign(age_bracket= lambda x: "under 25" if x['age'] < 25 else
("25-34" if x['age'] < 35 else "35+")))
This throws the following error, which I don't understand:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I am not interested in the following solution:
df['age_bracket'] = np.where(df.age < 25, 'under 25',
(np.where(df.age < 35, "25-34", "35+")))
I do not want the underlying df to change. I'm trying to get better at method chaining, where I can quickly explore my df in different ways without modifying it.
Any suggestions?
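The error occurs because x['age'] < 25 is a whole boolean Series, and a Python if cannot reduce that to a single True or False. Applying the condition to each value separately is possible, but not recommended, because apply loops over the values under the hood: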
df = df.assign(
    age_bracket=lambda x: x['age'].apply(
        lambda y: "under 25" if y < 25 else ("25-34" if y < 35 else "35+")))
print(df)
    name  age  preTestScore  postTestScore age_bracket
0  Jason   42             4             25         35+
1  Molly   52            24             94         35+
2   Tina   36            31             57         35+
3   Jake   24             2             62    under 25
4    Amy   73             3             70         35+
Or use numpy.select:
df = df.assign(age_bracket= np.select([df.age < 25,df.age < 35], ['under 25', "25-34"], "35+"))
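Since the question is about method chaining, the same numpy.select call can also be wrapped in a lambda, so the conditions are evaluated on whatever frame reaches that step of the chain rather than on the outer df variable. This is only a sketch on a trimmed copy of the sample data; the name result is just for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'age': [42, 52, 36, 24, 73]})

# The lambda receives the frame as it exists at this point in the chain.
result = df.assign(
    age_bracket=lambda x: np.select(
        [x['age'] < 25, x['age'] < 35],  # checked in order, first match wins
        ['under 25', '25-34'],
        default='35+',
    )
)

print(result)
print(df.columns.tolist())  # the original frame has no age_bracket column
```

Because assign returns a new DataFrame, df itself stays untouched unless the result is assigned back to it.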
But it is better to use cut here:
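df = df.assign(age_bracket=lambda x: pd.cut(x['age'],
                                            bins=[0, 25, 35, 150],
                                            # right=False gives [0, 25), [25, 35), [35, 150),
                                            # matching the < 25 and < 35 tests in the question
                                            right=False,
                                            labels=["under 25", "25-34", "35+"]))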
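One detail worth checking with cut is which edge of each bin is closed. By default the bins are closed on the right, so with edges [0, 25, 35, 150] an age of exactly 25 lands in the first bin, whereas the conditions in the question (< 25, < 35) put it in "25-34". Passing right=False closes the bins on the left instead. A small sketch on hypothetical boundary ages (these values are not in the original data, and the column names are made up for the comparison):

```python
import pandas as pd

ages = pd.DataFrame({'age': [24, 25, 34, 35]})  # hypothetical boundary values
bins = [0, 25, 35, 150]
labels = ['under 25', '25-34', '35+']

out = ages.assign(
    # default right=True: bins are (0, 25], (25, 35], (35, 150]
    default_edges=lambda x: pd.cut(x['age'], bins=bins, labels=labels),
    # right=False: bins are [0, 25), [25, 35), [35, 150)
    left_closed=lambda x: pd.cut(x['age'], bins=bins, labels=labels, right=False),
)
print(out)
```

The two columns differ only at ages 25 and 35, so the sample data in the question gives the same result either way.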