I have been working on this for a while and just can't seem to find the answer to what I need. Suppose I have a dataframe as below.
What I would like to do is fill the last three rows of df['gender']
based on the value in df['home_work']
column, specifically if home_work
> 9, then m
, if not, then f
. Please keep in mind this is just a made up dataset and I don't mean to offend anyone, I promise!
enr = pd.DataFrame({'name_id':[1254, 1359, 1254, 1296, 1353, 2656],
'enrollment_term':['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018', 'spring 2018', 'fall 2020'],
'gpa_term': [2.93, np.nan, 1.65, 4.00, 3.95, 2.92],
'dog_owner':[0,1,1,1, 1, 0],
'salary':[50657, 90658, np.nan, 104352, np.nan, 102043],
'home_work':[34, np.nan, 12, 9, 8, 27],
'gender':['m','f','f',np.nan, np.nan, np.nan]})
enr
Below is the code that I attempted but it presented the error down below:
for i in df['gender'].isna():
if df['home_work'][i] > 9:
df['gender'][i].fillna('m')
else:
df['gender'][i].fillna('f')
KeyError: False
Any help would be greatly appreciated as I have been working on this for a while. I have a dataset of 90K + that I want to adapt this work to and would like to create a function that streamlines this process but have hit a speedbump!
The issue I am running into is that np.nan
defaults and fills in a value for gender
if it doesn't meet the requirement. Thoughts?
Suppose I have the following df:
enr = pd.DataFrame({'name_id':[1254, 1359, 1254, 1296, 1353, 2656],
'enrollment_term':['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018', 'spring 2018', 'fall 2020'],
'gpa_term': [2.93, np.nan, 1.65, 4.00, 3.95, 2.92],
'dog_owner':[0,1,1,1, 1, 0],
'salary':[50657, 90658, np.nan, 104352, np.nan, 102043],
'home_work':[np.nan, np.nan, 0.7, 0.3, 0.64, 0.49],
'gender':[0, 1, 1,np.nan, np.nan, np.nan]})
I would like to impute enr['gender']
based on home_work
. If enr['home_work'] >= 0.5
, then enr['gender'] == 0
, else (as long as enr['home_work'] != np.nan
), enr['gender'] == 1
.
What I don't want is imputation of values in enr[gender]
where their enr['home_work']
is np.nan
I have tried many different techniques but all seem to impute a 1. Thoughts?
Use numpy.where
with Series.fillna
:
enr['gender'] = np.where(enr['home_work'] > 9,
enr['gender'].fillna('m'),
enr['gender'].fillna('f'))
Or filter separately by 2 masks:
m = enr['gender'].isna()
enr.loc[m, 'gender'] = np.where(enr['home_work'] > 9, 'm', 'f')[m]
print (enr)
name_id enrollment_term gpa_term dog_owner salary home_work gender
0 1254 spring 2018 2.93 0 50657.0 34 m
1 1359 spring 2018 NaN 1 90658.0 42 f
2 1254 fall 2018 1.65 1 NaN 12 f
3 1296 spring 2018 4.00 1 104352.0 9 f
4 1353 spring 2018 3.95 1 NaN 8 f
5 2656 fall 2020 2.92 0 102043.0 27 m
EDIT:
m = enr['gender'].isna() & enr['home_work'].notna()
enr.loc[m, 'gender'] = np.where(enr['home_work'] >= 0.5, 0, 1)[m]
print (enr)
name_id enrollment_term gpa_term dog_owner salary home_work gender
0 1254 spring 2018 2.93 0 50657.0 NaN 0.0
1 1359 spring 2018 NaN 1 90658.0 NaN 1.0
2 1254 fall 2018 1.65 1 NaN 0.70 1.0
3 1296 spring 2018 4.00 1 104352.0 0.30 1.0
4 1353 spring 2018 3.95 1 NaN 0.64 0.0
5 2656 fall 2020 2.92 0 102043.0 0.49 1.0