I am attempting to impute null values with an offset based on the row's average (df.loc[row, 'avg']) and the column's average (impute[col]). Is there a way to do this that would let the method parallelize with .map? Or is there a better way to iterate over the indexes containing null values?
import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [None, 2, 3, 1], 'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3], 'avg': [2.5, 3, 3.5, 2]})
test = test[['a', 'b', 'c', 'avg']]
impute = {'a': 2, 'b': 3.33, 'c': 6}
def smarterImpute(df, impute):
    df2 = df.copy()
    for col in df.columns[:-1]:
        for row in df.index:
            if pd.isnull(df.loc[row, col]):
                df2.loc[row, col] = (impute[col]
                                     + (df.loc[:, 'avg'].mean() - df.loc[row, 'avg']))
    return df2
smarterImpute(test, impute)
Notice that in your 'filling' expression:
impute[col] + (df.loc[:, 'avg'].mean() - df.loc[row, 'avg'])
The first term only depends on the column and the third only on the row; the second is just a constant. So we can create an imputation dataframe to look up whenever there's a value that needs to be filled:
impute_df = pd.DataFrame(impute, index = test.index).add(test.avg.mean() - test.avg, axis = 0)
Then, there's a method in pandas called .combine_first() that fills the NAs in one dataframe with the values from another, which is exactly what we need. We use it, and we're done:
test.combine_first(impute_df)
With pandas, you generally want to avoid explicit loops and make use of vectorized operations instead.
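Putting the steps above together in one runnable sketch (same sample frame and names as in the question):

```python
import numpy as np
import pandas as pd

# Sample data from the question
test = pd.DataFrame({'a': [None, 2, 3, 1], 'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3], 'avg': [2.5, 3, 3.5, 2]})
impute = {'a': 2, 'b': 3.33, 'c': 6}

# Lookup frame: column constant + (overall 'avg' mean - row 'avg'),
# broadcast down the rows with axis=0
impute_df = pd.DataFrame(impute, index=test.index).add(
    test.avg.mean() - test.avg, axis=0)

# Fill each NaN in `test` from the matching cell of `impute_df`;
# columns absent from `impute_df` (like 'avg') pass through untouched
result = test.combine_first(impute_df)
```

For the sample data the mean of 'avg' is 2.75, so e.g. the NaN in row 0, column 'a' becomes 2 + (2.75 - 2.5) = 2.25.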