I've searched around for an answer, but I couldn't find one.
My goal: I'm trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill them.
My code looks like this: NOTE - THIS FIRST PART IS NOT IMPORTANT, IT IS JUST TO GIVE CONTEXT
import pandas as pd
from sklearn import neighbors
train_df = df[df['my_column'].notna()] #I need to train the model without using the missing data
train_x = train_df[['lat','long']] #lat and long are the inputs
train_y = train_df[['my_column']] #my_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x,train_y) #clf is the classifier, here we train it
df_x = df[['lat','long']] #I need this part to do the prediction
prediction = clf.predict(df_x) #clf.predict() returns an array
series_pred = pd.Series(prediction) #now the array is a Series
print(series_pred.shape) #RETURNS (2381,)
print(series_pred.isna().sum()) #RETURNS 0
So far, so good. I have my 2381 predictions (I need only a few of them) and there is no NaN value inside (why would there be a NaN value in the predictions? I just wanted to be sure, as I don't understand my error)
Here I try to assign the predictions to my DataFrame:
#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred #I assign the predictions using .loc
#test_2
df['my_column'] = df['my_column'].fillna(series_pred) #Double check: I assign the predictions using .fillna()
print(df['my_column'].shape) #RETURNS (2381,)
print(df['my_column'].isna().sum()) #RETURNS 6
As you can see, it didn't work: there are still 6 missing values. I then tried a slightly different approach:
#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred) #Will it work?
print(df[['my_column']].shape) #RETURNS (2381, 1)
print(df[['my_column']].isna().sum()) #RETURNS 6
Did not work. I decided to try one last thing: check the fillna result even before assigning the results to the original df:
In[42]:
print(df['my_column'].fillna(series_pred).isna().sum()) #extreme test
Out[42]:
6
So... where is my very very stupid mistake? Thanks a lot
To show a little bit of the data,
In[1]:
df.head()
Out[1]:
     my_column  lat  long
id
9df        Wil   51     5
4f3      Fabio   47     9
x32      Fabio   47     8
z6f      Fabio   47     9
a6f   Giovanni   47     7
Also, I've added info at the beginning of the question
@Ben.T or @Dan should post their own answers; either one would deserve to be accepted as the correct one.
Following their hints, I would say that there are two solutions:
The problem
The problem with the current solution is that df.loc[df['my_column'].isna(), 'my_column'] expects X values, where X is the number of missing values. My variable prediction actually holds a prediction for every row, missing or not. On top of that, wrapping it in pd.Series gives it a default 0, 1, 2, ... index, and .loc aligns an assigned Series on index labels, so none of those labels match the ids in df.
The solution
pred_df = df[df['my_column'].isna()] #For the prediction, use a Dataframe with only the missing values. Problem solved
df_x = pred_df[['lat','long']]
prediction = clf.predict(df_x)
df.loc[df['my_column'].isna(), 'my_column'] = prediction
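To make this concrete, here is a minimal self-contained sketch of the .loc approach. The toy data, the extra id "b7c", the variable name missing and the choice of n_neighbors=1 are assumptions made for illustration; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up); the string ids mimic the index shown in the question.
df = pd.DataFrame(
    {
        "my_column": ["Wil", "Fabio", np.nan, "Fabio", np.nan, "Wil"],
        "lat": [51, 47, 47, 47, 51, 51],
        "long": [5, 9, 8, 9, 5, 6],
    },
    index=pd.Index(["9df", "4f3", "x32", "z6f", "a6f", "b7c"], name="id"),
)

# Boolean mask of the rows that need filling, computed once and reused.
missing = df["my_column"].isna()

# Train only on the rows where the label is known.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(df.loc[~missing, ["lat", "long"]], df.loc[~missing, "my_column"])

# Predict only for the missing rows: exactly one value per missing row.
prediction = clf.predict(df.loc[missing, ["lat", "long"]])

# A plain NumPy array has no index, so .loc assigns it by position.
df.loc[missing, "my_column"] = prediction

print(df["my_column"].isna().sum())
```

Because prediction stays a NumPy array here, there is no index to align, and its length matches the number of rows selected by the mask.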
The problem
The problem with the current solution is that df['my_column'].fillna(series_pred) requires the index of my df to be the same as the index of series_pred, which is impossible in this situation unless your df has a simple default index, like [0, 1, 2, 3, 4...].
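The mismatch can be reproduced on a tiny example. This is only a sketch: the Series col stands in for df['my_column'], and all the values are made up.

```python
import numpy as np
import pandas as pd

# Stand-in for df['my_column'], indexed by string ids as in the question.
col = pd.Series(["Wil", np.nan, "Fabio", np.nan], index=["9df", "4f3", "x32", "z6f"])

# pd.Series(array) gets a default integer index: 0, 1, 2, 3.
series_pred = pd.Series(np.array(["Wil", "Fabio", "Fabio", "Fabio"]))

# fillna matches on index labels, not on position. The two indexes share
# no label, so nothing gets filled.
filled = col.fillna(series_pred)

print(list(col.index))          # ['9df', '4f3', 'x32', 'z6f']
print(list(series_pred.index))  # [0, 1, 2, 3]
print(filled.isna().sum())      # 2
```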
The solution
Reset the index of the df at the very beginning of the code (df = df.reset_index()), so that it is the same default index that pd.Series(prediction) gets.
Why this is not the best
The cleanest way is to predict only for the rows that need it. That is easy to do with .loc, and I do not know how you would do it with fillna(), because you would need to preserve the index through the classification.
Edit: the index can be preserved by setting series_pred.index = df['my_column'].isna().index (which is the same as df.index) before calling fillna().
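A small sketch of that idea, assuming the predictions were computed for every row. The Series col stands in for df['my_column'], and the prediction array is a made-up stand-in for the output of clf.predict().

```python
import numpy as np
import pandas as pd

# Stand-in for df['my_column'] with the string ids as index.
col = pd.Series(
    ["Wil", np.nan, "Fabio", np.nan],
    index=pd.Index(["9df", "4f3", "x32", "z6f"], name="id"),
    name="my_column",
)

# Stand-in for clf.predict(df[['lat', 'long']]): one made-up value per row.
prediction = np.array(["Wil", "Fabio", "Fabio", "Giovanni"])

# Give the predictions the same index as the column so that labels line up.
series_pred = pd.Series(prediction, index=col.index)

# Only the missing entries are replaced; existing values are kept.
filled = col.fillna(series_pred)
print(filled.tolist())  # ['Wil', 'Fabio', 'Fabio', 'Giovanni']
```

Passing index= to the pd.Series constructor has the same effect as assigning series_pred.index afterwards.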
Thanks @Dan