python pandas numpy supervised-learning fillna

Assignment with both fillna() and loc() apparently not working


I've searched around for an answer, but I couldn't find one.

My goal: I'm trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill it.

My code looks like this (note: this first part is not important, it is only here to give context):

import pandas as pd
from sklearn import neighbors

train_df = df[df['my_column'].notna()]     # train the model only on rows where my_column is known
train_x = train_df[['lat','long']]         # lat and long are the inputs
train_y = train_df['my_column']            # my_column is the output (1-D, as scikit-learn expects)
clf = neighbors.KNeighborsClassifier(2)    # k-nearest neighbours with k=2
clf.fit(train_x,train_y)                   # clf is the classifier; here we train it
df_x = df[['lat','long']]                  # inputs for every row, used for the prediction
prediction = clf.predict(df_x)             # clf.predict() returns an array
series_pred = pd.Series(prediction)        # now the array is a Series
print(series_pred.shape)                   # RETURNS (2381,)
print(series_pred.isna().sum())            # RETURNS 0

So far, so good. I have my 2381 predictions (I need only a few of them) and there is no NaN value inside (why would there be a NaN value in the predictions? I just wanted to be sure, as I don't understand my error)

Here I try to assign the predictions to my DataFrame:

#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred  # I assign the predictions using .loc
#test_2
df['my_column'] = df['my_column'].fillna(series_pred)      # Double check: I assign the predictions using .fillna()
print(df['my_column'].shape)                               # RETURNS (2381,)
print(df['my_column'].isna().sum())                        # RETURNS 6

As you can see, it didn't work: there are still 6 missing values. I then tried a slightly different approach, more or less at random:

#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred)    # Will it work?
print(df[['my_column']].shape)                               # RETURNS (2381, 1)
print(df[['my_column']].isna().sum())                        # RETURNS 6

Did not work. I decided to try one last thing: check the fillna result even before assigning the results to the original df:

In[42]:
print(df['my_column'].fillna(series_pred).isna().sum())  # extreme test
Out[42]:
6

So... where is my very very stupid mistake? Thanks a lot


EDIT 1

To show a little bit of the data,

In[1]:
df.head()
Out[1]:
     my_column  lat  long
id
9df        Wil   51     5
4f3      Fabio   47     9
x32      Fabio   47     8
z6f      Fabio   47     9
a6f   Giovanni   47     7

Also, I've added info at the beginning of the question


Solution

  • @Ben.T or @Dan should post their own answers; either one deserves to be accepted as the correct one.

    Following their hints, I would say that there are two solutions:

    Solution 1 (Best): Use loc()

    The problem

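    The problem with the original attempt is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive X values, where X is the number of missing values. My variable prediction actually holds the predictions for every row, both the missing and the non-missing ones. On top of that, series_pred is a Series with a default 0, 1, 2, ... index, and pandas aligns a Series on index labels when assigning. Those labels do not match the ids in my df's index, so NaN is assigned again.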
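
    To see why nothing gets filled, here is a minimal sketch of the failure on a made-up three-row frame. The data and the values in series_pred are hypothetical; the only thing taken from the question is the shape of the situation: a df indexed by string ids and a Series built from a bare array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: string ids as the index, one missing label.
df = pd.DataFrame(
    {"my_column": ["Wil", "Fabio", np.nan], "lat": [51, 47, 47], "long": [5, 9, 8]},
    index=pd.Index(["9df", "4f3", "x32"], name="id"),
)

# A Series built from a bare array gets a default 0..N-1 index.
series_pred = pd.Series(["Wil", "Fabio", "Fabio"])

mask = df["my_column"].isna()

# Assigning a Series aligns on index labels. "x32" is not among 0..2,
# so the aligned value is NaN and the gap stays.
df.loc[mask, "my_column"] = series_pred
still_missing = int(df["my_column"].isna().sum())

# fillna with a Series aligns the same way, so it also fills nothing.
filled = df["my_column"].fillna(series_pred)
still_missing_fillna = int(filled.isna().sum())

print(still_missing, still_missing_fillna)
```

    Both counts stay at the number of gaps the frame started with, which matches the 6 reported in the question.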

    The solution

    pred_df = df[df['my_column'].isna()]        #For the prediction, use a Dataframe with only the missing values. Problem solved
    df_x = pred_df[['lat','long']]
    prediction = clf.predict(df_x)
    df.loc[df['my_column'].isna(), 'my_column'] = prediction
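
    For anyone who wants to run this end to end, below is a self-contained sketch of the same idea. The data is invented (loosely based on the df.head() output above, with two gaps added), and n_neighbors=1 is my assumption to keep the toy result unambiguous; the question uses 2.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data modelled on the df.head() output in the question.
df = pd.DataFrame(
    {
        "my_column": ["Wil", "Fabio", "Fabio", np.nan, "Giovanni", np.nan],
        "lat": [51, 47, 47, 47, 47, 51],
        "long": [5, 9, 8, 9, 7, 5],
    },
    index=pd.Index(["9df", "4f3", "x32", "z6f", "a6f", "b7c"], name="id"),
)

missing = df["my_column"].isna()  # compute the mask once and reuse it

# Train only on rows whose label is known.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(df.loc[~missing, ["lat", "long"]], df.loc[~missing, "my_column"])

# Predict only for the rows that need filling: one prediction per gap.
prediction = clf.predict(df.loc[missing, ["lat", "long"]])

# A NumPy array is assigned by position, so no index alignment is involved.
df.loc[missing, "my_column"] = prediction
print(df)
```

    Because prediction is a plain array with exactly as many elements as there are gaps, the id index of df never has to match anything.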
    

    Solution 2: Use fillna()

    The problem

    The problem with the original attempt is that df['my_column'].fillna(series_pred) requires the index of my df to match the index of series_pred, which is not the case here unless the df has a simple index like [0, 1, 2, 3, 4...]

    The solution

    Resetting the index of the df at the very beginning of the code.
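
    A rough sketch of what that could look like, assuming the index column is called id as in the df.head() output. The prediction array is a made-up stand-in for clf.predict(...) over every row, and restoring the index at the end is my own addition.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with string ids as the index and one gap.
df = pd.DataFrame(
    {"my_column": ["Wil", np.nan, "Fabio"]},
    index=pd.Index(["9df", "4f3", "x32"], name="id"),
)
# Stand-in for clf.predict(...) over every row; the values are made up.
prediction = np.array(["Wil", "Fabio", "Fabio"])

# reset_index() moves "id" into an ordinary column and leaves a 0..N-1 index,
# which is the same index pd.Series(prediction) gets by default.
df = df.reset_index()
series_pred = pd.Series(prediction)
df["my_column"] = df["my_column"].fillna(series_pred)

# Restore the original ids as the index once the gaps are filled.
df = df.set_index("id")
print(df)
```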

    Why is this not the best

    The cleanest way is to run the prediction only for the rows that need it. That is easy to do with loc(); I do not know how you would do it with fillna(), because you would need to preserve the index through the classification.

    Edit: the index can be put back after the prediction with series_pred.index = df['my_column'].isna().index (which is the same index as df.index). Thanks @Dan
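
    A small sketch of that index-preserving variant, again on invented data. Here the index is passed when the Series is built, which should be equivalent to setting series_pred.index afterwards. The predictions deliberately disagree with the known labels to show that fillna() only touches the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with string ids as the index and one gap.
df = pd.DataFrame(
    {"my_column": ["Wil", np.nan, "Fabio"]},
    index=pd.Index(["9df", "4f3", "x32"], name="id"),
)
# Stand-in for clf.predict(...) over every row; the values are made up
# and differ from the known labels on purpose.
prediction = np.array(["Giovanni", "Fabio", "Giovanni"])

# Give the predictions the same labels as df so fillna can align them.
series_pred = pd.Series(prediction, index=df.index)

# Only the NaN entry is replaced; existing labels are left alone.
df["my_column"] = df["my_column"].fillna(series_pred)
print(df)
```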