Tags: pandas, dataframe, scikit-learn, python-3.5, sklearn-pandas

Pandas fill cells in a column with NaN values, derive the value from other cells in the row


I have a dataframe:

     a    b      c
0    1    2      3 
1    1    1      1
2    3    7      NaN
3    2    3      5
...

I want to fill column "c" in place (update the values) wherever the value is NaN, using a machine learning algorithm.

I don't know how to do it in place. Sample code:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.DataFrame([range(3), [1, 5, np.nan], [2, 2, np.nan], [4, 5, 9], [2, 5, 7]], columns=['a', 'b', 'c'])
x=[]
y=[]
for row in df.iterrows():
    index,data = row
    if(not pd.isnull(data['c'])):
        x.append(data[['a','b']].tolist())
        y.append(data['c'])

model = LinearRegression()
model.fit(x,y)

#this line does not do it in place.
df[~df.c.notnull()].assign(c = lambda x:model.predict(x[['a','b']]))

But this gives me a copy of the dataframe. The only option I have left is a for loop, and I don't want to do that. I think there should be a more Pythonic way of doing it with pandas. Can someone please help? Or is there any other way of doing this?


Solution

  • You'll have to do something like:

    df.loc[pd.isnull(df['c']), 'c'] = _result of model_

    This modifies the dataframe df directly.

    This way you first filter the dataframe to keep the rows you want to modify (pd.isnull(df['c'])), then from those rows you select the column you want to modify (c).

    The right-hand side of the assignment must be an array, list, or Series with the same number of rows as the filtered dataframe (in your example table, one row).

    You may have to adjust this depending on what exactly your model returns.

    EDIT

    You probably need to do something like this:

    # predict for every row and keep the result in a helper column
    df['pred'] = model.predict(df[['a', 'b']])
    # copy the prediction into c only where c is NaN
    df.loc[pd.isnull(df['c']), 'c'] = df.loc[pd.isnull(df['c']), 'pred']
    

    Note that a significant part of the issue comes from the way you are using scikit-learn in your example. In the snippet above, the model predicts for the whole dataset, so the pred column has one value per row and lines up with c when you copy it over.
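
For reference, the approach in the answer can be put together as a complete script. This is only a sketch under some assumptions: it reuses the sample data from the question, builds the training set with a boolean mask instead of iterrows, and (as a variant, not what the answer wrote) predicts only for the rows where c is missing. The name `missing` is introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data from the question
df = pd.DataFrame(
    [[0, 1, 2], [1, 5, np.nan], [2, 2, np.nan], [4, 5, 9], [2, 5, 7]],
    columns=['a', 'b', 'c'],
)

# Boolean mask: True where c is NaN (hypothetical helper name)
missing = df['c'].isnull()

# Fit on the rows where c is known
model = LinearRegression()
model.fit(df.loc[~missing, ['a', 'b']], df.loc[~missing, 'c'])

# Predict only for the missing rows and write the result back into df.
# The same mask is used on both sides, so the lengths match.
if missing.any():
    df.loc[missing, 'c'] = model.predict(df.loc[missing, ['a', 'b']])

print(df)
```

Because the assignment goes through df.loc on df itself, no copy is involved and no helper column is left behind. The guard on `missing.any()` is there because predict raises an error when it is given zero rows.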