Tags: pandas, dataframe, regression, sklearn-pandas, fillna

Fill Pandas Column NaNs with numpy array values


Sorry if this question seems too basic, but I've been looking for an answer and haven't found one.

So, I have a dataset with lots of NaN values, and I've been working on some regressions to predict those nulls. Since the predictions come back as a numpy.ndarray, I've been trying to fill the gaps in the columns with those arrays, with no success.

I mean, the column is something like this:

           ['Records']
      101       21
      102       22
      103       23 
      104       24
      106       NaN
      107       NaN
      108       NaN
      109       NaN
      110       NaN
      111       29
      112       30

The array is:

   y_pred = [25, 26, 27, 28]

So, fillna doesn't accept numpy arrays for this, and my attempts to pass the array as a dict, a pandas column, etc. didn't work either.

The other issue is the length of the array, which will always be different from the length of the original column.

I appreciate your insights.


Solution

  • First, if you want to replace all missing values with all values of the array, the number of missing values must equal the length of the array:

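    import numpy as np
    import pandas as pd
    
    # sample data from the question: 5 missing values
    df = pd.DataFrame(
        {'Records': [21, 22, 23, 24, np.nan, np.nan, np.nan, np.nan, np.nan, 29, 30]},
        index=[101, 102, 103, 104, 106, 107, 108, 109, 110, 111, 112])
    
    # one value added so the array length matches the number of NaNs
    y_pred = [25, 26, 27, 28, 30]
    # boolean mask marking the missing rows
    m = df['Records'].isna()
    
    # assign the predictions to the missing rows, in order
    df.loc[m, 'Records'] = y_pred
    print(df)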
         Records
    101     21.0
    102     22.0
    103     23.0
    104     24.0
    106     25.0
    107     26.0
    108     27.0
    109     28.0
    110     30.0
    111     29.0
    112     30.0
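
    Since the question mentions that the predictions arrive as a numpy.ndarray, here is a minimal sketch of the same assignment with an explicit length check. The (n, 1) shape of y_pred is an assumption about what a regression model might return, and the values are simply the ones used above.

```python
import numpy as np
import pandas as pd

# Sample column from the question: five missing values.
df = pd.DataFrame(
    {"Records": [21, 22, 23, 24, np.nan, np.nan, np.nan, np.nan, np.nan, 29, 30]},
    index=[101, 102, 103, 104, 106, 107, 108, 109, 110, 111, 112],
)

# Stand-in for a model's predictions (assumed shape: n rows, 1 column).
y_pred = np.array([[25], [26], [27], [28], [30]])

m = df["Records"].isna()      # boolean mask of the missing rows
flat = np.ravel(y_pred)       # flatten to 1-D so it lines up with the mask

# Fail early with a readable message if the lengths differ.
if len(flat) != m.sum():
    raise ValueError(f"got {len(flat)} predictions for {m.sum()} missing values")

df.loc[m, "Records"] = flat   # fill the gaps in order, top to bottom
```

    The check is optional, but it makes a length mismatch obvious before the assignment is attempted.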
    

    If the lengths might not match, create a helper Series trimmed to the shorter of the two lengths and pass it to Series.fillna:

    Here the array has length < number of NaNs:

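    # df is the original DataFrame again (5 NaNs)
    y_pred = [25, 26, 27, 28]
    
    # boolean mask marking the missing rows
    m = df['Records'].isna()
    
    LenNaN = m.sum()        # number of missing values
    LenArr = len(y_pred)    # number of predictions
    
    # trim values to the number of NaNs and index labels to the number of
    # predictions, so both always have the same length
    s = pd.Series(y_pred[:LenNaN], index=df.index[m][:LenArr])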
    print (s)
    106    25
    107    26
    108    27
    109    28
    dtype: int64
    
    df['Records'] = df['Records'].fillna(s)
    print (df)
         Records
    101     21.0
    102     22.0
    103     23.0
    104     24.0
    106     25.0
    107     26.0
    108     27.0
    109     28.0
    110      NaN
    111     29.0
    112     30.0
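
    The two slices in the helper Series may look redundant, so here is a small sketch of what each one does. It assumes the same sample column as above; nan_labels and kept are names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Sample column from the question: five missing values.
df = pd.DataFrame(
    {"Records": [21, 22, 23, 24, np.nan, np.nan, np.nan, np.nan, np.nan, 29, 30]},
    index=[101, 102, 103, 104, 106, 107, 108, 109, 110, 111, 112],
)

# Index labels of the missing rows, in their original order.
nan_labels = df.index[df["Records"].isna()]

kept = {}
for y_pred in ([25, 26, 27, 28], [25, 26, 27, 28, 100, 200, 300]):
    values = y_pred[:len(nan_labels)]   # trims only when the array is too long
    labels = nan_labels[:len(y_pred)]   # trims only when the array is too short
    # After both slices, values and labels share the shorter length.
    assert len(values) == len(labels) == min(len(y_pred), len(nan_labels))
    kept[len(y_pred)] = (labels.tolist(), values)

print(kept)
```

    Whichever side is longer gets cut, which is why the same line works for both the shorter and the longer array.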
    

    Here the array has length > number of NaNs:

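    # df is the original DataFrame again (5 NaNs)
    y_pred = [25, 26, 27, 28, 100, 200, 300]
    
    # boolean mask marking the missing rows
    m = df['Records'].isna()
    
    LenNaN = m.sum()        # number of missing values
    LenArr = len(y_pred)    # number of predictions
    
    # the surplus predictions (200, 300) are dropped by the first slice
    s = pd.Series(y_pred[:LenNaN], index=df.index[m][:LenArr])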
    print (s)
    106     25
    107     26
    108     27
    109     28
    110    100
    dtype: int64
    
    df['Records'] = df['Records'].fillna(s)
    print (df)
         Records
    101     21.0
    102     22.0
    103     23.0
    104     24.0
    106     25.0
    107     26.0
    108     27.0
    109     28.0
    110    100.0
    111     29.0
    112     30.0
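
    Both cases can be wrapped in one small helper. This is only a sketch of the approach above: fill_nans_in_order is a hypothetical name, and it assumes the index labels are unique, since Series.fillna aligns the helper Series on the index.

```python
import numpy as np
import pandas as pd


def fill_nans_in_order(series, values):
    """Fill NaNs top to bottom from `values`; surplus NaNs or values are left out."""
    values = np.ravel(values)                  # accept lists and (n, 1) arrays
    nan_labels = series.index[series.isna()]   # labels of the missing rows
    n = min(len(nan_labels), len(values))      # how many gaps can be filled
    helper = pd.Series(values[:n], index=nan_labels[:n])
    return series.fillna(helper)               # returns a new Series


# Sample column from the question: five missing values.
records = pd.Series(
    [21, 22, 23, 24, np.nan, np.nan, np.nan, np.nan, np.nan, 29, 30],
    index=[101, 102, 103, 104, 106, 107, 108, 109, 110, 111, 112],
    name="Records",
)

short = fill_nans_in_order(records, np.array([25, 26, 27, 28]))
long = fill_nans_in_order(records, np.array([25, 26, 27, 28, 100, 200, 300]))
```

    With the shorter array the last gap (label 110) stays NaN, and with the longer one the extra predictions are ignored, matching the two outputs above.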