
When I convert my NumPy array to a DataFrame, all values become NaN


import pandas as pd
import impyute.imputation.cs as imp

print(Data)
Data = pd.DataFrame(data = imp.em(Data),columns = columns)
print(Data)

When I run the code above, all my values are converted to NaN, as shown below. Can someone help me see where I am going wrong?

Before

     Time  LymphNodeStatus    ...      MeanPerimeter  TumorSize
0      31              5.0    ...             117.50        5.0
1      61              2.0    ...             122.80        3.0
2     116              0.0    ...             137.50        2.5
3     123              0.0    ...              77.58        2.0
4      27              0.0    ...             135.10        3.5
5      77              0.0    ...              84.60        2.5

After

     Time  LymphNodeStatus    ...      MeanPerimeter  TumorSize
0     NaN              NaN    ...                NaN        NaN
1     NaN              NaN    ...                NaN        NaN
2     NaN              NaN    ...                NaN        NaN
3     NaN              NaN    ...                NaN        NaN
4     NaN              NaN    ...                NaN        NaN
5     NaN              NaN    ...                NaN        NaN

Solution

  • Edited

    Solution first

    Instead of passing columns to pd.DataFrame, just manually assign column names:

    data = pd.DataFrame(imp.em(data))
    data.columns = columns
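
The relabeling can also be checked without impyute installed. This is a minimal sketch, assuming the imputer hands back a DataFrame with integer column labels; the `imputed` frame and its values are a hypothetical stand-in, not actual impyute output.

```python
import numpy as np
import pandas as pd

columns = ["Time", "TumorSize"]

# Hypothetical stand-in for the DataFrame returned by imp.em(Data):
# real values, but integer column labels 0..n-1.
imputed = pd.DataFrame(np.array([[31.0, 5.0], [61.0, 3.0]]))

# Option 1: assign the labels after construction.
relabeled = imputed.copy()
relabeled.columns = columns

# Option 2: drop to a NumPy array first, so `columns=` labels the data
# instead of selecting from the existing labels.
rebuilt = pd.DataFrame(imputed.to_numpy(), columns=columns)

print(relabeled)
print(rebuilt)
```

Both options keep the values and only change the labels; which one reads better is mostly a matter of taste.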
    

    Cause

    The error lies in Data = pd.DataFrame(data = imp.em(Data),columns = columns).

    imp.em has a decorator, @preprocess, which converts the input into a numpy.array if it is a pandas.DataFrame and wraps the result in a new pandas.DataFrame before returning it.

    ...
    if pd_DataFrame and isinstance(args[0], pd_DataFrame):
        args[0] = args[0].as_matrix()
        return pd_DataFrame(fn(*args, **kwargs))
    

    It therefore returns a DataFrame reconstructed from a matrix, with range(data.shape[1]) as its column names, so the original column labels are lost.
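
To see the effect in isolation, here is a simplified sketch of a decorator with the same behaviour. `preprocess_sketch` and `identity` are hypothetical names used for illustration; this is not impyute's actual implementation.

```python
import functools

import numpy as np
import pandas as pd


def preprocess_sketch(fn):
    """Hypothetical stand-in: unwrap a DataFrame, call fn, wrap the result again."""
    @functools.wraps(fn)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, pd.DataFrame):
            # The column labels are dropped here and never restored.
            return pd.DataFrame(fn(data.to_numpy(), *args, **kwargs))
        return fn(data, *args, **kwargs)
    return wrapper


@preprocess_sketch
def identity(arr):
    return arr


df = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
out = identity(df)
print(list(out.columns))  # integer labels, not "time" / "size"
```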

    When pd.DataFrame is instantiated from another pd.DataFrame with column names that do not match the existing ones, all the contents become NaN.

    You can test this with:

    import pandas as pd
    from impyute.util import preprocess
    
    # Identity function: only the decorator's DataFrame handling is exercised
    @preprocess
    def test(data):
        return data
    
    data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
    columns = data.columns
    
    data = pd.DataFrame(test(data), columns = columns)
    
    size    time
    0   NaN NaN
    1   NaN NaN
    2   NaN NaN
    

    When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which columns of the original dataframe you want to keep.

    It does not relabel the dataframe. This is not a quirk; it is how pandas reindexing is meant to work:

    By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.

    # Make new pseudo dataset
    data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
    data
        size    time
    0   3   1
    1   2   2
    2   1   3
    
    # Build a new DataFrame from the original `data`
    data = pd.DataFrame(data, columns = ["a", "b"])
    data
    a   b
    0   NaN NaN
    1   NaN NaN
    2   NaN NaN
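
The difference between selecting and relabeling can be seen side by side. A small sketch, reusing the pseudo dataset above; the names `selected`, `reindexed` and `renamed` are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})

# `columns=` on an existing DataFrame selects by label:
# "time" exists and is kept, "a" does not exist and is filled with NaN.
selected = pd.DataFrame(data, columns=["time", "a"])

# The same selection written explicitly as a reindex.
reindexed = data.reindex(columns=["time", "a"])

# To change the labels while keeping the values, use rename instead.
renamed = data.rename(columns={"time": "a", "size": "b"})

print(selected)
print(renamed)
```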