python python-3.x machine-learning imputation

When I convert my NumPy array to a DataFrame, all values become NaN

import pandas as pd
import impyute.imputation.cs as imp

print(Data)
Data = pd.DataFrame(data = imp.em(Data),columns = columns)
print(Data)

When I run the code above, all my values are converted to NaN, as shown below. Can someone help me see where I am going wrong?

Before

     Time  LymphNodeStatus    ...      MeanPerimeter  TumorSize
0      31              5.0    ...             117.50        5.0
1      61              2.0    ...             122.80        3.0
2     116              0.0    ...             137.50        2.5
3     123              0.0    ...              77.58        2.0
4      27              0.0    ...             135.10        3.5
5      77              0.0    ...              84.60        2.5

After

     Time  LymphNodeStatus    ...      MeanPerimeter  TumorSize
0     NaN              NaN    ...                NaN        NaN
1     NaN              NaN    ...                NaN        NaN
2     NaN              NaN    ...                NaN        NaN
3     NaN              NaN    ...                NaN        NaN
4     NaN              NaN    ...                NaN        NaN
5     NaN              NaN    ...                NaN        NaN

Solution

Edited

Solution first

Instead of passing columns to pd.DataFrame, just manually assign column names:

data = pd.DataFrame(imp.em(data))
data.columns = columns
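
If you prefer to keep everything in one expression, another option is to pass pandas a plain array instead of a DataFrame, so that columns labels the data rather than selecting from it. The following is a minimal pandas-only sketch: fake_impute is a hypothetical stand-in for imp.em (it is not part of impyute) and is assumed to return a DataFrame with default integer column labels.

```python
import pandas as pd


def fake_impute(df):
    # Hypothetical stand-in for imp.em: returns the same values in a new
    # DataFrame whose columns are 0, 1, ... instead of the original names.
    return pd.DataFrame(df.to_numpy())


data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
columns = data.columns

# Option 1: relabel after construction.
fixed_a = fake_impute(data)
fixed_a.columns = columns

# Option 2: convert the result to a plain array first, so `columns`
# is applied as labels instead of being used to select columns.
fixed_b = pd.DataFrame(fake_impute(data).to_numpy(), columns=columns)

print(fixed_a)
print(fixed_b)
```

Both variants keep the values and restore the original column names; which one reads better is a matter of taste.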

Cause

The error lies in Data = pd.DataFrame(data = imp.em(Data),columns = columns).

imp.em has a decorator @preprocess which converts input into a numpy.array if it is a pandas.DataFrame.

...
if pd_DataFrame and isinstance(args[0], pd_DataFrame):
    args[0] = args[0].as_matrix()
    return pd_DataFrame(fn(*args, **kwargs))

It therefore returns a dataframe reconstructed from a matrix, having range(data.shape[1]) as column names.
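
The loss of labels can be reproduced without impyute. This sketch only assumes that the decorator does a DataFrame to array to DataFrame round trip, as in the excerpt above; the variable names are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})

# Mimic the round trip: DataFrame -> array -> DataFrame.
# The array carries no labels, so pandas falls back to default integer columns.
round_tripped = pd.DataFrame(original.to_numpy())

print(list(original.columns))
print(list(round_tripped.columns))

# None of the round-tripped labels match the original names.
shared = set(original.columns) & set(round_tripped.columns)
print(shared)
```

Because the two sets of labels have nothing in common, asking the round-tripped frame for the original names finds no matching columns.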

As I show further below, when pd.DataFrame is instantiated from another pd.DataFrame with column names that do not match, all the contents become NaN.

You can test this with:

import pandas as pd
from impyute.util import preprocess

@preprocess
def test(data):
    # Identity function: only the decorator touches the data
    return data

data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
columns = data.columns

data = pd.DataFrame(test(data), columns = columns)

   size  time
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN

When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which columns of the original dataframe you want to use.

It does not relabel the dataframe. This is not odd; it is simply how pandas intends reindexing to work:

By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.

# Make new pseudo dataset
data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
data
   size  time
0     3     1
1     2     2
2     1     3

# Make new dataset with original `data`
data = pd.DataFrame(data, columns = ["a", "b"])
data
     a    b
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
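
A partially matching column list makes the selection behaviour easier to see. In this small sketch, "time" exists in the source frame and "b" does not. The comparison with reindex is included only to illustrate the same pattern of missing values, not as a statement about pandas internals.

```python
import pandas as pd

data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})

# "time" is an existing column, "b" is not.
subset = pd.DataFrame(data, columns=["time", "b"])
print(subset)

# Selecting the same labels with reindex gives the same pattern:
# existing columns keep their values, unknown labels are filled with NaN.
reindexed = data.reindex(columns=["time", "b"])
print(reindexed)
```

The matching column keeps its values and "size" is dropped, while the unknown label "b" comes back entirely as NaN.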