import impyute.imputation.cs as imp
print(Data)
Data = pd.DataFrame(data = imp.em(Data),columns = columns)
print(Data)
When I run the code above, all my values are converted to NaN, as shown below. Can someone help me see where I am going wrong?
Before
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 31 5.0 ... 117.50 5.0
1 61 2.0 ... 122.80 3.0
2 116 0.0 ... 137.50 2.5
3 123 0.0 ... 77.58 2.0
4 27 0.0 ... 135.10 3.5
5 77 0.0 ... 84.60 2.5
After
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
Edited
Solution first
Instead of passing columns to pd.DataFrame, just manually assign column names:
data = pd.DataFrame(imp.em(data))
data.columns = columns
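The fix can be checked without impyute installed. The following is a minimal sketch, assuming the imputation call hands back a frame whose column labels are plain integers (as explained under Cause); the `imputed` frame and the two column names are illustrative stand-ins, not output from imp.em.

```python
import numpy as np
import pandas as pd

columns = ["Time", "TumorSize"]  # illustrative column names

# Stand-in for the frame returned by the imputation call: it is built from
# a bare array, so its column labels are the integers 0 and 1.
imputed = pd.DataFrame(np.array([[31.0, 5.0], [61.0, 3.0], [116.0, 2.5]]))

# Re-wrapping with columns= looks up labels that do not exist: every cell is NaN.
broken = pd.DataFrame(imputed, columns=columns)

# Assigning to .columns only renames the labels; the values are untouched.
fixed = imputed.copy()
fixed.columns = columns

print(broken)
print(fixed)
```

Assigning to `.columns` requires the new list to have the same length as the existing columns, so a mismatch surfaces as an error instead of silent NaN values.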
Cause
The error lies in Data = pd.DataFrame(data = imp.em(Data),columns = columns).
imp.em has a decorator, @preprocess, which converts the input into a numpy.array if it is a pandas.DataFrame.
...
if pd_DataFrame and isinstance(args[0], pd_DataFrame):
    args[0] = args[0].as_matrix()
    return pd_DataFrame(fn(*args, **kwargs))
It therefore returns a dataframe reconstructed from a matrix, with range(data.shape[1]) as its column names.
And as I point out below, when a pd.DataFrame is instantiated from another pd.DataFrame with mismatching columns, all the contents become NaN.
You can test this by
import pandas as pd
from impyute.util import preprocess
@preprocess
def test(data):
    return data
data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
columns = data.columns
data = pd.DataFrame(test(data), columns = columns)
size time
0 NaN NaN
1 NaN NaN
2 NaN NaN
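The same behaviour can be reproduced without impyute. Below is a rough sketch using a hypothetical stand-in decorator, `array_round_trip`, which mimics only the array round trip described above; it is not impyute's actual code. It uses `to_numpy()` because `as_matrix()` was removed in pandas 1.0.

```python
from functools import wraps

import pandas as pd


def array_round_trip(fn):
    """Hypothetical stand-in: pass a DataFrame to fn as an array, re-wrap the result."""
    @wraps(fn)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, pd.DataFrame):
            # The array carries no labels, so the rebuilt frame gets 0, 1, ...
            return pd.DataFrame(fn(data.to_numpy(), *args, **kwargs))
        return fn(data, *args, **kwargs)
    return wrapper


@array_round_trip
def identity(data):
    return data


original = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
result = identity(original)

# The values survive, but the column labels are now integers.
print(result.columns.tolist())

# Asking for "time" and "size" on that frame finds nothing, hence all NaN.
relabelled = pd.DataFrame(result, columns=original.columns)
print(relabelled)
```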
When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which of the columns from the original dataframe you want to use.
It does not re-label the dataframe. This is not odd; it is just how pandas is intended to behave when reindexing:
By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.
# Make new pseudo dataset
data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
data
size time
0 3 1
1 2 2
2 1 3
# Make new dataset from the original `data`
data = pd.DataFrame(data, columns = ["a", "b"])
data
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
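A partially matching column list makes the selection behaviour easier to see. This is a small sketch on the same toy data; the label "a" is deliberately one that does not exist in the frame, and `rename` is shown as one way to relabel instead.

```python
import pandas as pd

data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})

# One matching label ("time") and one unknown label ("a"):
# the matching column is kept, the unknown one is filled with NaN.
subset = pd.DataFrame(data, columns=["time", "a"])

# The same selection written explicitly as a reindex.
reindexed = data.reindex(columns=["time", "a"])

# Relabelling, by contrast, keeps every value.
renamed = data.rename(columns={"time": "a", "size": "b"})

print(subset)
print(renamed)
```

When none of the requested labels match, as with ["a", "b"] above, every column falls into the unknown case and the whole frame is NaN.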