I have a list of columns of a dataframe that contain NAs (below). The dtype of all these columns is str.
X_train_objects = ['HomePlanet',
'Destination',
'Name',
'Cabin_letter',
'Cabin_number',
'Cabin_letter_2']
I would like to use SimpleImputer to fill in the NAs with the most common value (mode). However, I am getting ValueError: Columns must be same length as key. What is the reason for this? My code seems correct to me.
Dataframe sample (called X_train) of rows where the Destination column is NaN:
{'PassengerId': {47: '0045_02',
128: '0138_02',
139: '0152_01',
347: '0382_01',
430: '0462_01'},
'HomePlanet': {47: 'Mars',
128: 'Earth',
139: 'Earth',
347: nan,
430: 'Earth'},
'CryoSleep': {47: 1, 128: 0, 139: 0, 347: 0, 430: 1},
'Destination': {47: nan, 128: nan, 139: nan, 347: nan, 430: nan},
'Age': {47: 19.0, 128: 34.0, 139: 41.0, 347: 23.0, 430: 50.0},
'VIP': {47: 0, 128: 0, 139: 0, 347: 0, 430: 0},
'RoomService': {47: 0.0, 128: 0.0, 139: 0.0, 347: 348.0, 430: 0.0},
'FoodCourt': {47: 0.0, 128: 22.0, 139: 0.0, 347: 0.0, 430: 0.0},
'ShoppingMall': {47: 0.0, 128: 0.0, 139: 0.0, 347: 0.0, 430: 0.0},
'Spa': {47: 0.0, 128: 564.0, 139: 0.0, 347: 4.0, 430: 0.0},
'VRDeck': {47: 0.0, 128: 207.0, 139: 607.0, 347: 368.0, 430: 0.0},
'Name': {47: 'Mass Chmad',
128: 'Monah Gambs',
139: 'Andan Estron',
347: 'Blanie Floydendley',
430: 'Ronia Sosanturney'},
'Transported': {47: 1, 128: 0, 139: 0, 347: 0, 430: 0},
'Cabin_letter': {47: 'F', 128: 'E', 139: 'F', 347: 'G', 430: 'G'},
'Cabin_number': {47: '10', 128: '5', 139: '32', 347: '64', 430: '67'},
'Cabin_letter_2': {47: 'P', 128: 'P', 139: 'P', 347: 'P', 430: 'S'}}
My Code:
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]
UPDATE:
Based on feedback from OP, the strategy that gives the desired result is to do this:
X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values)
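As a minimal sketch of the per-column approach above, the following runs it on a small made-up frame. The sample values (including the Destination names) and the variable name cols are assumptions for illustration, not taken from the question's data, and the columns are assumed to hold Python strings with object dtype.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample data; values are illustrative only.
X_train = pd.DataFrame(
    {
        "HomePlanet": ["Mars", "Earth", "Earth", np.nan, "Earth"],
        "Destination": ["TRAPPIST-1e", np.nan, "TRAPPIST-1e", "55 Cancri e", np.nan],
        "Cabin_letter": ["F", "E", "F", "G", np.nan],
    },
    dtype=object,
)
cols = ["HomePlanet", "Destination", "Cabin_letter"]

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# The 2D input keeps one column per feature, so the mode is computed
# separately for each column and the output shape matches X_train[cols].
X_train[cols] = imputer.fit_transform(X_train[cols].values)

print(X_train)
```

Because the input stays two-dimensional, the result has the same number of columns as the key on the left-hand side, so the assignment goes through.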
ORIGINAL ANSWER:
Here's what the code in your question does:
It takes the values of X_train[X_train_objects], which has shape (5, 6) in your sample (via .values), and reshapes them into a single column of 30 values using .reshape(-1,1).
It calls imputer.fit_transform, which returns a result whose shape is the same as its input, and then [:,0] turns that result into a 1D array of length 30.
It tries to assign this 1D array to X_train[X_train_objects], which (as mentioned above) has shape (5, 6), or specifically, has only 6 columns.
This gives rise to the error: ValueError: Columns must be same length as key
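The shape mismatch can be checked directly. This is a small sketch on a made-up three-column frame (the sample values and the names cols, flat, imputed and result are assumptions for illustration), so the numbers are 5 x 3 = 15 rather than 5 x 6 = 30.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample data; values are illustrative only.
X_train = pd.DataFrame(
    {
        "HomePlanet": ["Mars", "Earth", "Earth", np.nan, "Earth"],
        "Destination": ["TRAPPIST-1e", np.nan, "TRAPPIST-1e", "55 Cancri e", np.nan],
        "Cabin_letter": ["F", "E", "F", "G", np.nan],
    },
    dtype=object,
)
cols = ["HomePlanet", "Destination", "Cabin_letter"]
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

values = X_train[cols].values          # 2D: one column per feature
flat = values.reshape(-1, 1)           # 2D: a single column holding every value
imputed = imputer.fit_transform(flat)  # same shape as its input
result = imputed[:, 0]                 # 1D array

print(values.shape, flat.shape, imputed.shape, result.shape)
# result is 1D and its length differs from len(cols), so it cannot be
# matched up with the columns selected by X_train[cols].
```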
What I believe you intend is to impute the values originally found in X_train[X_train_objects] and then update the original object by overwriting the original values with the imputed ones. To do this, I think the following should work:
X_train[X_train_objects] = (
    imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]
    .reshape(-1,len(X_train_objects))
)
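One thing to be aware of with the reshape approach: because every value is pooled into a single column before imputing, the mode is computed across all the columns together rather than per column. The sketch below illustrates this on made-up data (the sample values and the names cols and pooled are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample data; values are illustrative only.
X_train = pd.DataFrame(
    {
        "HomePlanet": ["Mars", "Earth", "Earth", np.nan, "Earth"],
        "Destination": ["TRAPPIST-1e", np.nan, "TRAPPIST-1e", "55 Cancri e", np.nan],
        "Cabin_letter": ["F", "E", "F", "G", np.nan],
    },
    dtype=object,
)
cols = ["HomePlanet", "Destination", "Cabin_letter"]
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Pool all values into one column, impute, then restore the original layout.
pooled = imputer.fit_transform(X_train[cols].values.reshape(-1, 1))[:, 0]
X_train[cols] = pooled.reshape(-1, len(cols))

# Every gap is filled with the single most frequent value across all
# columns, so a planet name can end up in Destination or Cabin_letter.
print(X_train)
```

If a per-column mode is what is wanted, passing the 2D values straight to fit_transform avoids this pooling.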