Tags: python, pandas, scikit-learn, imputation

sklearn.impute.SimpleImputer: Unable to fill in the most common value for a list of dataframe columns


I have a list of dataframe columns that contain NaNs (below). All of these columns hold strings.

X_train_objects = ['HomePlanet',
 'Destination',
 'Name',
 'Cabin_letter',
 'Cabin_number',
 'Cabin_letter_2']

I would like to use SimpleImputer to fill in the NaNs with the most common value (the mode). However, I am getting a ValueError: Columns must be same length as key. What is the reason for this? My code seems correct to me.

Sample of the dataframe (called X_train), showing rows where the Destination column is np.nan:

{'PassengerId': {47: '0045_02',
  128: '0138_02',
  139: '0152_01',
  347: '0382_01',
  430: '0462_01'},
 'HomePlanet': {47: 'Mars',
  128: 'Earth',
  139: 'Earth',
  347: nan,
  430: 'Earth'},
 'CryoSleep': {47: 1, 128: 0, 139: 0, 347: 0, 430: 1},
 'Destination': {47: nan, 128: nan, 139: nan, 347: nan, 430: nan},
 'Age': {47: 19.0, 128: 34.0, 139: 41.0, 347: 23.0, 430: 50.0},
 'VIP': {47: 0, 128: 0, 139: 0, 347: 0, 430: 0},
 'RoomService': {47: 0.0, 128: 0.0, 139: 0.0, 347: 348.0, 430: 0.0},
 'FoodCourt': {47: 0.0, 128: 22.0, 139: 0.0, 347: 0.0, 430: 0.0},
 'ShoppingMall': {47: 0.0, 128: 0.0, 139: 0.0, 347: 0.0, 430: 0.0},
 'Spa': {47: 0.0, 128: 564.0, 139: 0.0, 347: 4.0, 430: 0.0},
 'VRDeck': {47: 0.0, 128: 207.0, 139: 607.0, 347: 368.0, 430: 0.0},
 'Name': {47: 'Mass Chmad',
  128: 'Monah Gambs',
  139: 'Andan Estron',
  347: 'Blanie Floydendley',
  430: 'Ronia Sosanturney'},
 'Transported': {47: 1, 128: 0, 139: 0, 347: 0, 430: 0},
 'Cabin_letter': {47: 'F', 128: 'E', 139: 'F', 347: 'G', 430: 'G'},
 'Cabin_number': {47: '10', 128: '5', 139: '32', 347: '64', 430: '67'},
 'Cabin_letter_2': {47: 'P', 128: 'P', 139: 'P', 347: 'P', 430: 'S'}}

My Code:

import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]

Solution

  • UPDATE:

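    Based on feedback from OP, the approach that gives the desired result is to pass the two-dimensional array to the imputer unchanged, so that each column is filled with its own most frequent value: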

    X_train[X_train_objects] = imputer.fit_transform(X_train[X_train_objects].values)
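
To make the per-column behaviour concrete, here is a minimal sketch on a small toy frame. The values (including the destination names) are made up for illustration and are not the OP's data; only the column names and the imputer settings come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for X_train: two string columns with gaps, one numeric column.
X_train = pd.DataFrame({
    "HomePlanet": ["Mars", "Earth", "Earth", np.nan, "Earth"],
    "Destination": ["TRAPPIST-1e", np.nan, "TRAPPIST-1e", "55 Cancri e", np.nan],
    "Age": [19.0, 34.0, 41.0, 23.0, 50.0],
})
X_train_objects = ["HomePlanet", "Destination"]

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Keep the array two-dimensional: the imputer computes one mode per column,
# and the result has the same (rows, columns) shape as the selection.
values = X_train[X_train_objects].to_numpy(dtype=object)
X_train[X_train_objects] = imputer.fit_transform(values)

print(X_train)
```

Each gap in HomePlanet becomes "Earth" and each gap in Destination becomes "TRAPPIST-1e", while Age is left alone. One caveat, as far as I know: a column with no observed values at all has no mode, and by default SimpleImputer drops such a column from its output, so the assignment would fail with a shape mismatch again.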
    

    ORIGINAL ANSWER:

    Here's what the code in your question does:

    • selects X_train[X_train_objects], which has shape (5, 6) in the sample
    • converts it to a numpy array (via values) and stacks all 30 values into a single column of shape (30, 1) using .reshape(-1,1)
    • passes this to imputer.fit_transform, which returns a result of the same shape as its input, and then takes [:,0], leaving a 1D array of length 30
    • attempts to assign this 1D array of length 30 to X_train[X_train_objects], which (as mentioned above) has shape (5, 6), that is, only 6 columns

    This gives rise to the error: ValueError: Columns must be same length as key
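
The shape changes described above can be checked directly. This is a sketch on a hypothetical 3 x 2 frame rather than the OP's data, so the sizes are (3, 2) and 6 instead of (5, 6) and 30, but the pattern is the same. On the pandas versions I am aware of, the final assignment raises the ValueError from the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: 3 rows, 2 string columns with gaps.
df = pd.DataFrame({
    "HomePlanet": ["Mars", "Earth", np.nan],
    "Cabin_letter": ["F", np.nan, "F"],
})
cols = ["HomePlanet", "Cabin_letter"]
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

block = df[cols].to_numpy(dtype=object)   # 2D selection
column = block.reshape(-1, 1)             # all values stacked into one column
imputed = imputer.fit_transform(column)   # same shape as its input
flat = imputed[:, 0]                      # 1D array

print(block.shape, column.shape, imputed.shape, flat.shape)
# (3, 2) (6, 1) (6, 1) (6,)

# Assigning the 1D array of length 6 to 2 columns is what fails.
error_message = None
try:
    df[cols] = flat
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```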

    What I believe you intend is to overwrite the original values in X_train[X_train_objects] with the imputed ones. For that, the imputed array needs its original two-dimensional shape back, so I think the following should work:

    X_train[X_train_objects] = (
        imputer.fit_transform(X_train[X_train_objects].values.reshape(-1,1))[:,0]
        .reshape(-1,len(X_train_objects)) )
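
One thing to be aware of with this reshape-and-restore version: because every value is stacked into a single column before fitting, the imputer sees one pooled column and fills every gap with the single most frequent value across all the selected columns. The sketch below, on made-up toy data, compares that with fitting on the 2D array directly.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: "Earth" is the most frequent value overall,
# while "F" is the most frequent value within Cabin_letter.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", np.nan],
    "Cabin_letter": ["F", np.nan, "G", "F"],
})
cols = list(df.columns)
values = df[cols].to_numpy(dtype=object)

# Pooled: flatten to one column, impute, then restore the 2D shape.
pooled = (
    SimpleImputer(missing_values=np.nan, strategy="most_frequent")
    .fit_transform(values.reshape(-1, 1))[:, 0]
    .reshape(-1, len(cols))
)

# Per column: fit on the 2D array, one mode per column.
per_column = SimpleImputer(
    missing_values=np.nan, strategy="most_frequent"
).fit_transform(values)

print(pooled[1, 1])      # Earth (global mode leaks into Cabin_letter)
print(per_column[1, 1])  # F (mode of Cabin_letter itself)
```

Both versions produce an array of the right shape, so both assignments succeed; they differ only in which value ends up in the gaps. If the goal is the mode of each column, fitting on the 2D array is the variant that does that.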