Tags: python, tensorflow, conv-neural-network, feature-extraction

ValueError When Combining Numerical, Categorical, and Image Features for Machine Learning Model


I’m encountering a ValueError while attempting to combine numerical, categorical, and image features into a single feature set for a machine learning model. I have followed the steps for feature extraction and preprocessing but am still facing issues.

Here’s a summary of what I’m trying to do:

1. Load and preprocess numerical and categorical features.
2. Extract and preprocess image features using a pre-trained CNN model.
3. Combine these features into a single dataset.

Code and Error:


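# Imports needed by the code below (data is assumed to be a pandas DataFrame loaded earlier)
import numpy as np
from PIL import Image

# Features and target variable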
X = data[["ID",'Thinckness', 'Weight', 'Surface', 'Color', 'Transparence']]
y = data['Material']

# Image loading function
def load_image(image_id, base_path='Camera2/front'):
    # Replace with the path to your images directory
    image_path = f"{base_path}/{image_id}.jpeg"
    try:
        with Image.open(image_path) as img:
            img = img.resize((128, 128))  # Resize image
            return np.array(img)
    except FileNotFoundError:
        # Return NaN or a placeholder image (e.g., all zeros)
        print(f"Image file not found: {image_path}")
        return np.full((128, 128, 3), np.nan)  # Return a placeholder image with NaN values
# Load images
images = np.array([load_image(image_id) for image_id in data['ID']])

from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


# Load a pre-trained CNN model for feature extraction
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(128, 128, 3))
model = Model(inputs=base_model.input, outputs=base_model.output)


def extract_features_from_images(images):
    features = []
    for img in images:
        if np.isnan(img).any():  # Check if the image contains NaN values
            features.append(np.zeros((4 * 4 * 512,)))  # Return a zero-filled vector as placeholder
        else:
            img = preprocess_input(img)
            img = np.expand_dims(img, axis=0)
            feature = model.predict(img)
            features.append(feature.flatten())
    return np.array(features)

# Extract image features
image_features = extract_features_from_images(images)


if image_features.ndim == 3:
    # Flatten the image features to 2D: [n_samples, height * width * channels]
    image_features = image_features.reshape(image_features.shape[0], -1)

# Define numerical and categorical features
numeric_features = ['Thinckness', 'Weight', 'Surface']
categorical_features = ['Color', 'Transparence']

# Preprocessing for numerical data
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values
    ('scaler', StandardScaler())  # Normalize numerical data
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical data
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

numeric_categorical_features = preprocessor.fit_transform(data[numeric_features + categorical_features])


# Combine numerical, categorical, and image features
combined_features = np.hstack([
    numeric_categorical_features,
    image_features
])


Shape:

Numeric/Categorical Features Shape: (1099, 19)
Image Features Shape: (1099, 8192)

Error:

ValueError                                Traceback (most recent call last)

Cell In[103], line 2
      1 # Combine numerical, categorical, and image features
----> 2 combined_features = np.hstack([
      3     numeric_categorical_features,
      4     image_features
      5 ])


ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)

Solution

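  • I ran your pipeline with a debugger and found that OneHotEncoder produces a scipy.sparse.csr_matrix by default. ColumnTransformer has a sparse_threshold parameter (default: 0.3): if any transformer returns sparse output and the overall density of the combined result is below that value, ColumnTransformer returns a sparse matrix as well. By my count, your shapes give about 5 non-zero values out of 19 per row (3 numeric plus 2 one-hot), a density of roughly 0.26.
    That is why numeric_categorical_features ended up as a sparse matrix. np.hstack cannot stack a SciPy sparse matrix with a NumPy array: it wraps the sparse matrix in a 1-dimensional object array, which is where the "1 dimension(s)" in the error message comes from. To fix this, you have at least two options.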

    1. Make OneHotEncoder return a dense array directly (before scikit-learn 1.2 the parameter was called sparse, not sparse_output):
    OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    
    2. Or force the output of the ColumnTransformer to always be dense (i.e. a NumPy array), regardless of whether the individual transformer outputs are sparse:
    ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ], sparse_threshold=0.0)
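
To check whether this is what is happening in your own session, a small reproduction can help. The following is a minimal sketch, not your actual pipeline: the toy DataFrame, its values, and the 4-column stand-in for the CNN features are all made up for illustration. It confirms that the ColumnTransformer output is sparse, shows np.hstack failing on it, and then tries two further routes that are not among the options above: densifying with .toarray(), or stacking with scipy.sparse.hstack.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the question's DataFrame.
toy = pd.DataFrame({
    "Weight": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Color": ["red", "green", "blue", "black", "white", "grey"],
    "Transparence": ["a", "b", "c", "d", "e", "f"],
})

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Weight"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Color", "Transparence"]),
])
tabular = preprocessor.fit_transform(toy)

# With many one-hot columns the density drops below sparse_threshold,
# so the result is a SciPy sparse matrix rather than a NumPy array.
print(type(tabular), sparse.issparse(tabular))

# Made-up dense block standing in for the flattened CNN features.
image_features = np.ones((len(toy), 4))

# np.hstack does not understand sparse matrices and raises the ValueError.
try:
    np.hstack([tabular, image_features])
except ValueError as exc:
    print("np.hstack failed:", exc)

# Route A: densify the tabular block, then stack with NumPy.
combined_dense = np.hstack([tabular.toarray(), image_features])

# Route B: keep the result sparse and let SciPy do the stacking.
combined_sparse = sparse.hstack(
    [tabular, sparse.csr_matrix(image_features)]
).tocsr()

print(combined_dense.shape, combined_sparse.shape)
```

Since the image features in the question are a dense (1099, 8192) array anyway, keeping the combined matrix sparse probably saves little memory, so a dense result (either option above, or route A) is likely the simpler choice.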