Tags: python, machine-learning, scikit-learn, data-science

How to perform standardization and normalization on features from different feature engineering processes?


I'm working with a dataset where each sample contains both numeric and text data, so I use several methods to build the training feature matrix. For each sample in the dataset, I construct a vector representation from three parts.

  1. Doc2Vec vector representation for paragraph text: I use the gensim implementation of paragraph vectors to encode the text into a 100-D vector of floats in the range [-5, 5].

  2. One-hot encoded vector for text labels: Each sample in the dataset has zero or more text labels. I collect all of the unique labels used in the dataset and encode each sample's labels as a binary array containing only 0 and 1. For example, if the complete set of labels is [Python, Java, JavaScript, C++] and a sample has the labels Python and Java, the resulting vector will be [1, 1, 0, 0].

  3. Numeric data & categorical data:

    • Numeric data fields are built into the feature vector as is
    • Categorical data are mapped to integers and built into the feature vector
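
Steps 2 and 3 can be sketched roughly as follows. The question does not say which tools were used for these two steps, so scikit-learn's `MultiLabelBinarizer` and `OrdinalEncoder` are an assumption here, and the sample labels and category values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, OrdinalEncoder

# Step 2: fix the label order so that column i always refers to the same label.
all_labels = ["Python", "Java", "JavaScript", "C++"]
mlb = MultiLabelBinarizer(classes=all_labels)

# Hypothetical samples: each one carries zero or more labels.
sample_labels = [["Python", "Java"], [], ["C++"]]
label_matrix = mlb.fit_transform(sample_labels)
print(label_matrix[0])  # [1 1 0 0]

# Step 3: map a categorical field to integer codes (hypothetical values).
# OrdinalEncoder assigns codes in sorted order: "green" -> 0, "red" -> 1.
categories = np.array([["red"], ["green"], ["red"]])
encoder = OrdinalEncoder()
category_codes = encoder.fit_transform(categories)
```

A sample with no labels becomes an all-zero row, which matches the "zero or more labels" case described above.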

The resulting feature matrix looks something like this:

[
  [-1.02, 1.33, 2.35, -0.48, ... -4.11, 1, 0, 1, 1, 0, 0, ..., 1, 0, 235, 11.5, 333],
  [-0.22, 3.03, 1.95, -0.48, ... -4.11, 0, 1, 1, 1, 0, 0, ..., 0, 0, 233, 22, 333],
  [-2.07, -1.33, -2.35, -0.48, ... -4.11, 1, 1, 0, 1, 1, 0, ..., 1, 1, 102, 13, 333],
  [-4.32, 4.33, 1.75, -0.48, ... -4.11, 0, 0, 0, 1, 0, 1, ..., 1, 0, 98, 8, 333],
]

Should I apply any standardization or normalization to the dataset? If so, should I do it before or after concatenating the different parts of the feature vector?

I'm using scikit-learn, and the main algorithm I'm using will be Gradient Boosting.


Solution

  • Yes, and the features should be processed separately: apply standardization or normalization only to the original numeric features, before concatenating them with the rest. Don't scale the Doc2Vec vectors, the one-hot encoded labels, or the integer-encoded categorical features.
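
A minimal sketch of that advice, assuming the three feature blocks are available as separate NumPy arrays before concatenation. All array names, shapes, and values below are hypothetical stand-ins, and `StandardScaler` is just one possible choice of scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 6

# Hypothetical stand-ins for the feature blocks described in the question.
doc_vectors = rng.uniform(-5, 5, size=(n_samples, 100))       # Doc2Vec output
label_onehot = rng.integers(0, 2, size=(n_samples, 4))        # one-hot text labels
categorical_codes = rng.integers(0, 3, size=(n_samples, 1))   # integer-coded category
numeric = np.array([[235, 11.5], [233, 22.0], [102, 13.0],
                    [98, 8.0], [150, 9.5], [310, 17.0]])      # raw numeric fields

# Standardize only the raw numeric block (zero mean, unit variance per column).
scaler = StandardScaler()
numeric_scaled = scaler.fit_transform(numeric)

# Concatenate after scaling so the other blocks pass through untouched.
X = np.hstack([doc_vectors, label_onehot, categorical_codes, numeric_scaled])
print(X.shape)  # (6, 107)
```

In practice the scaler would be fitted on the training split only and then reused via `scaler.transform` on validation and test data. If the matrix has already been concatenated, scikit-learn's `ColumnTransformer` with `remainder="passthrough"` can apply the scaler to selected column indices instead; note that it places the transformed columns first in its output.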