I am trying to understand how features are interpreted by a decision tree after performing OneHotEncoding to convert categorical data.
Let's say the training data has 3 features (all categorical): X1, X2, X3.
X1 has 3 distinct values (a, b, c), X2 has 2 distinct values (e, f), and X3 has 4 distinct values (m, n, o, p).
After encoding with sparse=False, the resulting matrix will be of shape (X.shape[0], 9).
Now, while fitting the decision tree model, to calculate information gain, will the model consider this a training set of 9 features or of 3 features?
If 3, how will the model know the number of columns associated with each feature?
If 9, won't the features lose their importance?
In every case, your model will work with exactly what you give it:
If your dataset is of shape (X.shape[0], 9), it means that 9 features were generated from your 3 categorical ones. In that case, each feature becomes a boolean indicator (i.e. if the column corresponding to X1 and value "a" has a value of 1 in a row, it means that row had value "a" in X1).
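As an illustration, here is a minimal pure-Python sketch of that indicator representation (the category lists mirror the X1/X2/X3 example; `one_hot_row` is a hypothetical helper, not scikit-learn's API):

```python
# Categories per feature, matching the X1/X2/X3 example.
CATEGORIES = {
    "X1": ["a", "b", "c"],
    "X2": ["e", "f"],
    "X3": ["m", "n", "o", "p"],
}

def one_hot_row(row):
    """Encode one (X1, X2, X3) row as 3 + 2 + 4 = 9 boolean indicators."""
    encoded = []
    for feature, value in zip(["X1", "X2", "X3"], row):
        encoded.extend(1 if value == cat else 0 for cat in CATEGORIES[feature])
    return encoded

# "a" in X1, "f" in X2, "o" in X3 -> one 1 per original feature block.
print(one_hot_row(("a", "f", "o")))  # [1, 0, 0, 0, 1, 0, 0, 1, 0]
```

The tree then sees 9 independent columns; it has no notion that the first 3 of them came from the same original feature.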
In the other case, if your shape is (X.shape[0], 3), each column holds a numerical value per category (e.g. for X1: "a" = 0.33; "b" = 0.66; "c" = 1.0), effectively encoding the characters of your categorical variable.
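A sketch of that second representation (the 0.33/0.66/1.0 mapping is just the example values from above, not something scikit-learn computes for you):

```python
# Hypothetical numeric mapping for X1, as in the example above.
X1_MAP = {"a": 0.33, "b": 0.66, "c": 1.0}

def encode_x1(value):
    """Replace the category string with its numeric code: one column stays one column."""
    return X1_MAP[value]

print([encode_x1(v) for v in ["a", "c", "b", "a"]])  # [0.33, 1.0, 0.66, 0.33]
```

Here the dataset keeps its 3 columns, and each column now carries all the categories of one feature as distinct numbers.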
To answer your questions precisely:
Now, while fitting the decision tree model, to calculate information gain, will the model consider this a training set of 9 features or of 3 features?
The model will consider what you give it: if you pass data with shape (X.shape[0], 9), it will compute information gain on 9 features; if (X.shape[0], 3), it will compute information gain on 3 features.
If 3, how will the model know the number of columns associated with each feature?
The point of OneHotEncoding is only to convert a set of unique strings into a set of unique floats/ints. In essence, information gain doesn't care what your data looks like; it's just that the scikit-learn algorithm doesn't accept categorical variables. To information gain, your OneHotEncoding result has the same "value" as your original data.
If 9, won't the features lose their importance?
If your features are boolean indicators instead of encoded categorical columns, do you lose any information? In my view, no: you have just represented the information originally present in your dataset in another format!
In theory, with no limit on depth, both approaches should yield similar results, since you only changed the representation, not the relation between the data and the classes you are trying to learn.
The only difference is that, since the splitting rules consider only one feature at a time (which is the case in sklearn), a rule based on the value of an encoded feature (X1: "a" = 0.33; "b" = 0.66; "c" = 1.0) can separate more cases in one split than a boolean indicator can.
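To make that last point concrete, here is a small sketch with hypothetical labels on X3 (the entropy and gain functions are written by hand, the 0.25/0.5/0.75/1.0 encoding is an assumption): one threshold on the numerically encoded column separates {m, n} from {o, p} in a single split, while a boolean indicator column can only peel off one category at a time.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(xs, ys, test):
    """Information gain of splitting (xs, ys) with the boolean predicate `test`."""
    left = [y for x, y in zip(xs, ys) if test(x)]
    right = [y for x, y in zip(xs, ys) if not test(x)]
    n = len(ys)
    return entropy(ys) - (len(left) / n * entropy(left)
                          + len(right) / n * entropy(right))

# Hypothetical data: class 0 whenever X3 is in {m, n}, class 1 for {o, p}.
x3 = ["m", "n", "o", "p", "m", "o", "n", "p"]
ys = [0, 0, 1, 1, 0, 1, 0, 1]

# Numeric encoding: one threshold isolates {m, n} from {o, p} -> perfect split.
enc = {"m": 0.25, "n": 0.5, "o": 0.75, "p": 1.0}
x3_num = [enc[v] for v in x3]
print(info_gain(x3_num, ys, lambda x: x <= 0.5))  # full gain of 1.0

# Boolean indicator for "m": one split separates only {m} from the rest,
# so the n-rows stay mixed with the o/p-rows and the gain is smaller.
x3_is_m = [1 if v == "m" else 0 for v in x3]
print(info_gain(x3_is_m, ys, lambda x: x == 1))
```

With one-hot columns the tree needs two splits (on the "m" and "n" indicators) to reach the grouping the numeric threshold achieved in one, which is why the trees can differ in shape even when they end up equally expressive.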