Feature selection from a mixed dataset

I am a newbie in the data science domain.

I have a data set that contains both numerical and string data. Interestingly, both types of data are relevant to the outcome. How do I choose the relevant features from the data set?

Should I use LabelEncoder to convert the string data to numerical values and then continue with correlation? Am I on the right path? Is there a better way to solve this problem?

Solution

You can encode categorical variables with label encoding if the available values have a meaningful ordering, as long as you make sure that ordering is retained in the encoding. See here for an example.
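As a rough sketch of order-preserving encoding, assuming pandas and scikit-learn are available: scikit-learn's LabelEncoder assigns codes in sorted label order, so it will not keep a custom ranking by itself. One option is OrdinalEncoder with the order passed explicitly. The column name `size` and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with a natural order: low < medium < high
df = pd.DataFrame({"size": ["medium", "low", "high", "low"]})

# Pass the order explicitly. Without it, categories are sorted
# alphabetically (high, low, medium) and the intended ranking is lost.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(df)
```

Here `low`, `medium` and `high` map to 0, 1 and 2, so the numeric codes follow the intended ranking.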

If there's no ordering (or working out a meaningful one is too much work), you can use one-hot encoding. This, however, will grow the feature set in proportion to the number of distinct values the feature takes in the dataset.
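A minimal sketch of one-hot encoding, assuming pandas; the `color` and `price` columns are invented for the example. It also shows the growth in width: one column with three distinct values becomes three columns.

```python
import pandas as pd

# Hypothetical data: one unordered categorical column, one numeric column
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [1.0, 2.5, 3.0, 2.0],
})

# Each distinct value of "color" becomes its own indicator column
encoded = pd.get_dummies(df, columns=["color"])

print(encoded.columns.tolist())
print(encoded.shape)
```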

If one-hot results in a very large feature set and the categorical string data are natural language words, you may want to use a pretrained embedding.

Either way, you can then concatenate the encoded categorical column(s) to the continuous feature set and proceed with learning and feature selection.
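The last step might look something like the sketch below, assuming pandas and scikit-learn. The data, the column names, and the choice of SelectKBest with the ANOVA F-test (`f_classif`) are all assumptions made for illustration; any other selection method could take its place.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical mixed dataset with a binary target
df = pd.DataFrame({
    "age": [22, 25, 47, 52, 46, 56, 23, 30],
    "city": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "plan": ["basic", "basic", "pro", "pro", "pro", "basic", "basic", "pro"],
    "target": [0, 0, 1, 1, 1, 1, 0, 0],
})

# One-hot encode the string columns; drop_first avoids keeping two
# perfectly redundant indicator columns for a two-valued feature.
encoded = pd.get_dummies(df[["city", "plan"]], drop_first=True, dtype=float)

# Concatenate the encoded columns with the continuous feature set
X = pd.concat([df[["age"]], encoded], axis=1)
y = df["target"]

# Keep the two features with the highest F-score against the target
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])

print(selected)
```

In this toy data `city` is split evenly across both target classes, so it scores lowest and is dropped, while `age` and `plan_pro` are kept.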