machine-learning, scikit-learn, label-encoding

Difference between ordinal and categorical data as labels in scikit-learn


I know that, as features, ordinal data can be mapped to numbers and categorical data can be one-hot encoded. But I am a bit confused about how these two types of data should be handled when they are the target to be predicted. For instance, in the iris dataset in scikit-learn:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data    # feature matrix
y = iris.target  # integer class labels 0, 1, 2

Although y represents three types of flowers, which is categorical data (if I'm not wrong), it is encoded as the integers 0, 1, 2 (dtype int32), which look like ordinal values. My dataset also has three independent categories ('sick', 'carrier', 'healthy'), and scikit-learn accepts them as strings without any encoding.

Is it correct to pass them to scikit-learn as they are, or is an integer encoding like the one used for the iris dataset required?


Solution

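  • It seems that in ML a target is either continuous, in which case it is handled by regression models, or categorical, in which case it is handled by classification models. There is no separate category for ordinal data. scikit-learn classifiers treat every distinct label as an unordered class, whether it is an integer or a string, so the 0, 1, 2 in iris are only class identifiers and your string labels can be kept as they are.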
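
As a rough check of the point above, here is a minimal sketch that fits the same classifier once on string labels and once on integer-encoded labels. The toy feature values and the choice of DecisionTreeClassifier are assumptions made for illustration only; they are not taken from the question's dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature, three string classes.
X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array(["sick", "sick", "carrier", "carrier", "healthy", "healthy"])

# Fit directly on the string labels; no manual encoding is needed.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
classes = list(clf.classes_)  # distinct labels, stored in sorted order
pred = clf.predict(X)         # predictions come back as strings

# The same labels encoded as integers, as in the iris target.
encoder = LabelEncoder()
y_int = encoder.fit_transform(y)  # carrier -> 0, healthy -> 1, sick -> 2
clf_int = DecisionTreeClassifier(random_state=0).fit(X, y_int)
pred_int = encoder.inverse_transform(clf_int.predict(X))

# Both models should make the same predictions.
same = bool((pred == pred_int).all())
print(classes, same)
```

The integers produced by LabelEncoder follow the alphabetical order of the labels, not any meaningful ranking, which is why a classifier can ignore their order.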