machine-learning classification categorical-data supervised-learning

Numerical and categorical features in a classification problem


I have a classification problem: predicting hotel cancellations (in Python).

I'm stuck on one of the first steps.

I have some variables regarding hotel reservations, and some of them are:

  • ArrivalDateYear: Year of the arrival date
  • ArrivalDateWeekNumber: Week number of the arrival date
  • ArrivalDateDayOfMonth: Day of the month of the arrival date

'ArrivalDateYear' takes only 3 distinct years, so I assume I should handle it as a 'categorical' (non-metric) feature.

For the other two variables I'm stuck: one has 31 days and the other has xx weeks. Should I treat them as 'numerical' features? Should I just ignore them? Or should I handle them as 'categorical' features?

For the programming part, should I write: data['ArrivalDateYear'] = data['ArrivalDateYear'].astype('category') (...)?

Is there any other way to handle 'month', 'days', etc. variables in a simple supervised machine learning problem?


Solution

  • I would make these variables categorical. There are two approaches you can try, depending on your case:

    • Convert them into separate binary (one-hot) variables, one per unique week number or day of the month. This helps when the relationship between the date and the target is non-linear.

    • Derive new features from them, for example "IsWeekend" or "IsHoliday". In my opinion, this will be more helpful.
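
A minimal sketch of both approaches with pandas, on a toy frame whose values are invented for illustration. The column "ArrivalDateMonth" (a numeric month) is an assumption: it does not appear in the question, but some month information is needed to rebuild a full calendar date before a weekend flag can be computed. Adapt the names to your actual data.

```python
import pandas as pd

# Toy data; the values are invented. "ArrivalDateMonth" is a hypothetical
# numeric month column, assumed so that a full date can be reconstructed.
data = pd.DataFrame({
    "ArrivalDateYear": [2015, 2016, 2017, 2016],
    "ArrivalDateMonth": [7, 12, 1, 4],
    "ArrivalDateWeekNumber": [27, 53, 1, 14],
    "ArrivalDateDayOfMonth": [4, 31, 2, 5],
})

# Approach 1: one binary column per distinct year, week number and day.
cat_cols = ["ArrivalDateYear", "ArrivalDateWeekNumber", "ArrivalDateDayOfMonth"]
encoded = pd.get_dummies(data, columns=cat_cols)

# Approach 2: rebuild the arrival date, then derive a weekend flag.
# pd.to_datetime accepts a frame with "year", "month" and "day" columns.
arrival = pd.to_datetime(
    data[["ArrivalDateYear", "ArrivalDateMonth", "ArrivalDateDayOfMonth"]].rename(
        columns={
            "ArrivalDateYear": "year",
            "ArrivalDateMonth": "month",
            "ArrivalDateDayOfMonth": "day",
        }
    )
)
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
data["IsWeekend"] = arrival.dt.dayofweek >= 5

print(encoded.columns.tolist())
print(data[["ArrivalDateYear", "ArrivalDateDayOfMonth", "IsWeekend"]])
```

One-hot encoding 31 days and up to 53 weeks adds many sparse columns, which is part of why a handful of derived flags can be the more practical choice. An "IsHoliday" flag would need a holiday calendar for the hotel's country, which is not sketched here.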