python, r, machine-learning, logistic-regression, feature-selection

How to handle date variable in machine learning data pre-processing


I have a dataset that contains, among other variables, the timestamp of each transaction in the format 26-09-2017 15:29:32. I need to find possible correlations and make predictions of sales (let's say with logistic regression). My questions are:

  1. How should I handle the date format? Should I convert it to a single number (as Excel does automatically)? Should I split it into several variables such as day, month, year, hour, minutes, and seconds? Any other suggestions?
  2. What if I would like to add a distinct week number per year? Should I add a variable like 342017 (week 34 of year 2017)?
  3. Should I do the same as in question 2 for the quarter of the year?

#    Datetime               Gender    Purchase
1    23/09/2015 00:00:00    0         1
2    23/09/2015 01:00:00    1         0
3    25/09/2015 02:00:00    1         0
4    27/09/2015 03:00:00    1         1
5    28/09/2015 04:00:00    0         0

Solution

  • Some random thoughts:

    Dates are a good source for feature engineering, and I don't think there is a single correct way to use them in a model. Business user expertise would be valuable here: are there observed trends that can be coded into the data?

    Possible suggestions of features include:

    • weekends vs weekdays
    • business hours and time of day
    • seasons
    • week of year number
    • month
    • year
    • beginning/end of month (pay days)
    • quarter
    • days to/from an action event (distance)
    • missing or incomplete data
    • etc.

    Which of these are useful depends on the data set, and most won't apply to any given problem.
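
As a rough illustration of the feature ideas listed above, here is a minimal pandas sketch that uses the sample rows from the question. The helper name `add_date_features`, the derived column names, the 09:00 to 17:00 business-hours window, and the use of the first observed timestamp as the reference "event" are all assumptions made for the example, not part of the original answer. The `year_week` column encodes the week as ISO year * 100 + ISO week (for example 201539), which sorts chronologically, unlike the week-first form 342017 suggested in the question.

```python
import pandas as pd

# Sample rows copied from the question (day-first timestamps).
df = pd.DataFrame({
    "Datetime": [
        "23/09/2015 00:00:00",
        "23/09/2015 01:00:00",
        "25/09/2015 02:00:00",
        "27/09/2015 03:00:00",
        "28/09/2015 04:00:00",
    ],
    "Gender": [0, 1, 1, 1, 0],
    "Purchase": [1, 0, 0, 1, 0],
})


def add_date_features(frame, column="Datetime", fmt="%d/%m/%Y %H:%M:%S"):
    """Return a copy of `frame` with illustrative date-derived columns.

    Hypothetical helper: the column names and thresholds are assumptions.
    """
    out = frame.copy()
    # Parse with an explicit format so day and month are never swapped.
    ts = pd.to_datetime(out[column], format=fmt)
    iso = ts.dt.isocalendar()  # ISO year/week/day; requires pandas >= 1.1

    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["quarter"] = ts.dt.quarter
    out["week_of_year"] = iso["week"].astype(int)
    # ISO year * 100 + ISO week, e.g. 201539; sorts chronologically.
    out["year_week"] = (iso["year"] * 100 + iso["week"]).astype(int)
    out["hour"] = ts.dt.hour
    # dayofweek: Monday = 0 ... Sunday = 6
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    # Assumed business hours: hour between 9 and 17 inclusive.
    out["is_business_hours"] = ts.dt.hour.between(9, 17).astype(int)
    # Calendar first/last day of the month as a crude pay-day proxy.
    out["is_month_start"] = ts.dt.is_month_start.astype(int)
    out["is_month_end"] = ts.dt.is_month_end.astype(int)
    # Whole days elapsed since the first transaction (assumed reference event).
    out["days_since_first"] = (ts - ts.min()).dt.days
    return out


features = add_date_features(df)
print(features[["Datetime", "year_week", "is_weekend", "days_since_first"]])
```

Note that codes such as `year_week` or `quarter` are labels rather than quantities, so for a logistic regression it is probably worth considering whether to treat them as categorical (for example, one-hot encoded) instead of feeding them in as plain numbers.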

    Some links:

    http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction

    https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models

    http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/