amazon-web-services machine-learning amazon-machine-learning

AWS Machine Learning Data


I'm using AWS Machine Learning regression to predict the waiting time in a restaurant's line for a specific weekday and time. I currently have around 800k records.

Example Data:

restaurantID (rowID)   weekDay (categorical)   time (categorical)   tablePeople (numeric)   waitingTime (numeric - target)
1                      sun                     21:29                2                       23
2                      fri                     20:13                4                       43
...


I have two questions:

1) Should I use time as Categorical or Numeric? Would it be better to split it into two fields, hours and minutes?

2) I would like to get predictions for all my restaurants from the same model.

Example: I expected to send the rowID identifier and get back different predictions based on each restaurant's data (ignoring the other restaurants' data).

I tried, but it's returning the same prediction for any rowID. Why?

Should I have a model for each restaurant?


Solution

  • There are several problems with the way you set up your model:

    1) Time in the form you have it should never be categorical. Your model treats the times 12:29 and 12:30 as two completely independent attributes, so it will never use what it learned about 12:29 to predict what is going to happen at 12:30. In your case you should set time to be numeric. I'm not sure whether Amazon ML can convert it for you automatically. If not, just multiply the hour by 60 and add the minutes to it. Another interesting thing to do is to bucketize your time by selecting which half hour (or wider interval) it falls into. You do this by dividing (h*60+m) by the bucket width in minutes and keeping the integer part; the width depends on how many buckets you want. For example, divide by 120 to get 2-hour intervals. Generally, the more data you have, the smaller the intervals can be. The key is to have a lot of samples in each bucket.

    2) You should really think about removing restaurantID from your input data. Having it there will cause the model to overfit on it, so it will not be able to make predictions about the restaurant with id:5 based on what it learned from the restaurants with id:3 or id:9. Keeping the restaurant id might be okay if you have a lot of data about each restaurant and you don't care about extrapolating your predictions to restaurants that are not in the training set.

    3) You don't send restaurantID to get predictions about that restaurant. The way it usually works is that you pick what you are trying to predict. In your case 'waitingTime' is probably the most useful attribute. So you send weekDay, time and the number of people, and the model outputs the waiting time.
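
The time handling in point 1 and the input shape in point 3 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Amazon ML's own preprocessing: the helper names and the `timeMinutes` and `timeBucket` column names are hypothetical, and the 120-minute bucket width is simply the 2-hour example from above.

```python
def time_to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight (h*60+m)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def time_bucket(hhmm: str, bucket_minutes: int = 120) -> int:
    """Return the index of the bucket (of the given width) containing hhmm."""
    # Integer division maps every time inside one interval to the same index.
    return time_to_minutes(hhmm) // bucket_minutes


def build_record(week_day: str, hhmm: str, table_people: int) -> dict:
    """Build the input features for one prediction.

    restaurantID is deliberately left out, and waitingTime is the target,
    so neither appears here. Column names are hypothetical.
    """
    return {
        "weekDay": week_day,
        "timeMinutes": time_to_minutes(hhmm),
        "timeBucket": time_bucket(hhmm),
        "tablePeople": table_people,
    }


print(time_to_minutes("21:29"))   # 1289
print(time_bucket("21:29"))       # 10
print(build_record("sun", "21:29", 2))
```

With 2-hour buckets, 20:13 and 21:29 both land in bucket 10, so the model can share what it learns between them. Whether to feed the raw minutes, the bucket, or both is something to try against your own data.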