machine-learning, dataset, grid-search

Will parameters selected on a small dataset also suit the full dataset?


Parameter selection by grid search has to try every combination of candidate values. For example, to tune NumOfTrees and MaxDepth for a Random Forest, if I have M1 choices for the first parameter and M2 choices for the second, the search covers M1*M2 combinations.

So a parameter search on the full dataset is expensive if the dataset is very big.
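To make the cost concrete, here is a minimal sketch that enumerates such a grid using only the Python standard library. The candidate values and the 5-fold cross-validation count are illustrative assumptions, not settings from any real experiment.

```python
from itertools import product

# Hypothetical candidate values; the names mirror the two parameters above.
num_trees_choices = [50, 100, 200, 400]  # M1 = 4 choices
max_depth_choices = [4, 8, 16]           # M2 = 3 choices

# Grid search evaluates every (NumOfTrees, MaxDepth) pair.
grid = list(product(num_trees_choices, max_depth_choices))

# Assumption: each combination is scored with k-fold cross-validation,
# so the number of model fits is M1 * M2 * k.
n_folds = 5
total_fits = len(grid) * n_folds

print(f"{len(grid)} combinations, {total_fits} model fits")
```

Every one of those fits runs over the training data, which is why the size of the dataset multiplies directly into the total cost.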

My question is whether I could run the parameter search on a smaller dataset (for example, 30 days of data when the full dataset covers 180 days) and treat the selected parameters as the best ones for the full dataset as well. If not, how much would they differ? Thanks.


Solution

  • That depends on whether your 30 days of data are representative of the entire time span. In other words, your target should have a similar distribution over the input features for i) the 30 days you use for parameter selection and ii) the future period you would like to predict.

    For example, the following case won't work:

    Your data could have some kind of seasonality. September customer purchase data won't be good for tuning parameters to predict Christmas-season customer transactions. During the Christmas season, traffic is usually significantly higher and the types/categories of products purchased are very different.
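One rough way to check the representativeness described above is to compare the distribution of some quantity (the target, or an important feature) between the tuning window and the period you want to predict. The sketch below does this with a hand-written two-sample Kolmogorov-Smirnov statistic on synthetic daily order counts. The data, the window boundaries, and the choice of this statistic are all assumptions made for illustration. It is not a complete test of representativeness, since it looks at only one variable at a time.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at a data point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Synthetic daily order counts (illustrative only): 150 ordinary days
# followed by 30 holiday-season days with higher, more variable volume.
random.seed(0)
ordinary = [random.gauss(100, 10) for _ in range(150)]
holiday = [random.gauss(180, 25) for _ in range(30)]

tuning_window = ordinary[:30]   # the 30 days used for parameter search
similar_period = ordinary[30:]  # other days with no seasonal shift

ks_similar = ks_statistic(tuning_window, similar_period)
ks_holiday = ks_statistic(tuning_window, holiday)
print(f"tuning window vs. similar period: {ks_similar:.2f}")
print(f"tuning window vs. holiday season: {ks_holiday:.2f}")
```

A much larger statistic for the holiday period than for the similar period would suggest the 30-day window does not represent the data you intend to predict, so parameters tuned on it may not transfer.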