Tags: python, tabular, fastai

fastai tabular model - how to get predictions for new data?


I am using the Kaggle house prices dataset, which is split into train and test sets.

  • I built a model with fastai tabular using the train set
  • How can I predict values for the test set?

I know it sounds easy; most other libraries would do it with model.predict(test), but that is not the case here. I have searched the fastai forums, SO, and the docs. There are quite a few topics on this issue, and most of them either have no answer or rely on outdated workarounds (fastai2 was released recently and is now called just fastai).

a. model.predict only works on a single row, and looping through test is not optimal. It is very slow.

b. model.get_preds returns results for the data you trained on.

Please suggest how to predict on a new df with a trained tabular learner.


Solution

  • I found the problem. For future readers: why doesn't get_preds work on a new df?

    (tested on Kaggle's House Prices: Advanced Regression Techniques)

    The root of the problem was NaNs in the categorical columns. If you train your model with one set of cat features, say color = red, green, blue, and your new df has colors red, green, blue, black, it will throw an error because it doesn't know what to do with the new class (black). Not to mention you need the same columns everywhere, which can be tricky: the FillMissing proc (which I used) adds new boolean columns recording whether a value was missing, and those become extra cat columns. So triple-check the NaNs in your cat columns. I really wanted to make it work start to finish with fastai.
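
    Before anything else, it helps to see which classes appear in test but not in train. A minimal sketch with toy frames (the frame contents and the helper name are mine, not the actual Kaggle data):

```python
import pandas as pd

# Toy stand-ins for the Kaggle train/test frames (hypothetical data)
train = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
test = pd.DataFrame({"color": ["red", "blue", "black"]})

def unseen_categories(train_df, test_df, cat_cols):
    """Return {column: classes present in test but absent from train}."""
    gaps = {}
    for col in cat_cols:
        extra = set(test_df[col].dropna()) - set(train_df[col].dropna())
        if extra:
            gaps[col] = extra
    return gaps

print(unseen_categories(train, test, ["color"]))  # {'color': {'black'}}
```

    Any non-empty result flags a column that get_preds will choke on.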

    Columns for train/test are identical; train just has one extra, the target. At this point some cat columns have different classes in train and test. I decided to combine the two frames (just to make it work), but doesn't that introduce leakage?

    combined = pd.concat([train, test]) # test will have nans at target, but we don't care
    cont_cols, cat_cols = cont_cat_split(combined, max_card=50)
    combined = combined[cat_cols]
    

    Some tweaking while we're at it:

    train[cont_cols] = train[cont_cols].astype('float') # if target is not float, there will be an error later
    test[cont_cols[:-1]] = test[cont_cols[:-1]].astype('float'); # slice target off (I had mine at the end of cont_cols)
    

    Then we feed it into TabularPandas:

    procs = [Categorify, FillMissing]
    
    to = TabularPandas(combined,
                       procs = procs,
                       cat_names = cat_cols)
    
    train_to_cat = to.items.iloc[:train.shape[0], :] # transformed cat for train
    test_to_cat = to.items.iloc[train.shape[0]:, :] # transformed cat for test. Need to separate them
    

    to.items gives us the transformed cat columns. After that, we need to assemble everything back together:

    train_imp = pd.concat([train_to_cat, train[cont_cols]], axis=1) # assemble new cat and old cont together
    test_imp = pd.concat([test_to_cat, test[cont_cols[:-1]]], axis=1) # exclude SalePrice
    
    train_imp['SalePrice'] = np.log(train_imp['SalePrice']) # metric for kaggle
    

    After that, we do as per fastai tutorial.

    dep_var = 'SalePrice'
    procs = [Categorify, FillMissing, Normalize]
    splits = RandomSplitter(valid_pct=0.2)(range_of(train_imp))
    
    to = TabularPandas(train_imp, 
                       procs = procs,
                       cat_names = cat_cols,
                       cont_names = cont_cols[:-1], # we need to exclude target
                       y_names = 'SalePrice',
                       splits=splits)
    
    dls = to.dataloaders(bs=64)
    
    learn = tabular_learner(dls, n_out=1, loss_func=F.mse_loss)
    learn.lr_find()
    
    learn.fit_one_cycle(20, slice(1e-2, 1e-1), cbs=[ShowGraphCallback()])
    

    At this point, we have a learner but still can't predict. I thought after we do:

    dl = learn.dls.test_dl(test_imp, bs=64)
    preds, _ = learn.get_preds(dl=dl) # get prediction
    

    it would just work (preprocess the cont values and predict), but no: it does not fill the NaNs. So find and fill the NaNs in test yourself:

    missing = test_imp.isnull().sum().sort_values(ascending=False).head(12).index.tolist()
    for c in missing:
        test_imp[c] = test_imp[c].fillna(test_imp[c].median())
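
    The head(12) above is specific to my run; a more general sketch (the helper name is mine) fills whatever NaNs remain, using the median for numeric columns and the mode for anything else:

```python
import numpy as np
import pandas as pd

def fill_remaining_nans(df):
    """Return a copy with numeric NaNs filled by the column median
    and non-numeric NaNs filled by the column mode."""
    out = df.copy()
    for col in out.columns[out.isnull().any()]:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical frame with one numeric and one string NaN
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "x"]})
filled = fill_remaining_nans(df)  # a -> [1.0, 2.0, 3.0], b -> ['x', 'x', 'x']
```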
    

    after that we can finally predict:

    dl = learn.dls.test_dl(test_imp, bs=64)
    preds, _ = learn.get_preds(dl=dl) # get prediction
    
    final_preds = np.exp(preds.flatten()).tolist()
    
    sub = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
    sub.SalePrice = final_preds
    
    filename = 'submission.csv'
    sub.to_csv(filename, index=False)
    

    Apologies for the long narrative, but I'm relatively new to coding and this problem was hard to pin down. There is very little info online on how to solve it. In short, it was a pain.

    Unfortunately, this is still a workaround. If the set of classes in any feature differs between train and test, it will break. It is also strange that it didn't fill the NaNs while fitting test into the dls.
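
    One possible way around the unseen-class failure (an idea I haven't verified end to end in the fastai pipeline; the helper name is mine): cast test's cat columns to the category sets seen in train, so unseen classes become NaN and can be handled like any other missing value instead of raising an error:

```python
import pandas as pd

def align_to_train_categories(train_df, test_df, cat_cols):
    """Return a copy of test_df whose cat columns use train's category sets;
    classes unseen in train become NaN instead of raising an error."""
    out = test_df.copy()
    for col in cat_cols:
        cats = sorted(set(train_df[col].dropna()))
        out[col] = pd.Categorical(out[col], categories=cats)
    return out

# Toy frames (hypothetical data)
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "black"]})
aligned = align_to_train_categories(train, test, ["color"])
# 'black' was never seen in train, so it is now NaN
```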

    Should you have any suggestions you are willing to share, please let me know.