Tags: regression, data-science, prediction, catboost, catboostregressor

Sales Prediction error on catboost regression/CatBoostRegressor


import pandas as pd

d = {'customer':['A','B','C','A'],'season':[1,2,3,4],
'cat1': ['BAGS','TSHIRT','DRESS','BELT'],
'cat2': ['high','low','high','medium'],'sale': [10,20,15,50]}
df = pd.DataFrame(data=d)
df

Desired output for season 5:

d = {'customer':['A','B','C','A'],'season': [5,5,5,5],
'cat1': ['BAGS','TSHIRT','DRESS','BELT'],
'cat2': ['high','low','high','medium'],'sale': [?,?,?,?]}
df = pd.DataFrame(data=d)
df

I tried

import numpy as np
from sklearn.model_selection import train_test_split
from catboost import Pool, CatBoostRegressor

# Total sales per customer / season / category combination
df = df.groupby(['customer','season','cat1','cat2'])['sale'].sum().sort_values(ascending=False).reset_index()
X = df[['customer','season','cat1','cat2']].copy()  # copy avoids SettingWithCopyWarning
y = df[['sale']]

X['season'] = X['season'].astype(float)  # season is the only numeric feature
# Hold out 10% for testing, then split the remainder 85/15 into train/validation
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.90, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.85, random_state=42)
# Every non-float column is treated as categorical (np.float was removed from NumPy; use float)
categorical_features_indices = np.where(X.dtypes != float)[0]
train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features_indices)
val_pool = Pool(data=X_val, label=y_val, cat_features=categorical_features_indices)
test_pool = Pool(data=X_test, label=y_test, cat_features=categorical_features_indices)


params = {
   'iterations':900,
   'loss_function': 'RMSE',
   'learning_rate': 0.0109, #1 0.102,
   'depth': 6,
   'l2_leaf_reg': 6,
   
   'border_count': 7,
   'thread_count': 7,
   
   'bagging_temperature': 2,
   'random_strength': 2.23,
   'colsample_bylevel': 0.85,
   
   'custom_metric': ['MAPE', 'R2'], 
   'eval_metric': 'R2', 
   'random_seed': 41,
   
   'max_ctr_complexity': 2,
   'logging_level': 'Silent',
   'use_best_model':False # Takes
}


reg_model = CatBoostRegressor(**params)
reg_model.fit(train_pool, eval_set=val_pool, plot=True, verbose=100)



X['season']=5
X['Predict_sales']=reg_model.predict(X)

The above code throws no error.

My question is: the predicted values don't change whether I input season 5, 6, 7, or 8, even though season is a continuous value. What am I doing wrong, and how can I predict for seasons 6, 7, 8, and so on?


Solution

  • CatBoost is a tree-based model. Regression trees (like decision trees in general) partition the feature space, and every point in a partition gets the same predicted value. Since none of the seasons 5, 6, 7, or 8 occurred in the training data, for any given row they all fall into the same partition (the one beyond the largest split threshold on season) and therefore yield exactly the same prediction.
    You might need a different model type (e.g. linear regression). What kind of relationship would you expect between season and sales? Predicting for values you haven't seen in your training data is always hard, unless there is something like a linear relationship that the model can extrapolate.
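
To make the point above concrete, here is a minimal sketch in plain Python. It does not use CatBoost and is not how CatBoost builds its trees. `fit_tree`, `predict_tree` and `fit_line` are hypothetical helpers written only for this illustration, and the toy data reuses the season and sale values from the question while ignoring the categorical columns. It fits a tiny one-feature regression tree and a least-squares line, then predicts for seasons 5 to 8.

```python
def fit_tree(xs, ys, depth=0, max_depth=3):
    """Fit a tiny 1-D regression tree. A leaf is a float, a split is (threshold, left, right)."""
    mean = sum(ys) / len(ys)
    values = sorted(set(xs))
    if depth == max_depth or len(values) == 1:
        return mean  # leaf: predict the mean target of the points that landed here

    def sse(group):
        m = sum(group) / len(group)
        return sum((y - m) ** 2 for y in group)

    # Try a threshold halfway between each pair of neighbouring feature values
    best_thr, best_err = None, None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        err = sse(left) + sse(right)
        if best_err is None or err < best_err:
            best_thr, best_err = thr, err

    left_pts = [(x, y) for x, y in zip(xs, ys) if x <= best_thr]
    right_pts = [(x, y) for x, y in zip(xs, ys) if x > best_thr]
    return (
        best_thr,
        fit_tree([x for x, _ in left_pts], [y for _, y in left_pts], depth + 1, max_depth),
        fit_tree([x for x, _ in right_pts], [y for _, y in right_pts], depth + 1, max_depth),
    )


def predict_tree(node, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(node, tuple):
        thr, left, right = node
        node = left if x <= thr else right
    return node


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


seasons = [1, 2, 3, 4]      # seasons seen during training
sales = [10, 20, 15, 50]    # toy targets taken from the question
future = [5, 6, 7, 8]       # seasons never seen during training

tree = fit_tree(seasons, sales)
slope, intercept = fit_line(seasons, sales)

tree_preds = [predict_tree(tree, s) for s in future]
line_preds = [slope * s + intercept for s in future]

print("tree:", tree_preds)  # every future season falls into the same leaf
print("line:", line_preds)  # the line keeps extrapolating the fitted trend
```

The tree returns one identical value for every future season because all of them lie beyond its last split threshold, while the line produces a different value for each season. Whether a linear trend is a sensible assumption for your sales is a separate question that the data has to answer.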