Tags: python, python-3.x, pandas, catboost

CatBoost border


I can't start training a CatBoost model: fitting fails with an error about the target border of 0.5.

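import pandas
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

X = pandas.read_csv("../input/x_y_test/X.csv")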
X_test = pandas.read_csv("../input/x_y_test/X_test.csv")
y = pandas.read_csv("../input/y-data/y.csv")

X = X.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)
y = y.reset_index(drop = True)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = .3, random_state = 1337)

X_train = X_train.reset_index(drop = True) 
X_val = X_val.reset_index(drop = True)
y_train = y_train.reset_index(drop = True)
y_val = y_val.reset_index(drop = True)

model_cb = CatBoostClassifier(eval_metric = "Accuracy", n_estimators = 1200, random_seed = 70)
model_cb.fit(X_train, y_train, eval_set = (X_val, y_val), use_best_model = True)

When I run this, I get:

CatboostError: catboost/libs/metrics/metric.cpp:3929: All train targets are greater than border 0.5

The data is available here:

https://drive.google.com/drive/folders/1m7bNIs0mZQQkAsvkETB3n6j62p9QJX39?usp=sharing


Solution

  • The main problem is that you are passing y_train to the model as a two-column DataFrame:

        id  skilled
    0   138177  0
    1   36214   0
    2   103206  1
    3   22699   1
    4   96145   1
    

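    The id column is part of the target here, and every id is greater than 0.5, which is most likely why CatBoost complains that all train targets are above the border. I believe what you really intended to pass was just y_train.skilled.

    Reassign the targets as shown below before fitting and you should be good to go: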

    y_train = y_train.skilled  # keep only the label column
    y_val = y_val.skilled  # same for the validation targets
    
    model_cb = CatBoostClassifier(eval_metric = "Accuracy", n_estimators = 1200, random_seed = 70)
    model_cb.fit(X_train, y_train, eval_set = (X_val, y_val), use_best_model = True)
    

    On a side note, do you really believe that id in X_train has any predictive power? Why not drop it from the features as well?
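
Both suggestions above (keep only the label column, drop id from the features) can be combined into one small helper. This is only a sketch: the function name split_target_and_features and the demo frames are made up for illustration, and it assumes the columns are called id and skilled as in the output shown above.

```python
import pandas as pd


def split_target_and_features(X, y, target_col="skilled", id_col="id"):
    """Return (features without the id column, 1-D target Series).

    Hypothetical helper; column names are assumed from the question's data.
    """
    # Keep only the label column, so the ids are never treated as targets.
    target = y[target_col]
    # Drop the identifier from the features if it is present.
    features = X.drop(columns=[id_col], errors="ignore")
    # Sanity check: a binary classifier needs both classes in the targets.
    if target.nunique() < 2:
        raise ValueError(f"target column {target_col!r} contains a single class")
    return features, target


# Tiny made-up frames that mimic the layout above (values are illustrative only).
X_demo = pd.DataFrame({"id": [138177, 36214, 103206, 22699],
                       "feature_a": [0.1, 0.4, 0.3, 0.9]})
y_demo = pd.DataFrame({"id": [138177, 36214, 103206, 22699],
                       "skilled": [0, 0, 1, 1]})

features_demo, target_demo = split_target_and_features(X_demo, y_demo)
print(list(features_demo.columns))  # ['feature_a']
print(target_demo.ndim)             # 1
```

The returned features and target can then go through train_test_split and CatBoostClassifier.fit in the same way as in the snippet above.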