Tags: python, scikit-learn, jupyter-notebook, classification, joblib

Discretize continuous target variable using sklearn


I have to discretize a continuous target variable into at least 5 bins in order to lower the complexity of a classification model built with the sklearn library.

To do this, I've used KBinsDiscretizer, but now that I've discretized the target variable I don't know how to split the dataset into balanced parts. This is my code:

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X = df.copy()
y = X.pop('shares')  # continuous target

# scale all features into the same range
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)

# discretize the target into 5 equal-width bins
discretizer = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
y_discretized = discretizer.fit_transform(y.values.reshape(-1, 1))

# is this correct?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, stratify=y_discretized)

For completeness, I'm trying to recreate a less complex model than the one shown in: [1] K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.


Solution

  • Your y_train and y_test are parts of y, which (it seems) still holds the original continuous values. So you end up fitting a multiclass classification model with potentially a huge number of distinct classes, which is likely what causes the crashes.

    I assume what you wanted is

    X_train, X_test, y_train, y_test = train_test_split(X, y_discretized, test_size=0.33, shuffle=True, stratify=y_discretized)
    

    Whether discretizing a continuous target to turn a regression into a classification problem is a good idea is a topic for another site; see e.g. https://datascience.stackexchange.com/q/90297/55122
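
    A runnable sketch of the corrected approach on synthetic data (the real `df` and its `shares` column aren't available here, so random skewed values stand in for them). One hypothetical tweak beyond the original post: `strategy='quantile'` instead of `'uniform'`, since quantile binning gives equal-frequency bins, which matches the goal of "balanced parts" when the target is skewed:

    ```python
    import numpy as np
    from sklearn import preprocessing
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = rng.lognormal(mean=8, sigma=1, size=1000)  # skewed, like share counts

    # 'quantile' yields equal-frequency bins; 'uniform' can leave some bins
    # nearly empty when the target distribution is skewed.
    discretizer = preprocessing.KBinsDiscretizer(
        n_bins=5, encode='ordinal', strategy='quantile')
    y_binned = discretizer.fit_transform(y.reshape(-1, 1)).ravel()

    # split on the binned target, stratifying on it as well
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_binned, test_size=0.33, shuffle=True, stratify=y_binned)

    # each bin's share of the training split should be close to 1/5
    train_props = np.bincount(y_train.astype(int)) / len(y_train)
    print(train_props)
    ```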