I have to discretize into at least 5 bins a continuous target variable in order to lower the complexity of a classification model using the sklearn library
In order to do this, I've used the KBinsDiscretizer but I don't know how can I split in balanced parts the dataset now that I've discretized the target variable. This is my code:
X = df.copy()
y = X.pop('shares')
# scaling the dataset so all data in the same range
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)
discretizer = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
y_discretized = discretizer.fit_transform(y.values.reshape(-1, 1))
# is this correct?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, stratify=y_discretized)
For completeness, I'm trying to recreate a less complex model than the one showed in: [1] K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal
Your y_train
and y_test
are parts of y
, which has (it seems) the original continuous values. So you're ending up fitting multiclass classification models, with probably lots of different classes, which likely causes the crashes.
I assume what you wanted is
X_train, X_test, y_train, y_test = train_test_split(X, y_discretized, test_size=0.33, shuffle=True, stratify=y_discretized)
Whether discretizing a continuous target to turn a regression into a classification is a topic for another site, see e.g. https://datascience.stackexchange.com/q/90297/55122