Tags: python, scikit-learn, vectorization, logistic-regression, text-classification

Oversampling after splitting the dataset - Text classification


I am having some issues with the steps to follow to over-sample a dataset for text classification. What I have done so far is the following:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Separate input features and target
y_up = df.Label
X_up = df.drop(columns=['Date', 'Links', 'Paths'])

# Set up training and test sets
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)

# Separate majority and minority classes within the training set
class_0 = X_train_up[X_train_up.Label == 0]
class_1 = X_train_up[X_train_up.Label == 1]

# Upsample the minority class to the size of the majority class
class_1_upsampled = resample(class_1,
                             replace=True,
                             n_samples=len(class_0),
                             random_state=27)

# Combine majority class and upsampled minority class
upsampled = pd.concat([class_0, class_1_upsampled])

Since my dataset looks like:

Label     Text 
1        bla bla bla
0        once upon a time 
1        some other sentences
1        a few sentences more
1        this is my dataset!

I applied a vectorizer to transform the strings into numbers:

X_train_up = upsampled[['Text']]
y_train_up = upsampled['Label']

# 'vectorizer' is an already created instance, e.g. a CountVectorizer
X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.nan, "")).todense(),
                          index=X_train_up.index)

Then I fitted the logistic regression model:

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)

However, I have got the following error at this step:

X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.nan, "")).todense(),
                         index=X_test_up.index)

pred_up_log = upsampled_log.predict(X_test_up)

ValueError: X has 3021 features per sample; expecting 5542

Since I was told that over-sampling should be applied after splitting the dataset into train and test sets, I had not vectorised the test set at that point. My doubts are then the following:

  • is it right to vectorise the test set only afterwards, i.e. X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.nan, "")).todense(), index=X_test_up.index), or should I reuse the vectorizer already fitted on the training set and call only transform (see the sketch after this list)?
  • is it right to apply the over-sampling after splitting the dataset into training and test sets?
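
If I understand correctly, each call to fit_transform learns a fresh vocabulary, which would explain the mismatch (3021 vs 5542 features). A minimal sketch of the two options, using CountVectorizer as a stand-in for my vectorizer:

from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["bla bla bla", "once upon a time", "some other sentences"]
test_texts = ["a few sentences more"]

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)

# Option A: re-fitting on the test set learns a new vocabulary, so widths differ
X_test_refit = CountVectorizer().fit_transform(test_texts)
print(X_train_vec.shape[1], X_test_refit.shape[1])  # 7 3

# Option B: reusing the fitted vectorizer keeps the columns aligned
X_test_vec = vectorizer.transform(test_texts)
print(X_train_vec.shape[1], X_test_vec.shape[1])    # 7 7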

Alternatively, I tried the SMOTE function. The code below works, but if possible I would prefer plain random over-sampling rather than SMOTE.

from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df['Text'], df['Label'],
                                                                test_size=0.2, random_state=42)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train_tfidf, y_train_up)  # fit_sample in older imbalanced-learn
print("Shape after SMOTE is:", X_train_res.shape, y_train_res.shape)

nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
# Apply the same count + tf-idf transforms to the test set before predicting
y_pred = nb.predict(tfidf_transformer.transform(count_vect.transform(X_test_up)))
print(accuracy_score(y_test_up, y_pred))

Any comments and suggestions will be appreciated. Thanks


Solution

  • It is better to do the CountVectorizer fit and tf-idf transformation on the whole dataset, then split into train and test, and keep the result as a sparse matrix instead of converting it back into a DataFrame.

    For example, take this toy dataset:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({'Text':['This is bill','This is mac',"here's an old saying",
                               'at least old','data scientist years','data science is data wrangling',
                               'This rings particularly','true for data science leaders',
                               'who watch their data','scientists spend days',
                               'painstakingly picking apart','ossified corporate datasets',
                               'arcane Excel spreadsheets','Does data science really',
                               'they just delegate the job','Data Is More Than Just Numbers',
                               'The reason that',
                               'data wrangling is so difficult','data is more than text and numbers'],
                       'Label':[0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})
    

    We perform the vectorization and tf-idf transformation, followed by the split:

    count_vect = CountVectorizer()
    df_counts = count_vect.fit_transform(df['Text'])
    tfidf_transformer = TfidfTransformer()
    df_tfidf = tfidf_transformer.fit_transform(df_counts)

    X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df_tfidf, df['Label'].values,
                                                                    test_size=0.2, random_state=42)
    

    Upsampling can then be done by resampling the indices of the minority class:

    class_0 = np.where(y_train_up == 0)[0]
    class_1 = np.where(y_train_up == 1)[0]

    # Draw minority-class indices with replacement until they match the majority count
    up_idx = np.concatenate((class_0,
                             np.random.choice(class_1, len(class_0), replace=True)))

    upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx, :], y_train_up[up_idx])
    

    And the prediction will work:

    upsampled_log.predict(X_test_up)
    array([0, 1, 0, 0])
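
    If you want a quick sanity check of those predictions, you can score them against the held-out labels as usual; accuracy_score is just one possible metric here:

    from sklearn.metrics import accuracy_score, classification_report

    y_pred = upsampled_log.predict(X_test_up)
    print(accuracy_score(y_test_up, y_pred))
    print(classification_report(y_test_up, y_pred))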
    

    You may have concerns about data leakage, that is, some information from the test set making its way into training through fitting TfidfTransformer() on the whole dataset. I have honestly yet to see concrete proof or a demonstration that this matters here, but below is an alternative where the tf-idf weighting is fitted on the training data only:

    count_vect = CountVectorizer()
    df_counts = count_vect.fit_transform(df['Text'])  # note: the vocabulary is still built on the full corpus

    X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df_counts, df['Label'].values,
                                                                    test_size=0.2, random_state=42)

    class_0 = np.where(y_train_up == 0)[0]
    class_1 = np.where(y_train_up == 1)[0]
    up_idx = np.concatenate((class_0,
                             np.random.choice(class_1, len(class_0), replace=True)))

    # Fit the tf-idf weighting on the upsampled training counts only
    tfidf_transformer = TfidfTransformer()
    upsampled_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx, :])
    upsampled_y = y_train_up[up_idx]

    upsampled_log = LogisticRegression(solver='liblinear').fit(upsampled_Xtrain, upsampled_y)

    # Only transform (never fit) the test counts with the already-fitted transformer
    X_test_up = tfidf_transformer.transform(X_test_up)
    upsampled_log.predict(X_test_up)
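
    Finally, since you mentioned preferring plain over-sampling to SMOTE: imbalanced-learn also provides RandomOverSampler, which duplicates minority samples at random (the same idea as the np.random.choice approach above) and works directly on sparse matrices. A minimal sketch, assuming a reasonably recent imbalanced-learn and reusing the tf-idf train/test split from the first variant:

    from imblearn.over_sampling import RandomOverSampler

    ros = RandomOverSampler(random_state=27)
    X_train_ros, y_train_ros = ros.fit_resample(X_train_up, y_train_up)

    upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_ros, y_train_ros)
    upsampled_log.predict(X_test_up)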