python, nlp, sparse-matrix, text-classification, tf-idf

NLP classification with sparse and numerical features crashes


I have a dataset of 10 million English shows, which has been cleaned and lemmatized, along with their classification into category types such as comedy, documentary, action, etc.

I also have a feature called duration, which is the length of the TV show.

Data can be found here

I perform TF-IDF vectorization on the titles, which returns a sparse matrix, and min-max normalization on the duration column.

Then I want to feed the data to a logistic regression classifier.

Side question: is there a better way to handle combining a sparse matrix and a numerical column?

When I try to do it using todense() or toarray(), the conversion itself works.

But when I pass the result to the logistic regression function, the notebook crashes. If I don't include the duration column, which means I don't have to apply toarray() or todense() at all, it works perfectly. Is this a memory issue?

This is my code:

import os

import pandas as pd

from sklearn import metrics
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

def normalize(df, col = ''):
    mms = MinMaxScaler()
    mms_col = mms.fit_transform(df[[col]])
    return mms_col

def tfidf(X, col = ''):
    tfidf_vectorizer = TfidfVectorizer(max_df = 0.8, max_features = 10000)
    return tfidf_vectorizer.fit_transform(X[col])

def get_training_data(df):
    df = shuffle(pd.read_csv(df).dropna())
    data = df[['name_title', 'Duration']]

    X_duration = normalize(data, col = 'Duration')
    X_sparse = tfidf(data, col = 'name_title')
    X = pd.DataFrame(X_sparse.toarray())  # densifies the sparse matrix; this is the memory-heavy step

    X['Duration'] = X_duration
    y = df['target']

    return X, y

def logistic_regression(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    lr = LogisticRegression(C = 100.0, random_state = 1, solver = 'lbfgs', multi_class = 'ovr')
    lr.fit(X_train, y_train)
    y_predict = lr.predict(X_test)
    print(y_predict)
    print("Logistic Regression Accuracy %.3f" %metrics.accuracy_score(y_test, y_predict))

data_path = '../data/'
X, y = get_training_data(os.path.join(data_path, 'podcasts_en_processed.csv'))
print(X.shape) # this prints (971426, 10001)
logistic_regression(X, y)

Solution

  • It seems like you're encountering a memory issue when combining the large sparse matrix from TF-IDF vectorization with the dense duration feature. Converting a sparse matrix to a dense one with toarray() or todense() dramatically increases memory usage: at your printed shape of (971426, 10001), a dense float64 matrix takes roughly 971,426 × 10,001 × 8 bytes ≈ 78 GB of RAM, which is almost certainly what crashes the notebook.

    Instead of converting the entire sparse matrix to dense, combine the sparse TF-IDF features with the dense duration feature while keeping the result in sparse format. Use scipy.sparse.hstack for this:

    from scipy.sparse import hstack
    
    # Combine the sparse TF-IDF matrix and the dense duration column.
    # hstack returns a COO matrix; convert to CSR so it supports the
    # row indexing that train_test_split and LogisticRegression need.
    X = hstack([X_sparse, X_duration]).tocsr()
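
    For a fuller picture, here is a minimal sketch of get_training_data rewritten this way; it reuses the normalize and tfidf helpers and the column names from your question, and simply drops the toarray() call:

    def get_training_data(path):
        df = shuffle(pd.read_csv(path).dropna())
        data = df[['name_title', 'Duration']]
    
        X_duration = normalize(data, col = 'Duration')  # dense (n, 1) array
        X_sparse = tfidf(data, col = 'name_title')      # sparse (n, 10000) matrix
    
        # Keep everything sparse: no toarray()/todense() anywhere
        X = hstack([X_sparse, X_duration]).tocsr()
        y = df['target']
    
        return X, y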
    

    This method maintains the efficiency of sparse storage, and LogisticRegression accepts scipy sparse input directly, so no dense conversion is needed. If you're still facing memory issues, consider reducing max_features in your TfidfVectorizer (10000 features may be more than you need), or switching to an incremental learning method such as SGDClassifier with a logistic regression loss. These approaches should help you manage the large dataset more effectively.
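
    If you go the incremental route, a minimal sketch could look like the following; note that the loss name depends on your scikit-learn version (loss = 'log_loss' in recent releases, loss = 'log' in older ones):

    from sklearn.linear_model import SGDClassifier
    
    # Logistic regression fit with stochastic gradient descent: it accepts
    # sparse input, and partial_fit() lets you train in mini-batches if even
    # the sparse matrix is too big to process in one go.
    sgd = SGDClassifier(loss = 'log_loss', random_state = 1)
    sgd.fit(X_train, y_train)
    print("SGD Accuracy %.3f" % metrics.accuracy_score(y_test, sgd.predict(X_test)))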