Tags: python, scikit-learn, toarray

Using sklearn's toarray method exhausts all available RAM


When the following code runs on Google Colab, it consumes all of the RAM as soon as it reaches the toarray call. I looked for an answer and found suggestions to use HashingVectorizer instead. How can I implement it in the following code?

The shape of cv.fit_transform(data_list) is (324430, 351550)

import re
import pandas as pd

# Loading the dataset
data = pd.read_csv("Language Detection.csv")
# value count for each language
data["Language"].value_counts()
# separating the independent and dependent features
X = data["Text"]
y = data["Language"]
# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols, numbers and newlines
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
# creating bag of words using countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()  # <-- this dense conversion uses all the RAM
# train/test splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# model creation and prediction
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)

Solution

  • Just don't use toarray. The output of CountVectorizer is a scipy sparse matrix, and MultinomialNB handles sparse input directly; see the first sketch below. A dense array of shape (324430, 351550) with 8-byte integer entries would need roughly 324430 × 351550 × 8 bytes ≈ 900 GB, which is why toarray exhausts the RAM.

    If you really want to use hashing, you should be able to simply replace CountVectorizer with HashingVectorizer; see the second sketch below.
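
First, a minimal sketch of the sparse-matrix fix, reusing data_list and y from the question's code. The only change to the original pipeline is dropping .toarray():

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()
X = cv.fit_transform(data_list)  # scipy.sparse CSR matrix; no dense blow-up

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
model = MultinomialNB()
model.fit(x_train, y_train)  # MultinomialNB accepts sparse matrices directly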
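
Second, a sketch of the hashing variant, with two assumptions worth flagging: alternate_sign=False is set because MultinomialNB rejects negative feature values (HashingVectorizer's default alternate_sign=True produces them), and n_features=2**20 simply spells out the library's default table size, which can be tuned if hash collisions become a concern:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no vocabulary is stored, so no fit step is needed.
# alternate_sign=False keeps all feature values non-negative for MultinomialNB.
hv = HashingVectorizer(alternate_sign=False, n_features=2**20)
X = hv.transform(data_list)  # sparse output with exactly n_features columns

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
model = MultinomialNB().fit(x_train, y_train)

One trade-off of hashing: unlike CountVectorizer, HashingVectorizer keeps no vocabulary, so feature indices cannot be mapped back to words.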