
Reduce Pickle size TfidfVectorizer


I need to standardize some parameters for building vectors from text. That is why I am pickling a TfidfVectorizer fitted on a group of text documents. Based on those parameters I need to vectorize new text documents, and their features and weighting criteria should be the same as for the previous documents.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
        strip_accents='ascii', sublinear_tf=True, min_df=5, norm='l2',
        encoding='latin-1', ngram_range=(1, 2), stop_words=spanish_stopwords,
        token_pattern=r'\w+[a-z,ñ]')
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()

features.shape

(617, 22997)

import pickle
pickle.dump(tfidf, open("vectorizer3.pickle", "wb"))

The size of vectorizer3.pickle is 76.2 MB. Is there a way to reduce this to around 10 MB?
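For context, reusing the pickled vectorizer later would look something like this. This is a minimal sketch with a made-up toy corpus standing in for `df.Consumer_complaint_narrative`, and it drops `min_df=5` and the custom stop words so it runs on three documents:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a small toy corpus (stand-in for the real complaint narratives)
docs = ["the quick brown fox", "the lazy dog", "the quick dog jumps"]
tfidf = TfidfVectorizer(ngram_range=(1, 2))
tfidf.fit(docs)

with open("vectorizer3.pickle", "wb") as f:
    pickle.dump(tfidf, f)

# Later: load the fitted vectorizer and transform new documents
with open("vectorizer3.pickle", "rb") as f:
    loaded = pickle.load(f)

# New documents are projected onto the vocabulary learned at fit time,
# so the feature columns and IDF weights match the original corpus
new_features = loaded.transform(["a quick brown dog"])
```

Because `transform` (not `fit_transform`) is used on the loaded object, the new matrix has exactly the columns of the original vocabulary, which is the consistency the question asks for.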


Solution

  • Try using gzip

    import gzip
    import pickle
    
    # write the pickle through gzip; compression can take a while
    fp = gzip.open('tfidf.data','wb')
    pickle.dump(tfidf,fp)
    fp.close()
    
    # read the file back; this assumes tfidf.data was written with gzip above
    fp = gzip.open('tfidf.data','rb')
    tfidf = pickle.load(fp)
    fp.close()
    

    This method does not guarantee a file under 10 MB, but it will definitely reduce the size of the pickle file.
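  • An alternative not in the original answer: `joblib` (which scikit-learn itself depends on) can write a compressed pickle directly via its `compress` parameter, and scikit-learn's docs note that a fitted vectorizer's `stop_words_` attribute exists only for introspection and can be deleted before pickling to shrink the file. A sketch on a toy corpus:

    ```python
    import os
    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the quick brown fox", "the lazy dog", "the quick dog jumps"] * 50
    tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

    # stop_words_ is only kept for introspection; dropping it saves space
    if hasattr(tfidf, "stop_words_"):
        del tfidf.stop_words_

    # compress takes 0-9; higher trades write speed for a smaller file
    joblib.dump(tfidf, "tfidf.joblib", compress=3)

    loaded = joblib.load("tfidf.joblib")
    ```

    Whether this gets below 10 MB depends on the vocabulary size; with `ngram_range=(1, 2)` over ~23k features, trimming `stop_words_` and compressing usually helps more than either alone.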