Tags: scipy, scikit-learn, scikits

TruncatedSVD on TF-IDF gives ValueError: array is too big


I am trying to apply TruncatedSVD.fit_transform() to the sparse matrix produced by TfidfVectorizer in scikit-learn, which raises an error:

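    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    tsv = TruncatedSVD(n_components=10000, algorithm='randomized', n_iter=5)
    tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True)
    tfv.fit(text)
    text = tfv.transform(text)  # sparse TF-IDF matrix
    tsv.fit(text)               # fails here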

ValueError: array is too big

What other approaches can I use for dimensionality reduction?


Solution

  • I am pretty sure that the problem is:

    tsv = TruncatedSVD(n_components=10000...
    

    You have 10000 components in your SVD. If you have an m x n data matrix, SVD will have matrices with dimensions m x n_components and n_components x n, and these will be dense, even if the data was sparse. Those matrices are probably too big.

    I copied your code and ran it on the Kaggle Hashtag data (which is what I think this is from), and at 300 components, Python was using up to 1 GB. At 10000, you'd use about 30 times that.

    Incidentally, what you are doing here is latent semantic analysis, and that isn't likely to benefit from this many components. Somewhere in the range of 50-300 should capture everything that matters.
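
To make the size argument concrete, here is a rough back-of-the-envelope sketch of the memory taken by the two dense factor matrices described above. The matrix shape is hypothetical (the question does not give one), the helper name `dense_factor_bytes` is made up for illustration, and the estimate assumes 8-byte float64 entries and ignores any temporary arrays the solver allocates, so treat it as a lower bound rather than a measurement.

```python
def dense_factor_bytes(n_samples, n_features, n_components, itemsize=8):
    """Bytes for the two dense factors: (n_samples x k) and (k x n_features).

    itemsize=8 assumes float64 entries.
    """
    return (n_samples * n_components + n_components * n_features) * itemsize


# Hypothetical shape, chosen only for illustration; substitute text.shape.
N_SAMPLES, N_FEATURES = 100_000, 500_000

# Compare the range suggested for LSA (50-300) with the original 10000.
estimates = {
    k: dense_factor_bytes(N_SAMPLES, N_FEATURES, k) for k in (50, 300, 10_000)
}

for k, nbytes in estimates.items():
    print(f"n_components={k:>6}: about {nbytes / 1e9:.2f} GB")
```

The cost grows linearly with n_components, so whatever the real shape is, 10000 components need roughly 33 times the memory of 300.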