machine-learning · scikit-learn · text-mining · pca · sklearn-pandas

MiniBatchSparsePCA on Text Data


Goal

I'm trying to replicate an application described in this paper (section 4.1), where Sparse Principal Component Analysis is applied to a text corpus with the output being K principal components, each displaying a 'structure that is otherwise hidden'. In other words, the principal components should each contain a list of words, all of which share a common theme.

I have used sklearn's MiniBatchSparsePCA class to try to replicate the application, though my output is a matrix of zeros.

Data
My data comes from a survey which was cleaned in Stata. It is a vector of 368 answers, each of which is a sentence.
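
(The survey file itself is not shareable, so for anyone trying to reproduce this: the sentences below are a hypothetical stand-in; any list of short sentences will exercise the same pipeline.)

import pandas as pd

# Hypothetical stand-in for the real data: the actual text_data is a
# pandas Series of 368 sentence-length survey answers read from the .dta file.
text_data = pd.Series([
    "The course was well organised and the materials were clear.",
    "More practical examples would improve the seminars.",
    "Feedback on assessments arrived too late to be useful.",
])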

My Attempt

# IMPORT LIBRARIES #
####################################
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import decomposition
####################################

# USE PANDAS TO IMPORT STATA DATA. #
# Data comes from a survey, which was cleaned using Stata.

####################################
data_source = "/Users/****/q19_free_text.dta"
raw_data = pd.read_stata(data_source) #Reading in the data from a Stata file.  
text_data = raw_data.iloc[:,1] #Cleaning out Observation ID number.
text_data.shape     # Out[268]: (368, ) - There are 368 text (sentence) answers.
####################################

# Term Frequency – Inverse Document Frequency (TF-IDF)
####################################
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(text_data)

spca = decomposition.MiniBatchSparsePCA(n_components=2, alpha=0.5)
spca.fit(X_train)
# TypeError: A sparse matrix was passed, but dense data is required.
# Use X.toarray() to convert to a dense numpy array.

X_train2 = X_train.toarray()  # Trying again with a dense array...
spca.fit(X_train2)

components = spca.components_


print(components)  #Out: [[ 0.  0.  0. ...,  0.  0.  0.]
                   #     [ 0.  0.  0. ...,  0.  0.  0.]]

components.shape   #Out: (2, 916)

# The components are all zeros!
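
One possibility I have considered (my own assumption): alpha is the L1 sparsity penalty, and if it is too large for the scale of the TF-IDF values it can shrink every loading to exactly zero. A quick sweep makes this easy to check:

# Diagnostic sketch: count nonzero loadings across a range of alpha values.
# If alpha is too aggressive for the data scale, every component collapses
# to all zeros.
for alpha in [0.5, 0.1, 0.01, 0.001]:
    spca = decomposition.MiniBatchSparsePCA(n_components=2, alpha=alpha,
                                            random_state=0)
    spca.fit(X_train2)
    print(alpha, np.count_nonzero(spca.components_))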

Other Notes

I used these sources to write the above code:

  • Official Example
  • Vectorising Text data
  • Previous question on the same problem


Solution

  • (...) to do something similar to that which is done in section 4.1 in the paper linked. There they 'summarize' a text corpus by using SPCA and the output is K components, where each component is a list of words (or, features).

    If I understand you correctly, you are asking how to retrieve the words belonging to each component.

    You can do this by finding the indices of the nonzero entries in components (e.g. with numpy.nonzero). Then, since vectorizer.vocabulary_ maps each word/token to its column index, you can look up which words those indices correspond to; a minimal sketch follows below.

    See this notebook for an example implementation (I used 20 newsgroups dataset).
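
    Here is a minimal sketch of that lookup (my own code, not the notebook's; it assumes the spca and vectorizer objects from the question have already been fitted):

    import numpy as np

    # vectorizer.vocabulary_ maps token -> column index, so invert it to map
    # column index -> token (get_feature_names_out() does the same job in
    # newer versions of scikit-learn).
    index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

    for k, component in enumerate(spca.components_):
        nonzero = np.nonzero(component)[0]  # columns with nonzero loadings
        words = [index_to_word[i] for i in nonzero]
        print("Component %d: %s" % (k, ", ".join(words)))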