I'm trying to replicate an application described in this paper (section 4.1), where Sparse Principal Component Analysis is applied to a text corpus with the output being K principal components, each displaying a 'structure that is otherwise hidden'. In other words, the principal components should each contain a list of words, all of which share a common theme.
I used sklearn's MiniBatchSparsePCA class to try to replicate the application, but my output is a matrix of zeros.
My data comes from a survey which was cleaned in Stata. It is a vector of 368 answers, each of which is a sentence.
My Attempt
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import decomposition
# Data comes from a survey, which was cleaned using Stata.
data_source = "/Users/****/q19_free_text.dta"
raw_data = pd.read_stata(data_source) #Reading in the data from a Stata file.
text_data = raw_data.iloc[:,1] # Keep only the text column, dropping the observation ID number.
text_data.shape # Out[268]: (368, ) - There are 368 text (sentence) answers.
# Term Frequency - Inverse Document Frequency (TF-IDF)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,stop_words='english')
X_train = vectorizer.fit_transform(text_data)
spca = decomposition.MiniBatchSparsePCA(n_components=2, alpha=0.5)
# spca.fit(X_train)
# TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
X_train2 = X_train.toarray() # Trying with a dense array...
spca.fit(X_train2) # Fit on the dense array so that components_ is populated.
components = spca.components_
print(components) #Out: [[ 0. 0. 0. ..., 0. 0. 0.]
# [ 0. 0. 0. ..., 0. 0. 0.]]
components.shape #Out: (2, 916)
# Empty output!
Other Notes
I used these sources to write the above code:
(...) to do something similar to that which is done in section 4.1 in the paper linked. There they 'summarize' a text corpus by using SPCA and the output is K components, where each component is a list of words (or, features).
If I understand you correctly, you ask how to retrieve words for the components.
You can do this by retrieving the indices of the nonzero entries in components (use appropriate numpy code on components). Then, using vectorizer.vocabulary_ (a dict mapping each term to its column index), you can find out which words/tokens those indices correspond to.
See this notebook for an example implementation (I used 20 newsgroups dataset).
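As a rough illustration of the lookup described above, here is a minimal sketch. The helper name words_per_component and the toy inputs are hypothetical. In your case you would pass spca.components_ and vectorizer.vocabulary_ instead. It assumes the vocabulary is a dict of word -> column index, which is how TfidfVectorizer stores it.

```python
import numpy as np

def words_per_component(components, vocabulary):
    """Map each component's nonzero loadings back to words.

    components: array of shape (n_components, n_features), e.g. spca.components_
    vocabulary: dict mapping word -> column index, e.g. vectorizer.vocabulary_
    """
    # Invert the word -> index mapping so words can be looked up by column.
    index_to_word = {idx: word for word, idx in vocabulary.items()}
    result = []
    for row in np.asarray(components):
        nonzero = np.flatnonzero(row)  # column indices with a nonzero loading
        # Order those columns by absolute loading, largest first.
        order = nonzero[np.argsort(-np.abs(row[nonzero]))]
        result.append([(index_to_word[int(i)], float(row[i])) for i in order])
    return result

# Hypothetical toy inputs standing in for vectorizer.vocabulary_ and spca.components_.
toy_vocabulary = {"price": 0, "cost": 1, "staff": 2, "service": 3}
toy_components = np.array([[0.8, -0.6, 0.0, 0.0],
                           [0.0, 0.0, 0.3, 0.9]])

for k, words in enumerate(words_per_component(toy_components, toy_vocabulary)):
    print("component", k, words)
```

If every row comes back as an empty list, the components are all zero, as in your output. In that case it may be worth trying a smaller alpha, since higher values of that parameter lead to sparser components.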