Question: What is the best way to convert sparse matrices resulting from sklearn's CountVectorizer and TfidfTransformer into Pandas DataFrame columns with a separate row for each bigram and its corresponding frequency and tf-idf score?
Pipeline: Bring in text data from a SQL DB, split text into bigrams and calculate the frequency per document and the tf-idf per bigram per document, load the results back into the SQL DB.
Current State:
Two columns of data are brought in (number
, text
). text
is cleaned to produce a third column cleanText
:
number text cleanText
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
This DataFrame is fed into sklearn's feature extraction:
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)
tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)
Then the matrices are fed back into the original DataFrame after converting them to an array:
data['frequency'] = list(dt_mat.toarray())
data['tfidf_score']=list(tfidf_mat.toarray())
Output:
number text cleanText \
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
frequency tfidf_score
0 [0, 1, 0, 0, 0, 1, 0] [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...
1 [0, 0, 1, 0, 1, 0, 1] [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...
2 [1, 0, 0, 1, 0, 0, 0] [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0...
Problems:
frequency
and tfidf_score
are not on separate lines for each bigramDesired Output:
number bigram frequency tfidf_score
0 123 farmer plants 1 0.70
0 123 plants grain 1 0.56
1 234 farmer son 1 0.72
1 234 son go 1 0.63
1 234 go fishing 1 0.34
2 345 fisher catches 1 0.43
2 345 catches tuna 1 0.43
I managed to get one of the numeric columns assigned to separate rows of the DataFrame with this code:
data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn])
for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])
Output:
number tfidf_score
0 123 0.000000
1 123 0.707107
2 123 0.000000
3 123 0.000000
4 123 0.000000
5 123 0.707107
6 123 0.000000
7 234 0.000000
8 234 0.000000
9 234 0.577350
10 234 0.000000
11 234 0.577350
12 234 0.000000
13 234 0.577350
14 345 0.707107
15 345 0.000000
16 345 0.000000
17 345 0.707107
18 345 0.000000
19 345 0.000000
20 345 0.000000
However, I am unsure how to do this for both numeric columns, and this doesn't bring in the bigrams (feature names) themselves. Also, this method requires an array (which is why I converted the sparse matrices to arrays in the first place), and I would like to avoid this if possible due to performance issues and the fact that I would then have to strip the meaningless rows.
Any insight is greatly appreciated! Thank you very much for taking the time to read this question - I apologize for the length. Please let me know if there's anything I can do to improve the question or clarify my process.
The bigram names can be captured using CountVectorizer
's get_feature_names()
. From there it's just a series of melt
and merge
operations:
print(data)
number text cleanText
0 123 The farmer plants grain farmer plants grain
1 234 The farmer and his son go fishing farmer son go fishing
2 345 The fisher catches tuna fisher catches tuna
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)
tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)
The CountVectorizer
feature names are, in this case, the bigrams:
print(cv.get_feature_names())
[u'catches tuna',
u'farmer plants',
u'farmer son',
u'fisher catches',
u'go fishing',
u'plants grain',
u'son go']
CountVectorizer.fit_transform()
returns a sparse matrix. We can convert it to a dense representation, wrap it in a DataFrame
, and then tack on the feature names as columns:
bigrams = pd.DataFrame(dt_mat.todense(), index=data.index, columns=cv.get_feature_names())
bigrams['number'] = data.number
print(bigrams)
catches tuna farmer plants farmer son fisher catches go fishing \
0 0 1 0 0 0
1 0 0 1 0 1
2 1 0 0 1 0
plants grain son go number
0 1 0 123
1 0 1 234
2 0 0 345
To go from wide to long format, use melt()
.
Then restrict the results to bigram matches (query()
is useful here):
bigrams_long = (pd.melt(bigrams.reset_index(),
id_vars=['index','number'],
value_name='bigram_ct')
.query('bigram_ct > 0')
.sort_values(['index','number']))
index number variable bigram_ct
3 0 123 farmer plants 1
15 0 123 plants grain 1
7 1 234 farmer son 1
13 1 234 go fishing 1
19 1 234 son go 1
2 2 345 catches tuna 1
11 2 345 fisher catches 1
Now repeat the process for tfidf
:
tfidf = pd.DataFrame(tfidf_mat.todense(), index=data.index, columns=cv.get_feature_names())
tfidf['number'] = data.number
tfidf_long = pd.melt(tfidf.reset_index(),
id_vars=['index','number'],
value_name='tfidf').query('tfidf > 0')
Finally, merge bigrams
and tfidf
:
fulldf = (bigrams_long.merge(tfidf_long,
on=['index','number','variable'])
.set_index('index'))
number variable bigram_ct tfidf
index
0 123 farmer plants 1 0.707107
0 123 plants grain 1 0.707107
1 234 farmer son 1 0.577350
1 234 go fishing 1 0.577350
1 234 son go 1 0.577350
2 345 catches tuna 1 0.707107
2 345 fisher catches 1 0.707107