Here is my df after cleaning:
  number             summary    cleanSummary
0  1-123  he loves ice cream  love ice cream
1  1-234       she loves ice        love ice
2  1-345      i hate avocado    hate avocado
3  1-123    i like skim milk  like skim milk
As you can see, there are two records that share the same number (1-123). Now I'll create and fit the vectorizer.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1,1), analyzer='word')
cv.fit(df['cleanSummary'])
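For reference, the fitted vocabulary assigns each term an alphabetically ordered column index, which is what the row positions in the frequency output further down correspond to:

sorted(cv.vocabulary_.items(), key=lambda kv: kv[1])
[('avocado', 0), ('cream', 1), ('hate', 2), ('ice', 3), ('like', 4), ('love', 5), ('milk', 6), ('skim', 7)]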
Now I'll transform.
freq = cv.transform(df['cleanSummary'])
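At this point freq is a sparse document-term matrix, one row per summary and one column per vocabulary term:

freq.shape   # one row per summary, one column per vocabulary term
(4, 8)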
Now if I take a look at freq
...
freq = sum(freq).toarray()[0]
freq = pd.DataFrame(freq, columns=['frequency'])
freq
   frequency
0          1
1          1
2          1
3          2
4          1
5          2
6          1
7          1
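(For context, those row positions follow the vectorizer's alphabetical vocabulary, so the corpus-wide totals can be labelled with the same join my loop below uses per group; on newer scikit-learn, get_feature_names() is get_feature_names_out().)

ngrams = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
ngrams.join(freq)
     ngram  frequency
0  avocado          1
1    cream          1
2     hate          1
3      ice          2
4     like          1
5     love          2
6     milk          1
7     skim          1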
...there doesn't seem to be a logical way to access the original number. I have tried looping through each row, but that runs into problems because there can be multiple summaries per number. A loop over a grouped df...
def extractFeatures(groupedDF, textCol):
    features = pd.DataFrame()
    for id, group in groupedDF:
        # count ngrams over this group's summaries and sum across its rows
        freq = cv.transform(group[textCol])
        freq = sum(freq).toarray()[0]
        freq = pd.DataFrame(freq, columns=['frequency'])
        # attach the vocabulary and this group's number, then accumulate
        dfinner = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
        dfinner['number'] = id
        dfinner = dfinner.join(freq)
        features = features.append(dfinner)
    return features
...works, but the performance is terrible (i.e. 12 hours to run through 45,000 one-sentence documents).
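For reference, it gets called on the df grouped by number, roughly like this:

grouped = df.groupby('number')
features = extractFeatures(grouped, 'cleanSummary')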
If I change
freq = sum(freq).toarray()[0]
to
freq = freq.toarray()
I get an array of frequencies for each ngram for each document. This is good, but then it doesn't allow me to push that 2-D array into a dataframe, and I still wouldn't be able to access number.
How do I access the original number labels for each ngram without looping over a grouped df? My desired result is:
number  ngram    frequency
1-123   love     1
1-123   ice      1
1-123   cream    1
1-234   love     1
1-234   ice      1
1-345   hate     1
1-345   avocado  1
1-123   like     1
1-123   skim     1
1-123   milk     1
Edit: this is somewhat of a revisit to this question: Convert CountVectorizer and TfidfTransformer Sparse Matrices into Separate Pandas Dataframe Rows. However, after implementing the method described in that answer, I face memory issues for a large corpus, so it doesn't seem scalable.
freq = cv.fit_transform(df.cleanSummary)
dtm = pd.DataFrame(freq.toarray(), columns=cv.get_feature_names(), index=df.number).stack()
dtm[dtm > 0]
number
1-123   cream      1
        ice        1
        love       1
1-234   ice        1
        love       1
1-345   avocado    1
        hate       1
1-123   like       1
        milk       1
        skim       1
dtype: int64
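One sparse variant of that idea I've been considering (just a sketch, untested at scale) would read the (number, ngram, frequency) triples straight off the matrix's non-zero coordinates instead of densifying it with .toarray() first:

# Sketch: build the long-format frame from the sparse matrix's non-zero
# coordinates, so the full dense document-term matrix is never materialised
coo = freq.tocoo()                # freq from cv.fit_transform(df.cleanSummary)
terms = cv.get_feature_names()    # get_feature_names_out() on newer scikit-learn
long_df = pd.DataFrame({
    'number': df['number'].values[coo.row],    # map each non-zero entry back to its document's number
    'ngram': [terms[j] for j in coo.col],
    'frequency': coo.data,
})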