Tags: python, python-3.x, pandas, scikit-learn, countvectorizer

Python Access Labels of Sklearn CountVectorizer


Here is my df after cleaning:

    number  summary             cleanSummary
0   1-123   he loves ice cream  love ice cream
1   1-234   she loves ice       love ice
2   1-345   i hate avocado      hate avocado
3   1-123   i like skim milk    like skim milk

As you can see, there are two records that have the same number. Now I'll create and fit the vectorizer.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1,1), analyzer='word')
cv.fit(df['cleanSummary'])

Now I'll transform.

freq = cv.transform(df['cleanSummary'])

Now if I take a look at freq...

freq = sum(freq).toarray()[0]
freq = pd.DataFrame(freq, columns=['frequency'])
freq

    frequency
0   1
1   1
2   1
3   2
4   1
5   2
6   1
7   1

...there doesn't seem to be a logical way to get back to the original number. I have tried looping through each row, but that runs into problems because a single number can have multiple summaries. A loop over a grouped df...

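def extractFeatures(groupedDF, textCol):
    features = pd.DataFrame()
    for number, group in groupedDF:
        # Total count of each ngram across the summaries in this group
        freq = cv.transform(group[textCol])
        freq = sum(freq).toarray()[0]
        freq = pd.DataFrame(freq, columns=['frequency'])
        # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
        dfinner = pd.DataFrame(cv.get_feature_names_out(), columns=['ngram'])
        dfinner['number'] = number
        dfinner = dfinner.join(freq)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        features = pd.concat([features, dfinner])
    return features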

...works, but the performance is terrible (about 12 hours to run through 45,000 documents, each one sentence long).

If I change

freq = sum(freq).toarray()[0]

to

freq = freq.toarray()

I get an array of frequencies for each ngram for each document. This is good, but I can't see how to push that 2-D array into a dataframe in the shape I want, and I still wouldn't be able to access number.

How do I access the original number label for each ngram without looping over a grouped df? My desired result is:

number    ngram    frequency
1-123     love     1
1-123     ice      1
1-123     cream    1
1-234     love     1
1-234     ice      1
1-345     hate     1 
1-345     avocado  1
1-123     like     1  
1-123     skim     1 
1-123     milk     1

Edit: this is somewhat of a revisit to this question: "Convert CountVectorizer and TfidfTransformer Sparse Matrices into Separate Pandas Dataframe Rows". However, after implementing the method described in that answer, I ran into memory issues on a large corpus, so it doesn't seem scalable.


Solution

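    freq = cv.fit_transform(df.cleanSummary)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
    dtm = pd.DataFrame(freq.toarray(), columns=cv.get_feature_names_out(), index=df.number).stack()
    dtm[dtm > 0]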
    
    number         
    1-123   cream      1
            ice        1
            love       1
    1-234   ice        1
            love       1
    1-345   avocado    1
            hate       1
    1-123   like       1
            milk       1
            skim       1
    dtype: int64