nltk, n-gram, sklearn-pandas, countvectorizer

How to apply countvectorizer to bigrams in a pandas dataframe


I'm trying to apply CountVectorizer to a DataFrame column of bigrams to get a frequency matrix showing how many times each bigram appears in each row, but I keep getting error messages.

This is what I tried:

cereal['bigrams'].head()

0    [(best, thing), (thing, I), (I, have),....
1    [(eat, it), (it, every), (every, morning),...
2    [(every, morning), (morning, my), (my, brother),...
3    [(I, have), (five, cartons), (cartons, lying),...
.........
bow = CountVectorizer(max_features=5000, ngram_range=(2,2))
train_bow = bow.fit_transform(cereal['bigrams'])
train_bow

Expected results


   (best, thing)  (thing, I)  (I, have)  (eat, it)  (every, morning)  ...
0              1           1          1          0                 0
1              0           0          0          1                 1
2              0           0          0          0                 1
3              0           0          1          0                 0
...




Solution

  • I see you are trying to convert a pd.Series into a count representation of each term.

    That's a bit different from what CountVectorizer does.

    From the class description:

    Convert a collection of text documents to a matrix of token counts

    The official usage example is:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> corpus = [
    ...     'This is the first document.',
    ...     'This document is the second document.',
    ...     'And this is the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> vectorizer = CountVectorizer()
    >>> X = vectorizer.fit_transform(corpus)
    >>> print(vectorizer.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
    ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
    >>> print(X.toarray())  
    [[0 1 1 1 0 0 1 0 1]
     [0 2 0 1 0 1 1 0 1]
     [1 0 0 1 1 0 1 1 1]
     [0 1 1 1 0 0 1 0 1]]
    

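    So, as you can see, it takes as input a list in which each element is a "document" (a string). That's probably the cause of the errors you are getting: you are passing a pd.Series in which each element is a list of tuples, so the default preprocessing, which expects strings, fails (typically with AttributeError: 'list' object has no attribute 'lower').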

    For you to use CountVectorizer you would have to transform your input into the proper format.

    If you have the original corpus/text, you can simply apply CountVectorizer to it (with the ngram_range parameter) to get the desired result.

    Otherwise, the best solution would be to treat the data as what it is: a Series of lists whose items must be counted and pivoted.

    Sample workaround:

    (screenshot of the workaround; image not preserved)
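A rough sketch of the count-and-pivot idea using pandas only. This is not the code from the original answer; the sample data and the column names `row` and `bigram` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for cereal['bigrams'].
bigrams = pd.Series([
    [("best", "thing"), ("thing", "I"), ("I", "have")],
    [("eat", "it"), ("it", "every"), ("every", "morning")],
    [("every", "morning"), ("morning", "my"), ("my", "brother")],
    [("I", "have"), ("five", "cartons"), ("cartons", "lying")],
])

# One entry per (row, bigram) pair. Rows holding an empty list become NaN
# and are dropped; tuples are joined into strings so they make plain columns.
exploded = bigrams.explode().dropna().map(" ".join)

# Long format: the original row label in one column, the bigram in another.
long_form = exploded.rename("bigram").rename_axis("row").reset_index()

# Pivot: count how often each bigram occurs in each original row.
counts = pd.crosstab(long_form["row"], long_form["bigram"])

# Restore any rows that had no bigrams at all, filled with zeros.
counts = counts.reindex(bigrams.index, fill_value=0)
print(counts)
```

The result is a dense DataFrame, which is fine for inspection but can get large; with many distinct bigrams, a sparse matrix from CountVectorizer is likely the better fit.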

    (It would be a lot easier if you just used the text corpus instead.)

    Hope it helps!