Tags: python, scikit-learn, nltk, countvectorizer

Combining CountVectorizer and ngrams in Python


Have a task to classify male and female names, using ngrams. So, have a dataframe like:

    name    is_male
Dorian      1
Jerzy       1
Deane       1
Doti        0
Betteann    0
Donella     0

The specific requirement is to use

from nltk.util import ngrams

for this task, to create ngrams (n=2,3,4)

I made a list of names, then used ngrams:

from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

test_ngrams = []
for name in name_list:
    test_ngrams.append(list(ngrams(name,3)))

Now I need to somehow vectorize all of this so I can use it for classification. I try

X_train = count_vect.fit_transform(test_ngrams)

I receive:

AttributeError: 'list' object has no attribute 'lower'

I understand that a list is the wrong type of input here. Can someone please explain how I should do this, so that I can later use MultinomialNB, for example? Am I doing it the right way at all? Thanks in advance!


Solution

  • You are passing a sequence of lists to the vectorizer, which is why you are receiving the AttributeError. Instead, you should pass an iterable of strings. From the CountVectorizer documentation:

    fit_transform(raw_documents, y=None)

    Learn the vocabulary dictionary and return term-document matrix.

    This is equivalent to fit followed by transform, but more efficiently implemented.

    Parameters: raw_documents : iterable

    An iterable which yields either str, unicode or file objects.
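    A minimal sketch of the difference (the sample names are borrowed from the question):

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    count_vect = CountVectorizer()

    # An iterable of raw strings is what fit_transform expects:
    X = count_vect.fit_transform(["Dorian", "Jerzy", "Deane"])
    print(X.shape)  # one row per name, one column per vocabulary token

    # Passing lists of n-gram tuples instead reproduces the error above,
    # because the default preprocessor calls .lower() on each "document":
    err = None
    try:
        count_vect.fit_transform([[("D", "o", "r")], [("J", "e", "r")]])
    except AttributeError as exc:
        err = str(exc)
    print(err)
    ```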

    To answer your question, the CountVectorizer is capable of creating N-grams by using ngram_range (the following produces bigrams):

    count_vect = CountVectorizer(ngram_range=(2,2))
    
    corpus = [
        'This is the first document.',
        'This is the second second document.',
    ]
    X = count_vect.fit_transform(corpus)
    
    print(count_vect.get_feature_names())
    ['first document', 'is the', 'second document', 'second second', 'the first', 'the second', 'this is']
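    Since the task asks for n = 2, 3, and 4, ngram_range can cover all three sizes in one pass. A sketch using character n-grams (analyzer='char') — an assumption on my part, but a natural fit for single-word names; the names are sample rows from the question:

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    names = ["Dorian", "Doti"]  # sample names from the question

    # analyzer='char' switches to character n-grams; ngram_range=(2, 4)
    # extracts all bigrams, trigrams, and 4-grams at once.
    count_vect = CountVectorizer(analyzer='char', ngram_range=(2, 4))
    X = count_vect.fit_transform(names)

    # Every feature is a character n-gram of length 2, 3, or 4:
    print(sorted(count_vect.vocabulary_))
    ```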
    

    Update:

    Since you mentioned that you have to generate ngrams using NLTK, we need to override parts of the default behaviour of the CountVectorizer. Namely, the analyzer which converts raw strings into features:

    analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable

    [...]

    If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

    Since we already provide ngrams, an identity function suffices:

    count_vect = CountVectorizer(
        analyzer=lambda x:x
    )
    

    Complete Example combining NLTK ngrams and CountVectorizer:

    import nltk
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        'This is the first document.',
        'This is the second second document.',
    ]
    
    def build_ngrams(text, n=2):
        # Lowercase, split on whitespace, and return the n-gram tuples.
        tokens = text.lower().split()
        return list(nltk.ngrams(tokens, n))
    
    corpus = [build_ngrams(document) for document in corpus]
    
    count_vect = CountVectorizer(
        analyzer=lambda x:x
    )
    
    X = count_vect.fit_transform(corpus)
    print(count_vect.get_feature_names())
    [('first', 'document.'), ('is', 'the'), ('second', 'document.'), ('second', 'second'), ('the', 'first'), ('the', 'second'), ('this', 'is')]
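    To tie this back to the original name-classification task, here is a sketch of the full pipeline — character n-grams from NLTK (n = 2, 3, 4), the identity-analyzer vectorizer, and MultinomialNB. The helper name_ngrams and the six training rows (taken from the question's sample dataframe) are my own assumptions, not part of the original answer:

    ```python
    import nltk
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    names = ["Dorian", "Jerzy", "Deane", "Doti", "Betteann", "Donella"]
    is_male = [1, 1, 1, 0, 0, 0]

    def name_ngrams(name, ns=(2, 3, 4)):
        # Character n-grams of the lowercased name, for each n in ns.
        chars = name.lower()
        return [gram for n in ns for gram in nltk.ngrams(chars, n)]

    features = [name_ngrams(name) for name in names]

    # Identity analyzer: the n-gram tuples are already the features.
    count_vect = CountVectorizer(analyzer=lambda x: x)
    X_train = count_vect.fit_transform(features)

    clf = MultinomialNB().fit(X_train, is_male)

    # Score a new name with the *same* fitted vectorizer (transform, not fit_transform):
    X_new = count_vect.transform([name_ngrams("Dorian")])
    print(clf.predict(X_new))
    ```

    The key detail is reusing the fitted vectorizer at prediction time, so new names are mapped onto the vocabulary learned from the training data.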