I have a task to classify male and female names using n-grams. I have a dataframe like:
name      is_male
Dorian    1
Jerzy     1
Deane     1
Doti      0
Betteann  0
Donella   0
The specific requirement is to use
from nltk.util import ngrams
for this task, to create n-grams (n=2, 3, 4).
I made a list of names, then used ngrams:
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
test_ngrams = []
for name in name_list:
    test_ngrams.append(list(ngrams(name, 3)))
Now I need to somehow vectorize all this for classification. I try:
X_train = count_vect.fit_transform(test_ngrams)
I receive:
AttributeError: 'list' object has no attribute 'lower'
I understand that a list is the wrong type of input here. Can someone please explain how I should do this, so that I can later use MultinomialNB, for example? Am I doing it the right way at all? Thanks in advance!
You are passing a sequence of lists to the vectorizer, which is why you are receiving the AttributeError. Instead, you should pass an iterable of strings. From the CountVectorizer documentation:
fit_transform(raw_documents, y=None)
Learn the vocabulary dictionary and return term-document matrix.
This is equivalent to fit followed by transform, but more efficiently implemented.
Parameters: raw_documents : iterable
An iterable which yields either str, unicode or file objects.
To answer your question: CountVectorizer is capable of creating n-grams on its own via the ngram_range parameter (the following produces bigrams):
count_vect = CountVectorizer(ngram_range=(2,2))
corpus = [
'This is the first document.',
'This is the second second document.',
]
X = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names())
['first document', 'is the', 'second document', 'second second', 'the first', 'the second', 'this is']
Update:
Since you mentioned that you have to generate the n-grams using NLTK, we need to override part of the default behaviour of CountVectorizer; namely, the analyzer, which converts raw strings into features:
analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable
[...]
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
Since we already provide ready-made n-grams, an identity function suffices:
count_vect = CountVectorizer(
    analyzer=lambda x: x
)
Complete example combining NLTK's ngrams and CountVectorizer:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]

def build_ngrams(text, n=2):
    tokens = text.lower().split()
    return list(nltk.ngrams(tokens, n))

# Replace each raw string with its list of n-gram tuples
corpus = [build_ngrams(document) for document in corpus]

count_vect = CountVectorizer(
    analyzer=lambda x: x
)
X = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names())
[('first', 'document.'), ('is', 'the'), ('second', 'document.'), ('second', 'second'), ('the', 'first'), ('the', 'second'), ('this', 'is')]
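Applying the same identity-analyzer trick back to the original task, a sketch might look like the following. It assumes the name/is_male columns from the question's dataframe (here inlined as plain lists) and generates NLTK character n-grams for all of n=2, 3, 4 per name:

```python
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

names = ['Dorian', 'Jerzy', 'Deane', 'Doti', 'Betteann', 'Donella']
is_male = [1, 1, 1, 0, 0, 0]

def name_ngrams(name, ns=(2, 3, 4)):
    # Character n-grams of the name; ngrams() yields tuples of characters
    features = []
    for n in ns:
        features.extend(ngrams(name.lower(), n))
    return features

# Identity analyzer: each "document" is already a list of n-gram features
count_vect = CountVectorizer(analyzer=lambda x: x)
X_train = count_vect.fit_transform(name_ngrams(name) for name in names)

clf = MultinomialNB().fit(X_train, is_male)
prediction = clf.predict(count_vect.transform([name_ngrams('Doreen')]))
```

The n-gram tuples work as vocabulary keys because CountVectorizer only requires the features returned by the analyzer to be hashable. With a real dataset you would of course fit on far more than six names before trusting the prediction.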