I have integer-type features in my feature vector that NLTK’s NaiveBayesClassifier
is treating as nominal values.
I am trying to build a language classifier using n-grams. For instance, the bigram 'th' is more common in English than in French. For each sentence in my training set, I extract features such as bigram(th): 5, where 5 is the number of times the bigram 'th' appeared in the sentence.
When I build a classifier with features like this and check the most informative features, I realize that the classifier does not treat such features as numeric. For example, it might consider bigram(ea): 4 as French, bigram(ea): 5 as English, and bigram(ea): 6 as French again. This is quite arbitrary and does not capture the logic that a bigram is simply more common in one language than in the other. This is why I need the integers to be treated as numeric values rather than nominal labels.
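Here is a toy sketch of what I mean (the counts are made up): since NLTK matches feature values atomically, 4, 5 and 6 are unrelated symbols to it:
from nltk.classify import NaiveBayesClassifier

# Made-up training data: each distinct count is an opaque symbol to NLTK
train = [({'bigram(ea)': 4}, 'french'),
         ({'bigram(ea)': 5}, 'english'),
         ({'bigram(ea)': 6}, 'french')]
classifier = NaiveBayesClassifier.train(train)
# Only an exact match on the value 5 counts as evidence for English;
# 4 and 6 contribute nothing towards values near them
classifier.classify({'bigram(ea)': 5})  # 'english'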
Of course, I could replace these features with binary features such as has(th): True. However, I believe this is a bad idea because a French sentence with 1 instance of 'th' and an English sentence with 5 instances of 'th' would both get the feature has(th): True, which cannot differentiate them.
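To make that concrete (a toy sketch, not my real extractor), the binary version maps both sentences to identical featuresets:
# Hypothetical helper: collapse counts to presence/absence
def get_binary_features(counts):
    return {f'has({ngram})': True for ngram in counts}

# 1 occurrence (French-ish) and 5 occurrences (English-ish) look the same
get_binary_features({'th': 1}) == get_binary_features({'th': 5})  # True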
I also found this relevant link but it did not provide me with the answer.
My feature extractor looks like this:
from nltk import ngrams

def get_ngrams(word, n):
    # Pad the word with '_' so word-boundary n-grams are captured
    ngram_tuples = ngrams(word, n, pad_left=True, pad_right=True,
                          left_pad_symbol='_', right_pad_symbol='_')
    # Join each tuple of characters into a string, e.g. ('t', 'h') -> 'th'
    return [''.join(ngram_tuple) for ngram_tuple in ngram_tuples]
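For example, with n=2 the word is padded with a single '_' on each side:
get_ngrams('test', 2)
# ['_t', 'te', 'es', 'st', 't_']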
# Feature extractor
def get_ngram_features(sentence_tokens):
    features = {}
    # Map each n-gram order to its feature-name prefix
    prefixes = {1: 'char', 2: 'bigram', 3: 'trigram', 4: 'quadrigram'}
    for n, prefix in prefixes.items():
        for word in sentence_tokens:
            for ngram in get_ngrams(word, n):
                key = f'{prefix}({ngram})'
                features[key] = features.get(key, 0) + 1
    return features
get_ngram_features(['test', 'sentence'])
Returns:
{'char(c)': 1,
'char(e)': 4,
'char(n)': 2,
'char(s)': 2,
'char(t)': 3,
'bigram(_s)': 1,
'bigram(_t)': 1,
'bigram(ce)': 1,
'bigram(e_)': 1,
'bigram(en)': 2,
'bigram(es)': 1,
'bigram(nc)': 1,
'bigram(nt)': 1,
'bigram(se)': 1,
'bigram(st)': 1,
'bigram(t_)': 1,
'bigram(te)': 2,
'quadrigram(_sen)': 1,
'quadrigram(_tes)': 1,
'quadrigram(ence)': 1,
'quadrigram(ente)': 1,
'quadrigram(est_)': 1,
'quadrigram(nce_)': 1,
'quadrigram(nten)': 1,
'quadrigram(sent)': 1,
'quadrigram(tenc)': 1,
'quadrigram(test)': 1,
'trigram(_se)': 1,
'trigram(_te)': 1,
'trigram(ce_)': 1,
'trigram(enc)': 1,
'trigram(ent)': 1,
'trigram(est)': 1,
'trigram(nce)': 1,
'trigram(nte)': 1,
'trigram(sen)': 1,
'trigram(st_)': 1,
'trigram(ten)': 1,
'trigram(tes)': 1}
It's easier to use other libraries for this purpose. You can do something like https://www.kaggle.com/alvations/basic-nlp-with-nltk with sklearn, using a custom analyzer, e.g. CountVectorizer(analyzer=sent_process)
For example:
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk import everygrams

def sent_process(sent):
    # Character n-grams of orders 1-4, with '_' marking word boundaries;
    # n-grams that span a space or newline are dropped
    return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4)
            if ' ' not in ng and '\n' not in ng and ng != ('_',)]

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."
sent3 = 'Mr brown quickly jumps over the lazy dog.'
sent4 = 'The brown quickly jumps over the lazy fox.'

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Override the analyzer totally with our preprocess text
    count_vect = CountVectorizer(analyzer=sent_process)
    count_vect.fit_transform(fin)
count_vect.vocabulary_

train_set = count_vect.fit_transform([sent1, sent2])

# To train the classifier
clf = MultinomialNB()
clf.fit(train_set, ['pos', 'neg'])

test_set = count_vect.transform([sent3, sent4])
clf.predict(test_set)
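To see something like NLTK's show_most_informative_features() with this setup, you can rank the n-grams by the learned per-class log probabilities. A sketch (get_feature_names_out() assumes sklearn >= 1.0; older versions use get_feature_names()):
import numpy as np

feature_names = np.array(count_vect.get_feature_names_out())
for label, log_probs in zip(clf.classes_, clf.feature_log_prob_):
    # Indices of the 5 n-grams with the highest probability for this class
    top = np.argsort(log_probs)[-5:][::-1]
    print(label, feature_names[top])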
Firstly, there's really no need to explicitly prepend the char(...), unigram(...), bigram(...), trigram(...) and quadrigram(...) labels to the features. The features are just dictionary keys, so you can use the actual ngram tuples as keys, e.g.
from collections import Counter
from nltk import ngrams, word_tokenize
features = Counter(ngrams(word_tokenize('This is a something foo foo bar foo foo sentence'), 2))
[out]:
>>> features
Counter({('This', 'is'): 1,
('a', 'something'): 1,
('bar', 'foo'): 1,
('foo', 'bar'): 1,
('foo', 'foo'): 2,
('foo', 'sentence'): 1,
('is', 'a'): 1,
('something', 'foo'): 1})
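Since Counter is a dict subclass, such counts can be fed straight to NLTK as a featureset. A sketch, where labeled_sentences is a hypothetical list of (sentence, language) pairs:
train = [(Counter(ngrams(word_tokenize(sent), 2)), lang)
         for sent, lang in labeled_sentences]
# Caveat: the counts will still be treated as nominal values by
# NLTK's NaiveBayesClassifier (see the note at the end)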
As for ngrams of several orders, you can use everygrams(), e.g.
from nltk import everygrams
sent = word_tokenize('This is a something foo foo bar foo foo sentence')
Counter(everygrams(sent, 1, 4))
[out]:
Counter({('This',): 1,
('This', 'is'): 1,
('This', 'is', 'a'): 1,
('This', 'is', 'a', 'something'): 1,
('a',): 1,
('a', 'something'): 1,
('a', 'something', 'foo'): 1,
('a', 'something', 'foo', 'foo'): 1,
('bar',): 1,
('bar', 'foo'): 1,
('bar', 'foo', 'foo'): 1,
('bar', 'foo', 'foo', 'sentence'): 1,
('foo',): 4,
('foo', 'bar'): 1,
('foo', 'bar', 'foo'): 1,
('foo', 'bar', 'foo', 'foo'): 1,
('foo', 'foo'): 2,
('foo', 'foo', 'bar'): 1,
('foo', 'foo', 'bar', 'foo'): 1,
('foo', 'foo', 'sentence'): 1,
('foo', 'sentence'): 1,
('is',): 1,
('is', 'a'): 1,
('is', 'a', 'something'): 1,
('is', 'a', 'something', 'foo'): 1,
('sentence',): 1,
('something',): 1,
('something', 'foo'): 1,
('something', 'foo', 'foo'): 1,
('something', 'foo', 'foo', 'bar'): 1})
A clean way to extract the features you want:
def sent_vectorizer(sent):
    return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4)
            if ' ' not in ng and ng != ('_',)]
Counter(sent_vectorizer('This is a something foo foo bar foo foo sentence'))
[out]:
Counter({'o': 9, 's': 4, 'e': 4, 'f': 4, '_f': 4, 'fo': 4, 'oo': 4, 'o_': 4, '_fo': 4, 'foo': 4, 'oo_': 4, '_foo': 4, 'foo_': 4, 'i': 3, 'n': 3, 'h': 2, 'a': 2, 't': 2, 'hi': 2, 'is': 2, 's_': 2, '_s': 2, 'en': 2, 'is_': 2, 'T': 1, 'm': 1, 'g': 1, 'b': 1, 'r': 1, 'c': 1, 'Th': 1, '_i': 1, '_a': 1, 'a_': 1, 'so': 1, 'om': 1, 'me': 1, 'et': 1, 'th': 1, 'in': 1, 'ng': 1, 'g_': 1, '_b': 1, 'ba': 1, 'ar': 1, 'r_': 1, 'se': 1, 'nt': 1, 'te': 1, 'nc': 1, 'ce': 1, 'Thi': 1, 'his': 1, '_is': 1, '_a_': 1, '_so': 1, 'som': 1, 'ome': 1, 'met': 1, 'eth': 1, 'thi': 1, 'hin': 1, 'ing': 1, 'ng_': 1, '_ba': 1, 'bar': 1, 'ar_': 1, '_se': 1, 'sen': 1, 'ent': 1, 'nte': 1, 'ten': 1, 'enc': 1, 'nce': 1, 'This': 1, 'his_': 1, '_is_': 1, '_som': 1, 'some': 1, 'omet': 1, 'meth': 1, 'ethi': 1, 'thin': 1, 'hing': 1, 'ing_': 1, '_bar': 1, 'bar_': 1, '_sen': 1, 'sent': 1, 'ente': 1, 'nten': 1, 'tenc': 1, 'ence': 1})
Unfortunately, there's no easy way to change the hardcoded manner in which NLTK's NaiveBayesClassifier works. If we look at https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L185, we see that behind the scenes NLTK is already counting the occurrences of the features. But note that it counts document frequency, not term frequency, i.e. regardless of how many times an element appears in the document, it counts as one. There isn't a clean way to add the value of each feature without changing the NLTK code, since it's hardcoded to do += 1 (https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L201).
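If you must stay with NLTK's NaiveBayesClassifier, one workaround (my suggestion, not something NLTK provides) is to discretize the counts into a few coarse buckets, so the nominal treatment still separates "rare" from "frequent":
def bucket(count):
    # Coarse ordinal buckets: '0', '1', '2', '3+'; NLTK still treats them
    # as nominal values, but they generalize better than raw integers
    return str(count) if count < 3 else '3+'

bucketed = {name: bucket(value)
            for name, value in get_ngram_features(['test', 'sentence']).items()}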