Tags: python, machine-learning, nltk, text-classification, n-gram

NLTK classifier for integer features?


I have integer-valued features in my feature vector, but NLTK’s NaiveBayesClassifier treats them as nominal values.

Context

I am trying to build a language classifier using n-grams. For instance, the bigram ‘th’ is more common in English than in French.

For each sentence in my training set, I extract features such as bigram(th): 5, where 5 (as an example) is the number of times the bigram ‘th’ appeared in the sentence.

When I build a classifier with features like this and inspect the most informative features, I realize that the classifier does not treat the counts as ordered numbers. For example, it might consider bigram(ea): 4 as French, bigram(ea): 5 as English and bigram(ea): 6 as French again. This is quite arbitrary and does not capture the fact that a bigram is simply more common in one language than in the other. This is why I need the integers to be treated as integers.
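
To make this concrete, here is a minimal sketch of the behaviour, with toy data and counts invented for illustration:

import nltk

# Toy training set: each distinct count of 'ea' becomes its own nominal value
train = [({'bigram(ea)': 4}, 'french'),
         ({'bigram(ea)': 5}, 'english'),
         ({'bigram(ea)': 6}, 'french')]
classifier = nltk.NaiveBayesClassifier.train(train)

# An unseen count such as 7 matches none of the training values, so the
# prediction falls back toward the label prior instead of following the
# numeric trend
print(classifier.classify({'bigram(ea)': 7}))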

More thoughts

Of course, I could replace these features with boolean features such as has(th): True. However, I believe this is a bad idea: a French sentence with 1 instance of 'th' and an English sentence with 5 instances of 'th' would both get the feature has(th): True, which cannot differentiate them.
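
For illustration, with two hypothetical sentences the featuresets collapse to the same dictionary:

# Hypothetical featuresets: boolean features erase the count difference
french_feats = {'has(th)': True}    # French sentence, 1 occurrence of 'th'
english_feats = {'has(th)': True}   # English sentence, 5 occurrences of 'th'
assert french_feats == english_feats  # indistinguishable to the classifier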

I also found this relevant link but it did not provide me with the answer.

Feature Extractor

My feature extractor looks like this:

from nltk import ngrams

def get_ngrams(word, n):
    # Pad the word so edge characters yield boundary n-grams such as '_t' and 't_'
    ngram_tuples = ngrams(word, n, pad_left=True, pad_right=True,
                          left_pad_symbol='_', right_pad_symbol='_')
    # Join each character tuple into a single string, e.g. ('t', 'h') -> 'th'
    return [''.join(ngram_tuple) for ngram_tuple in ngram_tuples]
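
# For example (consistent with the bigram counts shown further below):
#   get_ngrams('test', 2)  ->  ['_t', 'te', 'es', 'st', 't_']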

# Feature extractor: counts character n-grams of orders 1 to 4 per sentence
def get_ngram_features(sentence_tokens):
    features = {}
    labels = {1: 'char', 2: 'bigram', 3: 'trigram', 4: 'quadrigram'}
    for n, label in labels.items():
        for word in sentence_tokens:
            for ngram in get_ngrams(word, n):
                key = f'{label}({ngram})'
                features[key] = features.get(key, 0) + 1
    return features

Feature Extraction Example

get_ngram_features(['test', 'sentence'])

Returns:

{'char(c)': 1,
 'char(e)': 4,
 'char(n)': 2,
 'char(s)': 2,
 'char(t)': 3,
 'bigram(_s)': 1,
 'bigram(_t)': 1,
 'bigram(ce)': 1,
 'bigram(e_)': 1,
 'bigram(en)': 2,
 'bigram(es)': 1,
 'bigram(nc)': 1,
 'bigram(nt)': 1,
 'bigram(se)': 1,
 'bigram(st)': 1,
 'bigram(t_)': 1,
 'bigram(te)': 2,
 'quadrigram(_sen)': 1,
 'quadrigram(_tes)': 1,
 'quadrigram(ence)': 1,
 'quadrigram(ente)': 1,
 'quadrigram(est_)': 1,
 'quadrigram(nce_)': 1,
 'quadrigram(nten)': 1,
 'quadrigram(sent)': 1,
 'quadrigram(tenc)': 1,
 'quadrigram(test)': 1,
 'trigram(_se)': 1,
 'trigram(_te)': 1,
 'trigram(ce_)': 1,
 'trigram(enc)': 1,
 'trigram(ent)': 1,
 'trigram(est)': 1,
 'trigram(nce)': 1,
 'trigram(nte)': 1,
 'trigram(sen)': 1,
 'trigram(st_)': 1,
 'trigram(ten)': 1,
 'trigram(tes)': 1}

Solution

  • TL;DR

    It's easier to use another library for this. Following https://www.kaggle.com/alvations/basic-nlp-with-nltk, you can use sklearn with a custom analyzer, e.g. CountVectorizer(analyzer=sent_process), which keeps the integer counts as counts.

    For example:

    from io import StringIO
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from nltk import everygrams

    def sent_process(sent):
        # Character n-grams of orders 1-4, with '_' marking word boundaries;
        # skip n-grams that span a space/newline and the bare boundary symbol
        return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4)
                if ' ' not in ng and '\n' not in ng and ng != ('_',)]

    sent1 = "The quick brown fox jumps over the lazy brown dog."
    sent2 = "Mr brown jumps over the lazy fox."
    sent3 = 'Mr brown quickly jumps over the lazy dog.'
    sent4 = 'The brown quickly jumps over the lazy fox.'

    with StringIO('\n'.join([sent1, sent2])) as fin:
        # Override the analyzer entirely with our preprocessing function
        count_vect = CountVectorizer(analyzer=sent_process)
        count_vect.fit_transform(fin)
    count_vect.vocabulary_  # inspect the learned n-gram -> column-index mapping

    train_set = count_vect.fit_transform([sent1, sent2])

    # Train the classifier on the n-gram count matrix
    clf = MultinomialNB()
    clf.fit(train_set, ['pos', 'neg'])

    test_set = count_vect.transform([sent3, sent4])
    clf.predict(test_set)
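
    As a side note, and not part of the original recipe: if you miss NLTK's show_most_informative_features(), something similar can be read off the fitted MultinomialNB, as in this sketch:

    import numpy as np

    # Rank n-grams by the difference in log P(ngram | class) between the two
    # fitted classes -- a rough analogue of "most informative features"
    feature_names = count_vect.get_feature_names_out()  # scikit-learn >= 1.0
    log_odds = clf.feature_log_prob_[0] - clf.feature_log_prob_[1]
    top = np.argsort(log_odds)[-10:]  # most indicative of clf.classes_[0]
    print([feature_names[i] for i in top])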
    

    Cut-away

    Firstly, there's really no need to explicitly label the features with the char(...), bigram(...), trigram(...) and quadrigram(...) prefixes.

    The feature set is just a dictionary, and you can use the actual n-gram tuples as keys, e.g.

    from collections import Counter
    from nltk import ngrams, word_tokenize
    
    features = Counter(ngrams(word_tokenize('This is a something foo foo bar foo foo sentence'), 2))
    

    [out]:

    >>> features
    Counter({('This', 'is'): 1,
             ('a', 'something'): 1,
             ('bar', 'foo'): 1,
             ('foo', 'bar'): 1,
             ('foo', 'foo'): 2,
             ('foo', 'sentence'): 1,
             ('is', 'a'): 1,
             ('something', 'foo'): 1})
    

    As for n-grams of several orders, you can use everygrams(), e.g.

    from nltk import everygrams
    
    sent = word_tokenize('This is a something foo foo bar foo foo sentence')
    Counter(everygrams(sent, 1, 4))
    

    [out]:

    Counter({('This',): 1,
             ('This', 'is'): 1,
             ('This', 'is', 'a'): 1,
             ('This', 'is', 'a', 'something'): 1,
             ('a',): 1,
             ('a', 'something'): 1,
             ('a', 'something', 'foo'): 1,
             ('a', 'something', 'foo', 'foo'): 1,
             ('bar',): 1,
             ('bar', 'foo'): 1,
             ('bar', 'foo', 'foo'): 1,
             ('bar', 'foo', 'foo', 'sentence'): 1,
             ('foo',): 4,
             ('foo', 'bar'): 1,
             ('foo', 'bar', 'foo'): 1,
             ('foo', 'bar', 'foo', 'foo'): 1,
             ('foo', 'foo'): 2,
             ('foo', 'foo', 'bar'): 1,
             ('foo', 'foo', 'bar', 'foo'): 1,
             ('foo', 'foo', 'sentence'): 1,
             ('foo', 'sentence'): 1,
             ('is',): 1,
             ('is', 'a'): 1,
             ('is', 'a', 'something'): 1,
             ('is', 'a', 'something', 'foo'): 1,
             ('sentence',): 1,
             ('something',): 1,
             ('something', 'foo'): 1,
             ('something', 'foo', 'foo'): 1,
             ('something', 'foo', 'foo', 'bar'): 1})
    

    A clean way to extract the features you want:

    def sent_vectorizer(sent):
        return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4) 
                if ' ' not in ng and ng != ('_',)]
    Counter(sent_vectorizer('This is a something foo foo bar foo foo sentence'))
    

    [out]:

    Counter({'o': 9, 's': 4, 'e': 4, 'f': 4, '_f': 4, 'fo': 4, 'oo': 4, 'o_': 4, '_fo': 4, 'foo': 4, 'oo_': 4, '_foo': 4, 'foo_': 4, 'i': 3, 'n': 3, 'h': 2, 'a': 2, 't': 2, 'hi': 2, 'is': 2, 's_': 2, '_s': 2, 'en': 2, 'is_': 2, 'T': 1, 'm': 1, 'g': 1, 'b': 1, 'r': 1, 'c': 1, 'Th': 1, '_i': 1, '_a': 1, 'a_': 1, 'so': 1, 'om': 1, 'me': 1, 'et': 1, 'th': 1, 'in': 1, 'ng': 1, 'g_': 1, '_b': 1, 'ba': 1, 'ar': 1, 'r_': 1, 'se': 1, 'nt': 1, 'te': 1, 'nc': 1, 'ce': 1, 'Thi': 1, 'his': 1, '_is': 1, '_a_': 1, '_so': 1, 'som': 1, 'ome': 1, 'met': 1, 'eth': 1, 'thi': 1, 'hin': 1, 'ing': 1, 'ng_': 1, '_ba': 1, 'bar': 1, 'ar_': 1, '_se': 1, 'sen': 1, 'ent': 1, 'nte': 1, 'ten': 1, 'enc': 1, 'nce': 1, 'This': 1, 'his_': 1, '_is_': 1, '_som': 1, 'some': 1, 'omet': 1, 'meth': 1, 'ethi': 1, 'thin': 1, 'hing': 1, 'ing_': 1, '_bar': 1, 'bar_': 1, '_sen': 1, 'sent': 1, 'ente': 1, 'nten': 1, 'tenc': 1, 'ence': 1})
    

    In Long

    Unfortunately, there's no easy way to change how the NaiveBayesClassifier in NLTK works; its behaviour is hardcoded.

    If we look at https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L185 , we see that behind the scenes NLTK is already counting occurrences of the features.

    But note that it's counting document frequency, not term frequency: regardless of how many times an element appears in a document, it counts as one. There isn't a clean way to add the value of each feature without changing the NLTK code, since it's hardcoded to do += 1, see https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L201
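
    Finally, a workaround sketch that goes beyond the original answer: if you want to keep NLTK's dict-of-features interface while having counts treated numerically, nltk.classify.SklearnClassifier wraps a scikit-learn estimator and vectorizes the feature dicts with DictVectorizer, which preserves integer values:

    from collections import Counter
    from nltk.classify import SklearnClassifier
    from sklearn.naive_bayes import MultinomialNB

    # Reusing sent_vectorizer() and the sentences defined above
    train_data = [(Counter(sent_vectorizer(sent1)), 'pos'),
                  (Counter(sent_vectorizer(sent2)), 'neg')]
    clf = SklearnClassifier(MultinomialNB()).train(train_data)
    print(clf.classify(Counter(sent_vectorizer(sent3))))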