nlp, nltk, naivebayes, sentence, nltk-trainer

nltk.org example of sentence segmentation with a Naive Bayes classifier: how does .sents() separate sentences, and how does the ML algorithm improve on it?


There is an example in the NLTK book at nltk.org (chapter 6) where a Naive Bayes classifier is used to classify a punctuation symbol as ending a sentence or not.

This is what they do: first they take a corpus and use the .sents() method to get its sentences, and from them they build an index of the positions of the punctuation symbols that separate them (the boundaries).
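The indexing step can be sketched as follows. The two hand-written sentences are a hypothetical stand-in for the corpus output, an assumption made so that the snippet runs without downloading any corpus; in the book the sentences come from a corpus reader.

```python
# Hypothetical stand-in for the corpus sentences: each sentence is a list of tokens.
sents = [
    ["Mr", ".", "Smith", "arrived", "."],
    ["He", "sat", "down", "."],
]

tokens = []          # flat list of every token in the text
boundaries = set()   # indices into `tokens` of the sentence-final tokens
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)  # index of the last token of this sentence

print(tokens)
print(sorted(boundaries))
```

Note that the period after "Mr" (index 1) is not in the index, while the two sentence-final periods (indices 4 and 8) are.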

Then they "tokenize" the text (convert it to a list of words and punctuation symbols) and apply the following function to each punctuation token, which returns a dictionary of features:

def punct_features(tokens, i):
    return {'nextWordCapitalized': tokens[i+1][0].isupper(),
        'prevWord': tokens[i-1].lower(),
        'punct': tokens[i],
        'prevWordis1Char': len(tokens[i-1]) == 1}

These features are used by the ML algorithm to classify the punctuation symbol as ending a sentence or not (i.e., as a boundary token).
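To see what the classifier actually receives, here is a small sketch that runs the feature function on a hypothetical token list (not taken from the corpus). The function is repeated so the snippet runs on its own.

```python
def punct_features(tokens, i):
    # Same feature extractor as in the question, repeated for self-containment.
    return {'nextWordCapitalized': tokens[i+1][0].isupper(),
            'prevWord': tokens[i-1].lower(),
            'punct': tokens[i],
            'prevWordis1Char': len(tokens[i-1]) == 1}

# Hypothetical token list for illustration only.
tokens = ["Mr", ".", "Smith", "arrived", ".", "He", "sat", "down", "."]

abbrev = punct_features(tokens, 1)    # the period after "Mr"
real_end = punct_features(tokens, 4)  # the period after "arrived"
print(abbrev)
print(real_end)
```

Both periods are followed by a capitalized word, so the two feature sets differ only in 'prevWord' ('mr' versus 'arrived'). The function reads tokens[i+1], so it cannot be called on the very last token, which is why the loop in the question stops at len(tokens)-1.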

With this function and the 'boundaries' index, they select all the punctuation tokens, each with its features, and label each one True (a boundary) or False (not a boundary), thus creating a list of labeled feature sets:

featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!;']
print(featuresets[:4])

This is an example of the output we could get when printing the first four sets:

[({'nextWordCapitalized': False, 'prevWord': 'nov', 'punct': '.', 'prevWordis1Char': False}, False), 
({'nextWordCapitalized': True, 'prevWord': '29', 'punct': '.', 'prevWordis1Char': False}, True), 
({'nextWordCapitalized': True, 'prevWord': 'mr', 'punct': '.', 'prevWordis1Char': False}, False), 
({'nextWordCapitalized': True, 'prevWord': 'n', 'punct': '.', 'prevWordis1Char': True}, False)]

With this, they train and evaluate the punctuation classifier:

import nltk

size = int(len(featuresets) * 0.1)
# Hold out the first 10% for testing and train on the remaining 90%.
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

Now, (1) how and what would such an ML algorithm improve? I can't grasp how it could do better than the simple rule that just checks whether the token after the punctuation symbol is capitalized and the one before it is lowercase. Indeed, that rule is what is used to validate that a symbol is a boundary...! And if it doesn't improve on it, what could it possibly be useful for?

And related to this: (2) is either of these two algorithms how NLTK really separates sentences? I mean, especially if the best one is the first simple one, does NLTK take a sentence to be just the text between two punctuation symbols that are each preceded by a lowercase word and followed by a word whose first character is uppercase? Is this what the .sents() method does? Notice that this is far from how linguistics, or rather the Oxford dictionary, defines a sentence:

"A set of words that is complete in itself, typically containing a subject and predicate, conveying a statement, question, exclamation, or command, and consisting of a main clause and sometimes one or more subordinate clauses."

Or (3) are the raw corpus texts like treebank or brown already divided into sentences manually? In that case, what is the criterion used to divide them?


Solution

  • Question (1): The NLTK book perhaps did not make it clear, but sentence segmentation is a difficult problem. As you said, we can start with the assumption that a punctuation mark ends a sentence when the previous character is lowercase, the current character is punctuation, and the next character is uppercase (by the way, there are spaces in between, don't forget!). However, consider this text:

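    "Mr. Peter works at a company called A.B.C. Inc. in Toronto. His net salary per month is $2344.21. 22 years ago, he came to Toronto as an immigrant." Going by our rule above, how will this be split? A classifier trained on labeled boundaries can pick up regularities that a fixed capitalization rule cannot, for example that a period after "mr" rarely ends a sentence.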

    The Wikipedia page on sentence boundary disambiguation illustrates a few more of these issues. The NLP textbook "Speech and Language Processing" by Jurafsky and Martin also has a chapter on text normalization, with more examples of why word and sentence segmentation can be challenging; it could help you get an idea of this. I am assuming we are discussing English segmentation, but other languages clearly raise other issues (e.g., some languages have no capitalization).
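The simple rule can be applied mechanically to the example text to see where it goes wrong. This is a minimal sketch using only the standard library; the regular expression is my own encoding of the rule (lowercase letter, then punctuation, then whitespace, then uppercase letter), not code from NLTK or the book.

```python
import re

text = ("Mr. Peter works at a company called A.B.C. Inc. in Toronto. "
        "His net salary per month is $2344.21. "
        "22 years ago, he came to Toronto as an immigrant.")

# Split on whitespace that follows "lowercase letter + . ? or !"
# and precedes an uppercase letter.
naive_rule = re.compile(r"(?<=[a-z][.?!])\s+(?=[A-Z])")
pieces = naive_rule.split(text)

for piece in pieces:
    print(repr(piece))
```

The rule produces three pieces, but they are the wrong ones: it splits after "Mr." and fails to split after "$2344.21.", because the character before that period is a digit and the next sentence starts with a digit.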

    Question (2): is either of these two algorithms how NLTK really separates sentences? NLTK uses an unsupervised sentence segmentation method, implemented in PunktSentenceTokenizer.
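For reference, here is a minimal sketch of calling that class directly. It assumes NLTK is installed and constructs the tokenizer without any training text, so it knows no abbreviations; the pre-trained English model behind nltk.sent_tokenize needs a separate data download and will generally segment differently. No expected output is shown because it depends on the parameters the tokenizer was built with.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = ("Mr. Peter works at a company called A.B.C. Inc. in Toronto. "
        "His net salary per month is $2344.21. "
        "22 years ago, he came to Toronto as an immigrant.")

# Untrained tokenizer with default parameters (an assumption made to avoid
# any data download). Passing a large raw text to the constructor instead
# would run Punkt's unsupervised training on that text.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(repr(sentence))
```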

    Question (3): are the raw corpus texts like treebank or brown already divided into sentences manually? Yes, these were manually divided into sentences. They are common corpora used in NLP for developing linguistic tools such as POS taggers, parsers, etc. One reason for choosing them could be that they are already available within NLTK, so we don't have to look for another human-annotated corpus to do supervised learning of sentence boundary detection.