How to sentence-tokenize on semicolons using nltk


Hi everybody, I am new to text processing, so any help is appreciated. I am trying to split text into sentences at both periods and semicolons. How can I do this using nltk.sent_tokenize?

For example, suppose we have this text and tokenize it into sentences:

sentence = "The court ruled out the judgement; First round proceedings; The court declared unjustfied. Case proceedings were carried out in the morning. Objections were raised;"

sentences = nltk.sent_tokenize(sentence)

The desired value of sentences is:

[
 "The court ruled out the judgement;",
 "First round proceedings;",
 "The court declared unjustfied.",
 "Case proceedings were carried out in the morning.",
 "Objections were raised;"
]

Solution

  • This has been addressed in a previous post. You can do it by overriding sent_end_chars in the language variables (PunktLanguageVars) used by nltk's default sentence tokenizer (PunktSentenceTokenizer), as below:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
    
    # Subclass the language variables so ';' also counts as a sentence-ending character
    class MyLangVars(PunktLanguageVars):
        sent_end_chars = ('.', '?', '!', ';')

    # Create a tokenizer that uses the new language variables
    tokenizer = PunktSentenceTokenizer(lang_vars=MyLangVars())
    
    # Tokenize text
    sentence = "The court ruled out the judgement; First round proceedings; The court declared unjustfied. Case proceedings were carried out in the morning. Objections were raised;"
    sentences = tokenizer.tokenize(sentence)
    print(sentences)
    

    This prints the following (line breaks added here for readability):

    [
    'The court ruled out the judgement;', 
    'First round proceedings;', 
    'The court declared unjustfied.', 
    'Case proceedings were carried out in the morning.', 
    'Objections were raised;'
    ]
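
If you only need to cut at a fixed set of terminator characters, a plain regular expression may be enough. The sketch below is not Punkt and does not use nltk at all: it splits at whitespace that follows one of the terminators, so unlike Punkt it makes no attempt to handle abbreviations such as "Dr." or "e.g.". The helper name split_sentences and its terminators parameter are made up for illustration.

```python
import re

def split_sentences(text, terminators=".?!;"):
    """Hypothetical helper: split text at whitespace that follows a terminator.

    This is a naive regex split, not the Punkt algorithm, so it will also
    break after abbreviations that end in a period.
    """
    # The lookbehind keeps the terminator attached to the preceding sentence.
    pattern = r"(?<=[" + re.escape(terminators) + r"])\s+"
    # Drop empty pieces that could arise from leading or trailing whitespace.
    return [piece for piece in re.split(pattern, text.strip()) if piece]

sentence = ("The court ruled out the judgement; First round proceedings; "
            "The court declared unjustfied. Case proceedings were carried "
            "out in the morning. Objections were raised;")
sentences = split_sentences(sentence)
print(sentences)
```

On the example text this gives the same five pieces as the Punkt-based tokenizer above. For real-world prose with abbreviations or decimal numbers, the PunktLanguageVars approach is likely the safer choice.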