Hi everybody, I am new to text processing, so any help is appreciated. I am trying to sentence-tokenize on both periods and semicolons. How can I do this using nltk.sent_tokenize?
For example, if we have this sentence and we sentence-tokenize it:
sentence = "The court ruled out the judgement; First round proceedings; The court declared unjustfied. Case proceedings were carried out in the morning. Objections were raised;"
sentences = nltk.sent_tokenize(sentence)
sentences should be:
[
"The court ruled out the judgement;",
"First round proceedings;",
"The court declared unjustfied.",
"Case proceedings were carried out in the morning.",
"Objections were raised;"
]
This has been addressed in a previous post here. You can do so by updating sent_end_chars in the Punkt sentence tokenizer that NLTK uses by default (PunktSentenceTokenizer), through a subclass of PunktLanguageVars, as below:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
# Create new language variables that also treat ';' as a sentence end
class MyLangVars(PunktLanguageVars):
    sent_end_chars = {'.', '?', '!', ';'}

# Create a tokenizer that uses the new language variables
tokenizer = PunktSentenceTokenizer(lang_vars=MyLangVars())
# Tokenize text
sentence = "The court ruled out the judgement; First round proceedings; The court declared unjustfied. Case proceedings were carried out in the morning. Objections were raised;"
sentences = tokenizer.tokenize(sentence)
print(sentences)
This outputs (shown here one sentence per line):
[
'The court ruled out the judgement;',
'First round proceedings;',
'The court declared unjustfied.',
'Case proceedings were carried out in the morning.',
'Objections were raised;'
]
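If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: make_tokenizer and ExtendedLangVars are hypothetical names, not part of NLTK, and the only NLTK pieces used are PunktLanguageVars and PunktSentenceTokenizer from the snippet above. Only the semicolon case comes from the answer; whether other characters work the same way depends on how Punkt's word tokenizer splits them, so treat anything else as untested. Note also that a tokenizer built this way is untrained, so it presumably lacks the abbreviation knowledge of the pretrained English model that nltk.sent_tokenize loads.

```python
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer


def make_tokenizer(extra_end_chars=(";",)):
    """Hypothetical helper: build a Punkt tokenizer that also ends
    sentences on the characters in extra_end_chars."""
    # Start from the library defaults ('.', '?', '!') instead of retyping them.
    base = tuple(PunktLanguageVars.sent_end_chars)

    class ExtendedLangVars(PunktLanguageVars):
        sent_end_chars = base + tuple(extra_end_chars)

    # No training text is passed, so this tokenizer is untrained.
    return PunktSentenceTokenizer(lang_vars=ExtendedLangVars())


tokenizer = make_tokenizer()
text = ("The court ruled out the judgement; First round proceedings; "
        "The court declared unjustfied.")
for s in tokenizer.tokenize(text):
    print(s)
```

Calling make_tokenizer(()) with an empty tuple should give back the default behaviour, which makes it easy to compare the two on your own text.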