I am segmenting a text into sentences in Python using NLTK's PunktSentenceTokenizer(). However, many long sentences contain an enumerated list, and in those cases I need to extract each sub-sentence.
Example:
The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.
The required output would be:
"The api allows the user to achieve following goals aXXXXXX.",
"The api allows the user to achieve following goals bXXXX."
and "The api allows the user to achieve following goals cXXXXX."
How can I achieve this goal?
To get the sub-sentences, you could use NLTK's RegexpTokenizer.
An example of how to use it to split the sentence could look like this:
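from nltk.tokenize.regexp import regexp_tokenize

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
# gaps=True treats the pattern as the separator, so the text is split at "(a)", "(b)", "(c)"
parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)
# The first part is the shared beginning of the sentence; the rest are the enumerated items
start_of_sentence = parts.pop(0)
for part in parts:
    print(" ".join((start_of_sentence, part)))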
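Note that the loop above prints the parts as they were split, so the colon after "goals" and the commas between items are still there. If you want the output exactly as in the question, something like the following might work. It is only a minimal sketch using the standard `re` module instead of NLTK; the helper name `expand_enumeration` is made up for illustration, and it assumes the shared beginning comes before the first marker and that markers are a single letter or digit in parentheses.

```python
import re

# Matches enumeration markers such as "(a)", "(b)" or "(1)" plus trailing whitespace.
MARKER = re.compile(r'\(\w\)\s*')

def expand_enumeration(sentence):
    """Return one sentence per enumerated item, each prefixed with the shared beginning."""
    # Split at the markers and drop empty or whitespace-only pieces.
    parts = [p for p in MARKER.split(sentence) if p.strip()]
    if len(parts) < 2:
        # No enumeration found: return the sentence unchanged.
        return [sentence]
    # Remove the trailing colon and spaces from the shared beginning.
    lead_in = parts[0].rstrip(" :")
    # Strip separators from each item and close it with a full stop.
    return [f"{lead_in} {item.strip(' ,;.')}." for item in parts[1:]]

text = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
for s in expand_enumeration(text):
    print(s)
```

Be aware that the pattern would also match things like "user(s)", so you may need to tighten it for real text.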