Tags: python, nltk, text-segmentation

How to segment text into sub-sentences based on enumerators?


I am segmenting a text into sentences in Python using NLTK's PunktSentenceTokenizer(). However, many long sentences contain enumerated items, and in those cases I need to extract each item as its own sub-sentence.

Example:

The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX. 

The required output would be:

"The api allows the user to achieve following goals aXXXXXX.", "The api allows the user to achieve following goals bXXXX." and "The api allows the user to achieve following goals cXXXXX."

How can I achieve this goal?


Solution

  • To get the sub-sequences you could use a RegExp Tokenizer.

    An example of how to use it to split the sentence:

    from nltk.tokenize.regexp import regexp_tokenize

    str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'

    # Split on enumerators such as "(a)", "(b)", ... and keep the text between them.
    parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)

    # The first chunk is the common sentence stem; drop the trailing colon.
    start_of_sentence = parts.pop(0).rstrip(': ')

    for part in parts:
        # Strip stray spaces, commas and the final period before re-attaching the stem.
        print(" ".join((start_of_sentence, part.strip(' ,.') + '.')))