Tags: python, parsing, nlp

Accurately splitting sentences


My program takes a text file and splits it into a list of sentences using split('.'), so it splits wherever it finds a full stop. This can be inaccurate.

For example:

text = 'i love carpets. In fact i own 2.4 km of the stuff.'

Output:

listOfSentences = ['i love carpets', ' In fact i own 2', '4 km of the stuff', '']

Desired output:

listOfSentences = ['i love carpets', 'In fact i own 2.4 km of the stuff']

My question is: how do I split at the end of each sentence rather than at every full stop?


Solution

  • If full stops appear both inside tokens (such as "2.4" or "i.e.") and at the end of sentences, where they are followed by whitespace, you can try a regex:

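    import re
    
    text = "your text here. i.e. something."
    # Split on whitespace that follows "." or "?", unless the preceding
    # characters look like an abbreviation such as "i.e." or "Mr."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)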
    

    source: Python - RegEx for splitting text into sentences (sentence-tokenizing)
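
To connect this back to the question, here is a minimal sketch that applies the same pattern to the carpet example. The helper name `split_sentences` and the step that strips the trailing "." or "?" are assumptions added for illustration; the original answer only shows the `re.split` call.

```python
import re

# The lookbehind pattern quoted in the answer above, compiled once for reuse.
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def split_sentences(text):
    """Split at whitespace that follows '.' or '?', then drop the final punctuation.

    Hypothetical helper: stripping the trailing punctuation is an assumption
    made so the result has no full stops at the end of each sentence.
    """
    parts = SENTENCE_BOUNDARY.split(text.strip())
    return [part.rstrip('.?') for part in parts if part]

text = 'i love carpets. In fact i own 2.4 km of the stuff.'
list_of_sentences = split_sentences(text)
print(list_of_sentences)
# ['i love carpets', 'In fact i own 2.4 km of the stuff']
```

The "2.4" survives because the pattern only splits on whitespace, and there is none after that full stop. The pattern is a heuristic, though: an abbreviation such as "Mrs." followed by a space is not covered by either negative lookbehind, so it would still be treated as a sentence end.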