Tags: regex, python-3.x, sentence

How can I identify sentences within a text?


I have text that looks like this:

"I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "

Here, "ASP.NET" and "Node.js" should each be treated as a single word, not split at the period. Also, there is no space before "But I...", yet it should still be treated as a separate sentence.

The expected output is:

["I am an engineer"," I am skilled in ASP.NET","I also know Node.js","But I don't have much experience"]

Is there a way of doing this?


Solution

  • For your current input, you can use re.split() with a regex pattern built around a lookahead:

    import re
    
    s = "I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "
    result = re.split(r'\.(?=\s?[A-Z][^.]*? )', s)
    
    print(result)
    

    The output:

    ['I am an engineer', ' I am skilled in ASP.NET', ' I also know Node.js', "But I don't have much experience. "]
    

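    (?=\s?[A-Z][^.]*? ) - positive lookahead assertion: the sentence delimiter . only counts as a split point when it is followed by optional whitespace, an uppercase letter, and then a space that occurs before the next period, i.e. by the start of the next sentence. The period in "ASP.NET" is not a split point because "NET" runs straight into another period with no space in between, and the period in "Node.js" is not one because it is followed by a lowercase letter.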
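
The result above still carries leading spaces and a trailing ". " on the last item, while the expected output in the question has neither. A minimal sketch of one way to tidy this up, reusing the same pattern; the helper name split_sentences is hypothetical and not part of the original answer:

```python
import re

# Same pattern as in the answer: a period followed by optional whitespace,
# an uppercase letter, and a space that occurs before the next period.
SENTENCE_BOUNDARY = re.compile(r'\.(?=\s?[A-Z][^.]*? )')

def split_sentences(text):
    """Hypothetical helper: split on sentence boundaries and tidy each piece."""
    cleaned = []
    for part in SENTENCE_BOUNDARY.split(text):
        # Trim surrounding whitespace left over from the split.
        part = part.strip()
        # The final sentence keeps its period because nothing follows it; drop it.
        if part.endswith('.'):
            part = part[:-1]
        # Skip empty pieces (e.g. from an empty input string).
        if part:
            cleaned.append(part)
    return cleaned

s = "I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "
sentences = split_sentences(s)
print(sentences)
```

Like the pattern it wraps, this assumes every sentence starts with an uppercase letter and contains at least one space before its next period, so a one-word sentence such as "Yes." would stay attached to the sentence before it.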