Tags: python, python-3.x, regex, python-re

Fixing sentences: add space after punctuation but not after decimal points or abbreviations


I work with very messy text where sentences are not capitalised and punctuation is not separated properly. I need to add a space where one is missing after punctuation [.,:;)!?], but not inside decimal numbers or abbreviations.

This is an example:

mystring = 'this is my first sentence with (brackets)in it. this is the second?What about this sentence with D.D.T. in it?or this with 4.5?'

This is what I've got so far:

import re

def fix_punctuation(text):
    def sentence_case(text):
        # Split into sentences. Therefore, find all text that ends
        # with punctuation followed by white space or end of string.
        sentences = re.findall(r'[^.!?]+[.!?](?:\s|\Z)', text)

        # Capitalize the first letter of each sentence
        sentences = [x[0].upper() + x[1:] for x in sentences]

        # Combine sentences
        return ''.join(sentences)
    
    #add space after punctuation
    text = re.sub('([.,;:!?)])', r'\1 ', text)
    #capitalize sentences
    text = sentence_case(text)
    
    return text

Which gives me this output:

'This is my first sentence with (brackets) in it.  this is the second? What about this sentence with D. D. T.  in it? Or this with 4. 5? '

I tried the methods suggested here and here, but they didn't work in my case. Regex makes my brain hurt, so I would really appreciate your help.


Solution

  • I understand you want to ignore the periods inside decimal numbers and inside period-separated chunks of single uppercase letters (abbreviations such as D.D.T), together with an optional period right after such a chunk.

    Here is a code snippet that implements this logic:

    import re
    
    mystring = 'this is my first sentence with (brackets)in it. this is the second?What about this sentence with D.D.T. in it?or this with 4.5?'
    
    def fix_punctuation(text):
        def sentence_case(text):
            # Split into sentences. Therefore, find all text that ends
            # with punctuation followed by white space or end of string.
            sentences = re.findall(r'(?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+[.!?](?:\s|\Z)', text)
    
            # Capitalize the first letter of each sentence
            sentences = [x[0].upper() + x[1:] for x in sentences]
    
            # Combine sentences
            return ''.join(sentences)
        
        #add space after punctuation
        text = re.sub(r'(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*', lambda x: x.group(1) or f'{x.group(2)} ', text)
        #capitalize sentences
        return sentence_case(text)
        
    print(fix_punctuation(mystring))
    # => This is my first sentence with (brackets) in it. This is the second?
    #    What about this sentence with D.D.T. in it? Or this with 4.5? 
    

    See the Python demo.

    The re.findall pattern, (?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+[.!?](?:\s|\Z), matches

    • (?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+ - one or more occurrences of
      • \d+\.\d+ - one or more digits, ., one or more digits
      • | - or
      • \b[A-Z](?:\.[A-Z])*\b\.? - a word boundary, an uppercase letter, zero or more repetitions of a period and an uppercase letter, a word boundary and an optional .
      • | - or
      • [^.!?] - a char other than ., ! and ?
    • [.!?] - ., ! or ?
    • (?:\s|\Z) - a whitespace or end of string.
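
For readability, the same findall pattern can be written out with re.VERBOSE so that each alternative carries its own comment. This is only a sketch: the names SENTENCE and sample are made up for illustration, and the input is assumed to already have a space after each sentence-ending punctuation mark.

```python
import re

# The sentence-matching pattern from the answer, spelled out with re.VERBOSE.
# Whitespace and "#" comments inside the pattern are ignored by the regex engine.
SENTENCE = re.compile(r'''
    (?:
        \d+\.\d+                    # a decimal number such as 4.5
      | \b[A-Z](?:\.[A-Z])*\b\.?    # an abbreviation such as D.D.T. (last period optional)
      | [^.!?]                      # any char that is not sentence-ending punctuation
    )+
    [.!?]                           # the punctuation mark that ends the sentence
    (?:\s|\Z)                       # then one whitespace char or the end of the string
''', re.VERBOSE)

# Illustrative input (not from the question).
sample = 'we used D.D.T. in 4.5 litres. did it work? yes!'
sentences = SENTENCE.findall(sample)
print(sentences)
# ['we used D.D.T. in 4.5 litres. ', 'did it work? ', 'yes!']
```

Because the pattern contains no capturing groups, re.findall returns the full text of each match, including the single whitespace character that follows the closing punctuation.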

    The re.sub pattern, (\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*, has two alternatives. The first one captures into Group 1 the text we want to skip (decimal numbers and abbreviations). The second one captures a punctuation char into Group 2 and then matches zero or more whitespace chars after it, so that only a single space is left after the punctuation. The lambda expression in the replacement argument holds the custom logic: if Group 1 matched, it is returned unchanged; otherwise, Group 2 is returned followed by one space.
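
To see the two-group replacement on its own, here is a minimal sketch that isolates that step. The helper name space_after_punctuation and the test strings are assumptions made for illustration; the pattern itself is the one from the answer.

```python
import re

# Group 1: spans to leave untouched (decimal numbers, uppercase abbreviations).
# Group 2: one punctuation char; the trailing \s* swallows any whitespace after it.
PATTERN = re.compile(r'(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*')

def space_after_punctuation(text):  # hypothetical helper name
    def repl(m):
        if m.group(1) is not None:
            # A "skip" match: put it back exactly as it was found.
            return m.group(1)
        # A punctuation match: keep the char and follow it with one space.
        return m.group(2) + ' '
    return PATTERN.sub(repl, text)

print(space_after_punctuation('a,b;c'))               # a, b; c
print(space_after_punctuation('pi is 3.14,roughly'))  # pi is 3.14, roughly
print(space_after_punctuation('wait,    what'))       # wait, what
print(space_after_punctuation('plan A.then B'))       # plan A.then B
```

Two things are worth noting. A punctuation mark at the very end of the input also gets a space appended, so the result may need an rstrip() call. And, as the last call shows, a single uppercase letter followed by a period is treated as an abbreviation, so no space is added after "A."; whether that is acceptable depends on the text being cleaned.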