I work with very messy text where sentences are not capitalised and punctuation is not spaced properly. I need to add a space where it is missing after punctuation [.,:;)!?], but not inside decimal numbers or abbreviations.
This is an example:
mystring = 'this is my first sentence with (brackets)in it. this is the second?What about this sentence with D.D.T. in it?or this with 4.5?'
This is what I've got so far:
import re

def fix_punctuation(text):
    def sentence_case(text):
        # Split into sentences. Therefore, find all text that ends
        # with punctuation followed by white space or end of string.
        sentences = re.findall(r'[^.!?]+[.!?](?:\s|\Z)', text)
        # Capitalize the first letter of each sentence
        sentences = [x[0].upper() + x[1:] for x in sentences]
        # Combine sentences
        return ''.join(sentences)

    # Add a space after punctuation
    text = re.sub(r'([.,;:!?)])', r'\1 ', text)
    # Capitalize sentences
    text = sentence_case(text)
    return text
Which gives me this output:
'This is my first sentence with (brackets) in it. this is the second? What about this sentence with D. D. T. in it? Or this with 4. 5? '
I tried the methods suggested here and here, but they didn't work for my case. Regex makes my brain hurt, so I would really appreciate your help.
As I understand it, you want to leave alone the periods inside numbers and inside chunks of period-separated single uppercase letters, with an optional period right after.
Here is a code snippet that implements the logic I described above:
import re

mystring = 'this is my first sentence with (brackets)in it. this is the second?What about this sentence with D.D.T. in it?or this with 4.5?'

def fix_punctuation(text):
    def sentence_case(text):
        # Split into sentences. Therefore, find all text that ends
        # with punctuation followed by white space or end of string.
        sentences = re.findall(r'(?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+[.!?](?:\s|\Z)', text)
        # Capitalize the first letter of each sentence
        sentences = [x[0].upper() + x[1:] for x in sentences]
        # Combine sentences
        return ''.join(sentences)

    # Add a space after punctuation, skipping decimals and abbreviations
    text = re.sub(r'(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*', lambda x: x.group(1) or f'{x.group(2)} ', text)
    # Capitalize sentences
    return sentence_case(text)

print(fix_punctuation(mystring))
# => This is my first sentence with (brackets) in it. This is the second?
# What about this sentence with D.D.T. in it? Or this with 4.5?
See the Python demo.
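To see which tokens the first alternative of both patterns protects, that sub-pattern can be run on its own. This is only a minimal sketch: the name PROTECTED and the sample strings are introduced here for illustration and are not part of the code above.

```python
import re

# Hypothetical name for illustration: the "skip" alternative shared by both patterns.
PROTECTED = re.compile(r'\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?')

sample = 'What about D.D.T. in it?or this with 4.5?'
protected_tokens = PROTECTED.findall(sample)
print(protected_tokens)  # ['D.D.T.', '4.5']

# A lone uppercase letter followed by a period is protected as well,
# so no space would be added after "A." in this sample.
print(PROTECTED.findall('Plan A.then'))  # ['A.']
```

The second call shows a trade-off of this approach: a sentence that ends in a single uppercase letter looks like an abbreviation to the pattern, so it may need separate handling if that case occurs in your data.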
The re.findall pattern, (?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+[.!?](?:\s|\Z), matches
- (?:\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?|[^.!?])+ - one or more occurrences of
  - \d+\.\d+ - one or more digits, ., one or more digits
  - | - or
  - \b[A-Z](?:\.[A-Z])*\b\.? - a word boundary, an uppercase letter, zero or more repetitions of a period and an uppercase letter, a word boundary and an optional .
  - | - or
  - [^.!?] - a char other than ., ! and ?
- [.!?] - ., ! or ?
- (?:\s|\Z) - a whitespace or end of string.

The re.sub pattern, (\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*, matches and captures into Group 1 those patterns that we want to skip, and then matches and captures into Group 2 some punctuation chars and then matches any zero or more whitespace chars (to make sure we only have a single space after them), and a custom logic is used in the replacement argument, in the lambda expression.
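The replacement logic may be easier to follow with the lambda written out as a named function. The sketch below assumes the same pattern; the names PATTERN and space_after_punctuation and the sample string are made up for illustration.

```python
import re

# Group 1: tokens to leave untouched (decimals, single-letter abbreviations).
# Group 2: a punctuation char; trailing whitespace is consumed by \s*.
PATTERN = re.compile(r'(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?)])\s*')

def space_after_punctuation(match):
    # Hypothetical helper name; behaves like the lambda in the answer.
    if match.group(1):
        # Protected token: return it unchanged.
        return match.group(1)
    # Punctuation: emit it followed by exactly one space.
    return match.group(2) + ' '

spaced = PATTERN.sub(space_after_punctuation, 'a,b.  c with 4.5 and D.D.T. here!done')
print(spaced)  # a, b. c with 4.5 and D.D.T. here! done
```

Because a space is appended after every Group 2 match, text that ends in punctuation comes back with a trailing space; calling .rstrip() on the result is one way to drop it if that matters.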