python regex capitalization

Preserve paragraph marks while capitalizing - RegEx


import codecs
import re

# Inputfile holds the text of the input file; the code that reads it is not shown.
p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
def cap(match):
    return match.group().capitalize()
capitalized_1 = p.sub(cap, Inputfile)

with codecs.open('o.txt', mode="w", encoding="utf_8") as file:
    file.write(capitalized_1)

I am using a regex to capitalize the first letter after . ? or !, which the code above does. But it also removes the paragraph marks (the pilcrow paragraph breaks) and lumps everything into one big paragraph.

How can I preserve the paragraph marks and prevent the clumping?

Input file:

on the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. you can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. when you create pictures, charts, or diagrams, they also coordinate with your current document look.

you can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. you can also format text directly by using the other controls on the home tab. most controls offer a choice of using the look from the current theme or using a format that you specify directly.

Current output:

On the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look. You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. You can also format text directly by using the other controls on the home tab. most controls offer a choice of using the look from the current theme or using a format that you specify directly.

Expected output:

On the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.

You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. You can also format text directly by using the other controls on the home tab. Most controls offer a choice of using the look from the current theme or using a format that you specify directly.

Edit 1:

import codecs
import re

def capitalize(match):
    return ''.join([match.group(1), match.group(2).capitalize()])

with codecs.open('i.txt', encoding='utf-8') as f:
    text = f.read()

pattern = re.compile(r'(^|[.?!]\s+)(\w+)?')

print(pattern.sub(capitalize, text))

This throws an error when I read the text from a file, following the approach in answer 1:

return ''.join([match.group(1), match.group(2).capitalize()])
AttributeError: 'NoneType' object has no attribute 'capitalize'
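
A minimal sketch of where the None likely comes from, assuming the file ends with a newline after the final period (the sample string below is made up for illustration). The last match is the closing period plus the newline, with no word after it, so the optional second group is empty:

```python
import re

pattern = re.compile(r'(^|[.?!]\s+)(\w+)?')

# Hypothetical sample: text read from a file usually ends with a newline.
text = "first sentence. second sentence.\n"

# Collect the second group of every match; the last one has no word to capture.
words = [m.group(2) for m in pattern.finditer(text)]
print(words)  # ['first', 'second', None]
```

Calling .capitalize() on that final None is what raises the AttributeError.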

Solution

  • You can do it like this:

    import re
    
    
    def capitalize(match):
        # group(2) is None when no word follows the punctuation (e.g. at the end of a file)
        return match.group(1) + (match.group(2) or '').capitalize()
    
    text = """on the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. you can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. when you create pictures, charts, or diagrams, they also coordinate with your current document look.
    
    you can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. you can also format text directly by using the other controls on the home tab. most controls offer a choice of using the look from the current theme or using a format that you specify directly."""
    
    pattern = re.compile(r'(^|[.?!]\s+)(\w+)?')
    
    print(pattern.sub(capitalize, text))
    

    Output

    On the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.
    
    You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. You can also format text directly by using the other controls on the home tab. Most controls offer a choice of using the look from the current theme or using a format that you specify directly.
    

    Notes

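    • (^|[.?!]\s+) captures either the start of the string (^) or a . (dot), ? or ! followed by one or more whitespace characters (space, tab, newline, etc.). Because \s+ also matches newlines, the blank line between paragraphs is captured in this group and written back unchanged.
    • (\w+)? captures the word that follows (one or more word characters). The trailing ? makes the group optional, so it is None when no word follows, for example when the text ends with a period and a newline.
    • The capitalize function then preserves what was matched in the first group and capitalizes the second group (the word).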
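
For the file-based case from the question, the notes above can be sketched as follows. This is only an illustration under some assumptions: capitalize_sentences and capitalize_file are hypothetical helper names, the built-in open() is swapped in for codecs.open, and the sample string is made up.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r'(^|[.?!]\s+)(\w+)?')


def capitalize_sentences(text):
    """Capitalize the first word of each sentence, keeping all whitespace."""
    def repl(match):
        # group(2) is None when no word follows the punctuation,
        # e.g. at the very end of a file that ends with a newline.
        word = match.group(2) or ''
        return match.group(1) + word.capitalize()
    return PATTERN.sub(repl, text)


def capitalize_file(src, dst):
    # src and dst are illustrative names for the input and output paths.
    with open(src, encoding='utf-8') as f:
        text = f.read()
    with open(dst, mode='w', encoding='utf-8') as f:
        f.write(capitalize_sentences(text))


# The blank line between the two paragraphs survives the substitution.
sample = "one. two!\n\nthree? four.\n"
print(repr(capitalize_sentences(sample)))  # 'One. Two!\n\nThree? Four.\n'
```

With the file names from the question, the call would be capitalize_file('i.txt', 'o.txt'). Note that str.capitalize() also lowercases the rest of the word, which is harmless for the all-lowercase input here but would turn a word like "iPhone" into "Iphone".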