This is a two-part question:
Part 1
I want to collapse runs of whitespace and multiple paragraph breaks down to just one.
Current code:
import re
# Read the input file
with open('input.txt', 'r') as file:
    inputfile = file.read()
# Replace extra whitespace with a single space.
#outputfile = re.sub(r'\s+', ' ', inputfile).strip()
outputfile = ' '.join(inputfile.split(None))
# Write the output file
with open('output.txt', 'w') as file:
    file.write(outputfile)
Part 2:
Once the extra spaces are removed, I search for and replace pattern mistakes.
For example: ' [ ' to ' ['
Pattern1 = re.sub(' [ ', ' [', inputfile)
which throws an error:
raise error, v # invalid expression error: unexpected end of regular expression
However, this works (for example, to join the words before and after a hyphen):
Pattern1 = re.sub(' - ', '-', inputfile)
I have many more punctuation cases like this to handle once the spacing issue is solved.
I don't want each pattern to operate on the output of the previous pattern's replacement.
Is there a better approach to trimming the spaces around punctuation?
For the first part, you can split the text on runs of newlines, collapse the whitespace within each line, and then join the lines back together with single newlines, like so:
import re
text = "\n".join(re.sub(r"\s+", " ", line) for line in re.split("\n+", text))
print(text)
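As a quick illustration of the line above, here is a minimal sketch wrapped in a function and run on sample input. The name collapse_whitespace and the sample string are made up for this example, and the extra .strip() (which drops leading and trailing spaces on each line) is an addition that the one-liner above does not perform.

```python
import re

def collapse_whitespace(text):
    # Hypothetical helper name, not from the original answer.
    # Split on runs of newlines so each paragraph becomes one item.
    lines = re.split(r"\n+", text)
    # Collapse whitespace runs inside each line to a single space.
    # The .strip() is an extra step beyond the original one-liner.
    return "\n".join(re.sub(r"\s+", " ", line).strip() for line in lines)

sample = "First   paragraph,\twith  gaps.\n\n\nSecond    paragraph."
print(collapse_whitespace(sample))
```

Note that this keeps one newline between paragraphs; if you want a blank line between them, join on "\n\n" instead.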
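For the second part, you need to escape [ since it's a regex metacharacter (used to define character classes). An unescaped [ opens a character class that is never closed, which is why you get the "unexpected end of regular expression" error. Escape it like so: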
import re
text = re.sub(r"\[ ", "[", text)
text = re.sub(" ]", "]", text)
print(text)
Note that you don't need to escape the ] because there is no open [ for it to close, so it isn't special in this context.
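If you would rather not work out which characters need escaping, the standard library's re.escape can do it for a literal string. The following is a small sketch on a made-up sample string, using the ' [ ' to ' [' replacement from the question:

```python
import re

text = "see [ 1 ] and [ 2 ]"  # made-up sample input

# re.escape backslash-escapes regex metacharacters such as [
# so the string is matched literally.
pattern = re.escape(" [ ")
result = re.sub(pattern, " [", text)
print(result)
```

This only fixes the space after each opening bracket; the space before each ] needs its own replacement.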
Alternatively, for the second part you don't even need regular expressions, because the patterns are plain literal strings:
text = text.replace("[ ", "[").replace(" ]", "]")
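On the concern about later patterns acting on the output of earlier ones: one possible approach is to combine all the literal patterns into a single alternation and make every replacement in one pass, so no replacement is ever rescanned. This is only a sketch. The FIXES table, the function name fix_punctuation, and the sample string are assumptions for illustration, not part of the original question.

```python
import re

# Hypothetical table of literal mistakes and their fixes; extend as needed.
FIXES = {
    " - ": "-",
    "[ ": "[",
    " ]": "]",
    " ,": ",",
}

# Try longer literals first so they win over shorter ones starting at the
# same position; re.escape makes characters like [ match literally.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(FIXES, key=len, reverse=True))
)

def fix_punctuation(text):
    # Each match is replaced exactly once; replaced text is not rescanned.
    return PATTERN.sub(lambda match: FIXES[match.group(0)], text)

sample = "well - known [ citation ] , then more"
print(fix_punctuation(sample))
```

Because regex matches do not overlap, two patterns that compete for the same space (for example ' ] ' and ' , ') would leave one of them unfixed, so it helps to keep each pattern as short as possible.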