I have some LaTeX text that I am working on, and I need to clean it so that it can be split properly on spaces.
So the string:
\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co
should be:
\\mathrm l > \\mathrm li ^ + > \\mathrm mg ^ + > \\mathrm a \\beta + \\mathrm co
So in order to split it, I have to put a space around every special character, while keeping LaTeX commands of the form \something intact.
I can use re.compile(r'[a-zA-Z0-9 \\]') to match everything that is not a special character, but then how should I go about inserting the spaces?
I have written something like the code below, but it does not look very efficient (or is it?):
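import re

def insert_space(sentence):
    '''
    Add a space around special characters.
    So "x+y +-=y \\latex" becomes: "x + y + - = y \\latex"
    '''
    string = ''
    for i in sentence:
        # Pad anything that is not alphanumeric, a space, or a backslash.
        if (not i.isalnum()) and i not in [' ', '\\']:
            string += ' ' + i + ' '
        else:
            string += i
    # Collapse the runs of whitespace created by the padding.
    return re.sub(r'\s+', ' ', string)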
I haven't used LaTeX, so assuming [a-zA-Z0-9 \\] really does cover everything that isn't a special character, you could do something like this:
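import re

def insert_space(sentence):
    # Insert a space before a special character that has no space in front of it.
    sentence = re.sub(r'(?<! )(?![a-zA-Z0-9 \\])', ' ', sentence)
    # Insert a space after a special character that has no space behind it.
    sentence = re.sub(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! )', ' ', sentence)
    return sentence
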
my_string = '\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co'
print('before', my_string)
# before \mathrm l >\mathrm li ^ + >\mathrm mg ^ +>\mathrm a \beta+ \mathrm co
print('after', insert_space(my_string))
# after \mathrm l > \mathrm li ^ + > \mathrm mg ^ + > \mathrm a \beta + \mathrm co
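The first regex is:
    (?<! ) : negative lookbehind for a space.
    (?![a-zA-Z0-9 \\]) : negative lookahead for the character class you specified.
    Each match is replaced with ' '.
The second regex is:
    (?<!^) : negative lookbehind for the start of the string.
    (?<![a-zA-Z0-9 \\]) : negative lookbehind for the character class you specified.
    (?! ) : negative lookahead for a space.
    Each match is replaced with ' '.
So effectively, the first regex finds every position just before a special character that is not already preceded by a space, and the second finds every position just after a special character that is not already followed by a space. Both patterns match an empty position rather than a character, so the substitution inserts a space there without removing anything.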
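To see the two passes separately, here is a small sketch on a shorter string. The names CLASS, first_pass and second_pass are only for illustration and are not part of the code above. Note that the first pattern also matches at the very end of the string (the lookahead succeeds when there is no next character), so the result carries a trailing space that you may want to strip.

```python
import re

# Assumption carried over from the question: these are the NON-special characters.
CLASS = r'[a-zA-Z0-9 \\]'

def first_pass(s):
    # Space before a special character that is not already preceded by a space.
    return re.sub(r'(?<! )(?!' + CLASS + r')', ' ', s)

def second_pass(s):
    # Space after a special character that is not already followed by a space.
    return re.sub(r'(?<!^)(?<!' + CLASS + r')(?! )', ' ', s)

s = 'x+y >\\z'
step1 = first_pass(s)
step2 = second_pass(step1)
print(repr(step1))  # 'x +y >\\z '
print(repr(step2))  # 'x + y > \\z '
```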
The reason you also need to include (?<!^) is to skip the position between the start of the string and the first character. Otherwise an extra space would be inserted at the beginning.
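If spaces at the edges are a concern (for example when the string starts or ends with a special character), one possible alternative is a single substitution that pads every special character and then collapses the whitespace. This is only a sketch under the same assumption about the character class. insert_space_single is a hypothetical name, and unlike the lookaround version it also collapses any existing runs of whitespace.

```python
import re

# Negation of the assumed "non-special" class: matches one special character.
SPECIAL = re.compile(r'([^a-zA-Z0-9 \\])')

def insert_space_single(sentence):
    # Pad every special character with a space on both sides...
    padded = SPECIAL.sub(r' \1 ', sentence)
    # ...then squeeze repeated whitespace and drop spaces at the edges.
    return re.sub(r'\s+', ' ', padded).strip()

my_string = '\\mathrm l >\\mathrm li ^ + >\\mathrm mg ^ +>\\mathrm a \\beta+ \\mathrm co'
print(insert_space_single(my_string))
# \mathrm l > \mathrm li ^ + > \mathrm mg ^ + > \mathrm a \beta + \mathrm co
```

Because the result is stripped, an input such as '+x+' comes back as '+ x +' with no leading or trailing space.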