
How to insert a space between a special character and everything else


I have some LaTeX text that I am working on, and I need to clean it so that I can split it properly on whitespace.

So the string:

\\mathrm l  >\\mathrm li ^ + >\\mathrm mg ^   +>\\mathrm a  \\beta+  \\mathrm co

should be:

\\mathrm l  > \\mathrm li ^ + > \\mathrm mg ^   + > \\mathrm a  \\beta +  \\mathrm co

So in order to split it, I need a space between every special character and whatever is next to it. I also want to keep LaTeX commands of the form \something intact.

I can use re.compile(r'[^a-zA-Z0-9 \\]') to match all the special characters, but how should I go about inserting the spaces?

I have written something like this, but I am not sure it is good in terms of efficiency. (Or is it?)

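import re

def insert_space(sentence):
    '''
    Add a space around special characters, so "x+y +-=y \\latex" becomes "x + y + - = y \\latex".
    '''
    string = ''
    for i in sentence:
        # Anything that is not alphanumeric, a space, or a backslash counts as special.
        if (not i.isalnum()) and i not in [' ', '\\']:
            string += ' ' + i + ' '
        else:
            string += i
    # Collapse the runs of whitespace created above into single spaces.
    return re.sub(r'\s+', ' ', string)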

Solution

  • I haven't used LaTeX, so if you're sure that [a-zA-Z0-9 \\] captures everything that isn't a special character, you could do something like this:

    import re
    
    def insert_space(sentence):
        sentence = re.sub(r'(?<! )(?![a-zA-Z0-9 \\])', ' ', sentence)
        sentence = re.sub(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! )', ' ', sentence)
        return sentence
    
    my_string = '\\mathrm l  >\\mathrm li ^ + >\\mathrm mg ^   +>\\mathrm a  \\beta+  \\mathrm co'
    print('before', my_string)
    # before \mathrm l  >\mathrm li ^ + >\mathrm mg ^   +>\mathrm a  \beta+  \mathrm co
    print('after', insert_space(my_string))
    # after \mathrm l  > \mathrm li ^ + > \mathrm mg ^   + > \mathrm a  \beta +  \mathrm co 
    

    The first regex is:

    • (?<! ) Negative look behind for a space.
    • (?![a-zA-Z0-9 \\]) Negative look ahead for the character class you specified.
    • Insert a space ' ' at every position where both assertions hold.
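
    To see the first pass in isolation, here is a minimal sketch that applies only that pattern. The names FIRST and first_pass are made up for this illustration and are not part of the code above.

```python
import re

# Same pattern as the first re.sub call; FIRST and first_pass are
# hypothetical names used only for this illustration.
FIRST = re.compile(r'(?<! )(?![a-zA-Z0-9 \\])')

def first_pass(sentence):
    # Insert a space at every zero-width position the pattern matches.
    return FIRST.sub(' ', sentence)

print(repr(first_pass('a+b')))    # 'a +b '
print(repr(first_pass('a +b ')))  # 'a +b ' (nothing left to insert)
```

    The space after the final b appears because the negative lookahead also succeeds at the very end of the string, where there is no next character at all.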

    The second regex is:

    • (?<!^) Negative look behind for the start of the string.
    • (?<![a-zA-Z0-9 \\]) Negative look behind for the character class you specified.
    • (?! ) Negative look ahead for a space.
    • Insert a space ' ' at every position where all three assertions hold.
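
    The second pass can be tried on its own in the same way. This is only a sketch; SECOND and second_pass are hypothetical names, and the first input is what the first pass would produce for 'a+b'.

```python
import re

# Same pattern as the second re.sub call; SECOND and second_pass are
# hypothetical names used only for this illustration.
SECOND = re.compile(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! )')

def second_pass(sentence):
    # Insert a space right after a special character that is not
    # already followed by one.
    return SECOND.sub(' ', sentence)

print(repr(second_pass('a +b ')))  # 'a + b '
print(repr(second_pass('+a')))     # '+ a'
```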

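    So effectively, the first regex finds every position just before a special character where the preceding character is not a space, and the second finds every position just after a special character where the following character is not a space. A space is inserted at each of those positions.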

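    The reason you also need (?<!^) is to ignore the position between the start of the string and the first character. A negative look behind succeeds when there is no preceding character at all, so without it an extra space would be added at the beginning. The first regex has no equivalent guard for the end of the string, which is why the output above ends with a trailing space.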
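
    A small sketch of the start-of-string behaviour, followed by one possible variant that also guards the end of the string. The names NO_GUARD, GUARDED and insert_space_no_edges are hypothetical, and the \Z alternative in the lookaheads is my own addition rather than part of the answer above.

```python
import re

# Second pattern without and with the start-of-string guard.
NO_GUARD = re.compile(r'(?<![a-zA-Z0-9 \\])(?! )')
GUARDED = re.compile(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! )')

print(repr(NO_GUARD.sub(' ', 'x+y')))  # ' x+ y' (unwanted leading space)
print(repr(GUARDED.sub(' ', 'x+y')))   # 'x+ y'

def insert_space_no_edges(sentence):
    # Assumption: no space should ever be added at either edge of the
    # string, so both passes skip the start (?<!^) and the end (\Z).
    sentence = re.sub(r'(?<!^)(?<! )(?![a-zA-Z0-9 \\]|\Z)', ' ', sentence)
    sentence = re.sub(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! |\Z)', ' ', sentence)
    return sentence

print(repr(insert_space_no_edges('x+y')))  # 'x + y'
print(repr(insert_space_no_edges('+x+')))  # '+ x +'
```

    Calling .strip() on the result would be a simpler alternative, but it would also remove any leading or trailing whitespace that was already in the input.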