python regex whitespace

Regex to remove non-repeated spaces


I extract name information from a PDF in Python with fitz.

The problem is that most of the information has spaces inserted to match the background, which gives me, for example, firstname = "P I E R R E" and lastname = "L E  D U C  D E  C O L".

I need to remove the spaces between characters that are not next to another space.

Of course, at first I removed all spaces with "s/\s//g", but for the last name that gives me "LEDUCDECOL" and I need "LE DUC DE COL".


Solution

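  • You could match a single space, followed by a repeated capture group that matches any further spaces. A repeated group keeps only the value of its last iteration, so group 1 holds a single space when the match was a run of spaces and is unmatched when the match was a lone space.

    In the replacement, use the value of group 1 with \1 (Python 3.5+ replaces an unmatched group with an empty string):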

     ( )*
    

    If you want to match any whitespace character, you can replace the space with \s, but note that \s also matches a newline:

    \s(\s)*
    

    See a regex demo and a Python demo.

    For example:

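    import re

    strings = [
        "L E  D U C  D E  C O L",
        "a        b     c def g"
    ]
    # One space, then zero or more spaces captured one at a time in group 1.
    pattern = r" ( )*"
    for s in strings:
        # \1 is a single space for a run of spaces, empty for a lone space.
        print(re.sub(pattern, r"\1", s))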
    

    Output

    LE DUC DE COL
    a b cdefg
    

    If you want to match a single space that is not followed by another space, you can use a negative lookahead, and use an empty string in the replacement:

     (?! )
    

    See another regex demo.
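
As a rough sketch of how the lookahead variant behaves, here it is applied to the same sample strings. The helper name strip_lone_spaces is made up for this illustration and is not part of the original answer. Note one difference from the capture-group pattern: the lookahead only removes the last space of each run, so runs longer than two spaces are shortened by one rather than collapsed to a single space.

```python
import re

# A space that is NOT followed by another space:
# lone spaces, and the final space of any longer run.
LONE_SPACE = re.compile(r" (?! )")


def strip_lone_spaces(text: str) -> str:
    """Remove every space that is not followed by another space.

    Hypothetical helper name, used only for this example.
    """
    return LONE_SPACE.sub("", text)


strings = [
    "L E  D U C  D E  C O L",
    "a        b     c def g",
]

for s in strings:
    # repr() makes the remaining spaces easy to count.
    print(repr(strip_lone_spaces(s)))
```

The first string comes out as 'LE DUC DE COL'. In the second, the run of eight spaces becomes seven and the run of five becomes four, so if longer runs can occur in the PDF text, the capture-group pattern is probably the better fit.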