python-3.x regex text

Regex to detect pattern and remove spaces from that pattern in Python


I have a file containing segments that form a word, in the format <+segment1 segment2 segment3 segment4+>. I want the output to have all the segments joined into one word, i.e. I want to remove the spaces between the segments and the <+ +> markers surrounding them. For example:

Input:

<+play ing+> <+game s .+>

Output:

playing games. 

I first tried detecting the pattern with \<\+(.*?)\+\>, but I cannot work out how to remove the spaces.


Solution

  • Use this Python code:

    import re
    line = '<+play ing+> <+game s .+>'
    line = re.sub(r'<\+\s*(.*?)\s*\+>', lambda z: z.group(1).replace(" ", ""), line)
    print(line)
    

    Results: playing games.

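    re.sub calls the lambda once per match. The lambda returns the captured group (the text between <+ and +>) with its spaces removed, and that return value replaces the whole match, so the <+ and +> markers are dropped as well. The space between the two groups lies outside any match and is kept.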

    REGEX EXPLANATION

    --------------------------------------------------------------------------------
      <                        '<'
    --------------------------------------------------------------------------------
      \+                       '+'
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, \v, and " ") (0 or
                               more times (matching as many as
                               possible))
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        .*?                      any character except \n (0 or more times
                                 (matching as few as possible))
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, \v, and " ") (0 or
                               more times (matching as many as
                               possible))
    --------------------------------------------------------------------------------
      \+                       '+'
    --------------------------------------------------------------------------------
      >                        '>'
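
Since the question mentions a file, here is a minimal sketch of how the same substitution might be applied line by line. The names SEGMENT_GROUP, join_segments and join_segments_in_lines are hypothetical, and io.StringIO stands in for a real opened file. It also assumes that tabs between segments should be removed along with spaces, which goes slightly beyond the replace(" ", "") used above.

```python
import io
import re

# Same pattern as in the answer, compiled once so it can be reused per line.
SEGMENT_GROUP = re.compile(r'<\+\s*(.*?)\s*\+>')

def join_segments(text):
    """Replace every <+seg1 seg2 ...+> group with its segments concatenated."""
    # Assumption: any whitespace inside a group (spaces, tabs) should go,
    # so \s+ is used instead of replacing only the space character.
    return SEGMENT_GROUP.sub(lambda m: re.sub(r'\s+', '', m.group(1)), text)

def join_segments_in_lines(lines):
    """Yield each line with its segment groups joined."""
    for line in lines:
        yield join_segments(line)

# io.StringIO stands in for an opened file here (hypothetical sample data).
sample = io.StringIO('<+play ing+> <+game s .+>\nplain text <+un do+>\n')
for cleaned in join_segments_in_lines(sample):
    print(cleaned, end='')
```

Text outside the <+ +> markers is left untouched, so spaces between words survive. With a real file, the StringIO object could be swapped for the handle returned by open().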