Tags: python, regex, string, replace, python-re

Regex string parsing: pattern starts with ; but can end with [;,)%&@]


I am attempting to parse strings using Regex. The strings look like:

Stack;O&verflow;i%s;the;best!

I want to parse it to:

Stack&verflow%sbest!

So whenever we see a ;, remove it and everything after it up to (but not including) one of the following characters: [;,)%&@]. In other words, replace that span with the empty string "".

I am using the re package in Python:

string = re.sub('^[^-].*[)/]$', '', string)

This is what I have right now:

^[^;].*[;,)%&@]

As I understand it, this says: start the pattern at a ;, then read everything between that ; and one of the [;,)%&@] characters.

But the result is wrong and looks like:

Stack;O&verflow;i%s;the;


What am I missing?

EDIT: @InSync pointed out that there is a discrepancy, because ; is itself one of the end characters. As worded above, the result should be Stack&verflow%s;best!, but what I actually want is Stack&verflow%sbest!. Perhaps two regex lines are appropriate here, I am not sure; if you can get to Stack&verflow%s;best!, then the rest is a simple replacement of all the remaining ; characters.

EDIT2: The code I found that works was

import re

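def remove_semicolons(name):
    # Pass 1: drop each ';' and the text after it, stopping just before the next terminator.
    name = re.sub(r';.*?(?=[;,)%&@])', '', name)
    # Pass 2: drop any ';' that is still left over.
    name = re.sub(r';', '', name)
    return name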

remove_semicolons('Stack;O&verflow;i%s;the;best!')

Or if you feel like causing a headache to the next programmer who looks at your code:

import re

semicolon_string = 'Stack;O&verflow;i%s;the;best!'

cleaned_string = re.sub(';', '', re.sub(';.*?(?=[;,)%&@])', '', semicolon_string))
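As a rough sketch, the two passes above can probably be folded into a single pattern: match a ;, optionally followed by a run of non-terminator characters, but only when that run is itself followed by a terminator. The function name remove_semicolons_single_pass is hypothetical and not from the original post, and the terminator set is assumed to be the same [;,)%&@] used throughout.

```python
import re

# Hypothetical single-pass variant of remove_semicolons.
# ';'                      the semicolon itself (always removed)
# (?:[^;,)%&@]*(?=...))?   optionally, the run of non-terminators after it,
#                          taken only if a terminator follows that run
SINGLE_PASS = re.compile(r";(?:[^;,)%&@]*(?=[;,)%&@]))?")


def remove_semicolons_single_pass(name):
    """Remove each ';' plus any text up to (not including) the next terminator."""
    return SINGLE_PASS.sub("", name)


print(remove_semicolons_single_pass("Stack;O&verflow;i%s;the;best!"))
# Stack&verflow%sbest!
```

One difference to be aware of: the negated class [^;,)%&@] also matches newlines, whereas . in the two-pass version does not (unless re.DOTALL is set), so the two versions may behave differently on multi-line input.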

Solution

  • Alright, in my answer I assume you have a typo in your expected output. If you remove everything starting at each ; up to (but not including) the next of [;,)%&@], then

    Stack ;O &verflow ;i %s ;the ;best!

    would become

    Stack&verflow%s;best!

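    For the regex, you want to start with ;, then match any character zero or more times with .* (change this to .+ if you require at least one character), followed by one of your ending characters [;,)%&@]. Add ? after the quantifier to make it lazy (.*?), so the match stops at the first ending character rather than the last one. To keep the ending character out of the match, put it in a positive lookahead, (?=[;,)%&@]). As the name suggests, this looks ahead at the next character and checks that it is one of the set, without consuming it.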

    For a final regex:

    ;.*?(?=[;,)%&@])
    

    or, if you require at least one character in between:

    ;.+?(?=[;,)%&@])
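
To make the difference concrete, here is a small sketch comparing the pattern from the question with the two patterns above. The variable names and the extra test string "a;,b" are illustrative assumptions, not part of the original post.

```python
import re

text = "Stack;O&verflow;i%s;the;best!"

# Pattern from the question: '^' anchors at the start of the string,
# '[^;]' matches one character that is NOT ';', and the greedy '.*'
# runs on to the LAST terminator, so almost the whole string is matched.
question_match = re.search(r"^[^;].*[;,)%&@]", text)
print(question_match.group())   # Stack;O&verflow;i%s;the;

# Lazy quantifier plus lookahead: each match starts at ';' and stops
# just before the FIRST terminator, which is left in place.
lazy = r";.*?(?=[;,)%&@])"
print(re.findall(lazy, text))   # [';O', ';i', ';the']
print(re.sub(lazy, "", text))   # Stack&verflow%s;best!

# '.*?' also matches a ';' that sits directly before a terminator,
# while '.+?' needs at least one character in between.
print(re.sub(r";.*?(?=[;,)%&@])", "", "a;,b"))   # a,b
print(re.sub(r";.+?(?=[;,)%&@])", "", "a;,b"))   # a;,b
```

The trailing ;best! survives because no terminator follows that last ;, which is why the question's final version adds a second substitution to remove leftover semicolons.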