python regex regexp-replace

regex dealing with brackets


I have multiple strings like

string1 = """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"""
string2 = """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]""" 
string3 = """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"""
strings = [string1, string2, string3]

Every string contains one or more "[br]"s.

Each string may or may not include annotations.

Every annotation starts with "[*" and ends with "]". It may contain double brackets ("[[" and "]]") but never single ones ("[" and "]"), so the closing "]" is unambiguous (e.g. [* some annotation with [[brackets]]]).

The text I want to replace is everything between the first "[br]" and the first annotation (or the end of the string, if there is no annotation), which is

word1 = """팔짱낄 공''':'''"""
word2 = """낟알 과'''-'''"""
word3 = """둘레 곽[br]클 확"""

So I tried

import re

for string in strings:
    print(re.sub(r"\[br\](.)+?(\[\*)+", "AAAA", string))

expecting something like

[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]

The logic for the regex was

\[br\] : the first "[br]"

(.)+? : one or more characters that I want to replace, lazy

(\[\*)+ : one or more "[*"s

But the result was

[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[顆|{{{#!html}}}]]AAAA some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]]AAAA another annotation.][* another annotation.]

instead. I also tried r"\[br\](.)+?(\[\*)*", but that did not work either. How can I fix this?


Solution

  • You could use

    ^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)
    

    The pattern matches

    • ^ Start of string
    • (.*?\[br]) Capture group 1, match as few chars as possible up to and including the first occurrence of [br]
    • .+? Match any char 1+ times, as few as possible (lazy)
    • (?= Positive lookahead, assert that what follows directly to the right is
      • \[\*.*?](?<!].)(?!]) Match from [* to a ] that is neither preceded nor followed by another ]
      • | Or
      • $ Assert end of string
    • ) Close lookahead

    Replace with capture group 1 followed by AAAA, written as \1AAAA in Python
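To make the breakdown above easier to follow, the same pattern can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch of the pattern described above; the names ANNOTATED and replace_word are made up for this illustration and are not part of the original answer.

```python
import re

# Same pattern as above, spread out with re.VERBOSE (whitespace and
# "#" comments inside the pattern are ignored by the regex engine).
ANNOTATED = re.compile(
    r"""
    ^(.*?\[br])      # group 1: shortest prefix ending in the first [br]
    .+?              # the text to replace, as few chars as possible
    (?=              # stop where the following can be matched:
        \[\*.*?]     #   an annotation from [* up to some ]
        (?<!].)      #   ... where that ] is not preceded by another ]
        (?!])        #   ... and not followed by another ]
      |              # or
        $            #   the end of the line
    )
    """,
    re.VERBOSE | re.MULTILINE,
)


def replace_word(text, replacement="AAAA"):
    """Hypothetical helper: keep group 1, swap the matched text."""
    # A callable replacement avoids backslash handling in `replacement`.
    return ANNOTATED.sub(lambda m: m.group(1) + replacement, text)


print(replace_word("[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"))
print(replace_word("[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"))
```

Passing a function instead of r"\1AAAA" is a design choice for the sketch: it behaves the same for plain text, but stays safe if the replacement ever contains backslashes or digits.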


    Example code

    import re
    
    pattern = r"^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)"
    
    # One test string per line; re.MULTILINE lets ^ and $ match at every line.
    s = ("[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''\n"
         "[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', \") and brackets(\"(\", \")\", \"[[\", \"]]\").]\n"
         "[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]")
    
    # \1 re-inserts the prefix captured up to and including the first [br].
    result = re.sub(pattern, r"\1AAAA", s, flags=re.MULTILINE)
    print(result)
    

    Output

    [[拱|{{{#!html}}}]][br]AAAA
    [[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
    [[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
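For completeness, here is a short sketch of why the attempt in the question behaves as it does, plus a simpler variant. The original pattern includes "[br]" and "[*" in the match itself, so re.sub replaces them as well, and a string without any "[*" cannot match at all. The variant below is an assumption on my part rather than the answer's method: it keeps both delimiters out of the match with a lookbehind and a lookahead, and it relies on the question's guarantee that the first "[*" after the first "[br]" always starts an annotation. The names attempt, simple and replace_words are hypothetical.

```python
import re

strings = [
    """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''""",
    """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]""",
    """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]""",
]

# The pattern from the question: the match text itself contains "[br]"
# and "[*", which is why both disappear after re.sub.
attempt = re.compile(r"\[br\](.)+?(\[\*)+")
for s in strings:
    m = attempt.search(s)
    print(repr(m.group(0)) if m else "no match")

# Assumed simpler variant: lookarounds assert the delimiters without
# consuming them. The lookbehind is fixed-width, as Python's re requires.
simple = re.compile(r"(?<=\[br\]).+?(?=\[\*|$)")


def replace_words(s, replacement="AAAA"):
    """Hypothetical helper: replace only the first matching span."""
    return simple.sub(replacement, s, count=1)


for s in strings:
    print(replace_words(s))
```

Unlike the pattern in the answer, this variant does not check that the annotation is properly closed; it only looks for the next "[*". It also assumes each string is processed on its own and has at least one character between the first "[br]" and the annotation.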