Tags: python, regex, python-3.x, regex-group, regex-greedy

RegEx for replacing all groups with one row


For example, I have this string:

<ul><li><ahref="http://test.com">sometext</a></li></ul>

and I want this output:

<ul><li>[URL href="http://test.com"]sometext[/URL]</li></ul>

So I created this regex to match <ahref as the first group, "> as the second group, and </a> as the third group, so that I can replace them with [URL for the first group, "] for the second group, and [/URL] for the third group:

pattern = r'(<a ?href).+(">).+(<\/a>)'

It matches the groups, but now I don't know how to replace them.


Solution

  • Here, we would capture the parts we wish to keep using 4 capturing groups, and rebuild the string around them in the replacement, with an expression similar to:

    (<ul><li>)<a\s+href=\"(.+?)\">(.+?)<\/a>(<\/li><\/ul>)
    


    If the space is missing, we would simply use:

    (<ul><li>)<ahref=\"(.+?)\">(.+?)<\/a>(<\/li><\/ul>)
    


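    If we might have both instances, we would add an optional space group, either capturing or non-capturing. Note that the capturing form shown here adds a fifth group and shifts the later group numbers by one, so the replacement would reference \1, \3, \4 and \5; a non-capturing (?:\s+)? leaves the numbering unchanged: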

    (<ul><li>)<a(\s+)?href=\"(.+?)\">(.+?)<\/a>(<\/li><\/ul>)
    

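The effect of the optional-space group on group numbering can be sketched as follows. This is only an illustration: the names CAPTURING, NON_SHIFTING, with_space and without_space are made up for the example, and the \s* form is an assumed simplification of the optional group, not something from the expressions above.

```python
import re

# Capturing optional space: five groups, so the replacement skips group 2.
CAPTURING = re.compile(r'(<ul><li>)<a(\s+)?href="(.+?)">(.+?)</a>(</li></ul>)')
CAPTURING_REPL = r'\1[URL href="\3"]\4[/URL]\5'

# Assumed simplification: \s* needs no group at all, so numbering stays 1-4.
NON_SHIFTING = re.compile(r'(<ul><li>)<a\s*href="(.+?)">(.+?)</a>(</li></ul>)')
NON_SHIFTING_REPL = r'\1[URL href="\2"]\3[/URL]\4'

with_space = '<ul><li><a href="http://test.com">sometext</a></li></ul>'
without_space = '<ul><li><ahref="http://test.com">sometext</a></li></ul>'

for sample in (with_space, without_space):
    # Both variants should produce the same converted string.
    print(CAPTURING.sub(CAPTURING_REPL, sample))
    print(NON_SHIFTING.sub(NON_SHIFTING_REPL, sample))
```

Either variant handles both inputs; the second one simply avoids having to renumber the backreferences.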

    Test

    # coding=utf8
    # the above tag defines encoding for this document and is for Python 2.x compatibility
    
    import re
    
    regex = r"(<ul><li>)<a\s+href=\"(.+?)\">(.+?)<\/a>(<\/li><\/ul>)"
    
    test_str = "<ul><li><a href=\"http://test.com\">sometext</a></li></ul>"
    
    # \1 to \4 refer to the four capturing groups in the pattern
    subst = "\\1[URL href=\"\\2\"]\\3[/URL]\\4"
    
    # count=0 replaces every match; pass a positive count to limit the number of replacements
    result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
    
    if result:
        print(result)
    
    # Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
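If the surrounding <ul><li> markup does not need to be part of the match, a shorter variant might capture only the URL and the link text, which also lets a single call convert several links. This is a minimal sketch under that assumption; the function name replace_anchors and the second sample link are hypothetical, and the pattern only covers anchors whose sole attribute is a double-quoted href.

```python
import re

# Only the href value and the link text are captured; everything outside
# the <a ...>...</a> element is left untouched by re.sub.
ANCHOR = re.compile(r'<a\s*href="([^"]+)">(.+?)</a>')


def replace_anchors(html):
    """Rewrite each simple <a href="...">text</a> as [URL href="..."]text[/URL]."""
    return ANCHOR.sub(r'[URL href="\1"]\2[/URL]', html)


sample = ('<ul><li><a href="http://test.com">sometext</a></li>'
          '<li><a href="http://example.com">other</a></li></ul>')
print(replace_anchors(sample))
```

The lazy (.+?) stops at the first </a>, so adjacent links on one line are converted separately rather than merged into one match.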
    

    RegEx Circuit

    jex.im visualizes regular expressions.