Tag-like (/x=y/yellow=flower) Pattern-Matching Regex Python


I need to match the information associated with each tag with a regex pattern. The tags here are not HTML but follow the format /x=y, for example:

tags = "/yellow=flower/blue=sky"

What would be the regex pattern to yield this information?

I have tried:

linein = "/yellow=flower/blue=sky"
pattern = "^[A-Za-z0-9]{1}^[A-Za-z0-9]{1}"
p2 = re.findall(pattern, linein)

The expected output is:

yellow flower
blue sky

Solution

  • Your attempt has several issues:

    • It doesn't attempt to match /, nor =
    • ^ will (by default) match the start of the input, so having it in the middle of your regex pattern is a guarantee of having no matches. Moreover, you want to match pairs that are not at the start of your input, so there really shouldn't be a ^ in your pattern.
    • {1} tells the regex engine that the preceding pattern should be matched exactly once. This is never necessary to include, since that is the default. More importantly, it doesn't do what you want: you don't want to say that an identifier like "yellow" can consist of only one character. On the contrary, you want to allow multiple characters, and the way to indicate that is with a +.
    • Less of an issue, but [A-Za-z0-9] is close to the much shorter \w. The latter also allows for an underscore character, which I think would be fine, since in most contexts identifiers are allowed to include underscores. Note that for str patterns in Python 3, \w also matches non-ASCII letters and digits; pass the re.ASCII flag if you want to limit it to [A-Za-z0-9_]. So use \w instead. To make sure backslashes are passed on as-is to the regex engine, prefix your string literal with r.
    • The desired output seems to be a multiline string, but that is not very handy to work with. You'd want to get a list of pairs, or possibly a dictionary with key/value pairs.
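
The points above can be checked directly. The following is only a small sketch; the variable names and the "/light_blue=sky" sample string are made up for illustration and are not from the question:

```python
import re

linein = "/yellow=flower/blue=sky"

# The original attempt: a ^ in the middle of the pattern can never
# match once a character has been consumed, so nothing is found.
original = re.findall(r"^[A-Za-z0-9]{1}^[A-Za-z0-9]{1}", linein)

# {1} is redundant: with or without it, exactly one character is matched.
single_a = re.findall(r"[A-Za-z0-9]{1}", "ab1")
single_b = re.findall(r"[A-Za-z0-9]", "ab1")

# + means "one or more", so whole identifiers are matched.
words = re.findall(r"[A-Za-z0-9]+", linein)

# \w additionally accepts the underscore (hypothetical sample input).
with_underscore = re.findall(r"\w+", "/light_blue=sky")

print(original)              # []
print(single_a == single_b)  # True
print(words)                 # ['yellow', 'flower', 'blue', 'sky']
print(with_underscore)       # ['light_blue', 'sky']
```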

    With the above points taken into account, your code would become:

    import re
    
    linein = "/yellow=flower/blue=sky"
    pattern = r"/(\w+)=(\w+)"
    lst = re.findall(pattern, linein)
    print(lst)  # [('yellow', 'flower'), ('blue', 'sky')]
    print(dict(lst))  # {'yellow': 'flower', 'blue': 'sky'}
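
If you really do want the text layout shown in the question, the pairs can be joined back into lines. This is a minimal sketch, assuming that ASCII-only identifiers are wanted (hence the optional re.ASCII flag); the names lines and output are illustrative:

```python
import re

linein = "/yellow=flower/blue=sky"

# re.ASCII restricts \w to [A-Za-z0-9_]; drop the flag to allow
# non-ASCII letters and digits as well.
pattern = re.compile(r"/(\w+)=(\w+)", re.ASCII)

# One "key value" line per matched pair.
lines = [f"{key} {value}" for key, value in pattern.findall(linein)]
output = "\n".join(lines)
print(output)
# yellow flower
# blue sky
```

Keeping the list of pairs (or the dictionary) as the primary result and formatting it only for display keeps the data easy to reuse.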