I need a regex pattern that extracts the information associated with each tag. The tags here are not HTML but follow the format /x=y, for example:
tags = "/yellow=flower/blue=sky"
What would be the regex pattern to yield this information?
I have tried:
linein = "/yellow=flower/blue=sky"
pattern = "^[A-Za-z0-9]{1}^[A-Za-z0-9]{1}"
p2 = re.findall(pattern, linein)
The expected output is:
yellow flower
blue sky
Your attempt has several issues:
- The pattern matches neither the literal / nor the literal =, even though both delimit every pair in your input.
- ^ will (by default) match the start of the input, so having it in the middle of your regex pattern guarantees that there will be no matches. Moreover, you want to match pairs that are not at the start of your input, so there really shouldn't be a ^ in your pattern.
- {1} tells the regex engine that the preceding pattern should be matched exactly once. It is never necessary to include this, since that is the default. It also doesn't do what you want: you don't want to say that an identifier like "yellow" can consist of only one character. On the contrary, you want to allow multiple characters, and the way to indicate that is with a +.
- [A-Za-z0-9] is almost the same as the much shorter \w. The main difference is that the latter also allows an underscore character (and, for Python 3 str patterns, non-ASCII letters and digits unless the re.ASCII flag is set), which I think would be fine. In most contexts identifiers are allowed to include underscores. So use \w instead. To make sure backslashes are passed on as-is to the regex engine, prefix your string literal with r.
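The points above can be checked quickly in an interpreter. This is only an illustrative sketch: the variable names and the sample string "my_tag" are my own, not part of the question.

```python
import re

linein = "/yellow=flower/blue=sky"

# The original attempt: a ^ in the middle of the pattern can never match
# once a character has been consumed, so nothing is found.
original = re.findall(r"^[A-Za-z0-9]{1}^[A-Za-z0-9]{1}", linein)
print(original)  # []

# {1} is redundant: with or without it, one character is matched at a time.
with_quantifier = re.findall(r"[a-z]{1}", "abc")
without_quantifier = re.findall(r"[a-z]", "abc")
print(with_quantifier == without_quantifier)  # True

# + allows one or more characters, so whole identifiers are matched.
words = re.findall(r"[a-z]+", linein)
print(words)  # ['yellow', 'flower', 'blue', 'sky']

# \w accepts an underscore, while [A-Za-z0-9] splits on it.
with_w = re.findall(r"\w+", "my_tag")
with_class = re.findall(r"[A-Za-z0-9]+", "my_tag")
print(with_w, with_class)  # ['my_tag'] ['my', 'tag']
```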
With the above points taken into account, your code would become:
import re
linein = "/yellow=flower/blue=sky"
pattern = r"/(\w+)=(\w+)"
lst = re.findall(pattern, linein)
print(lst) # [('yellow', 'flower'), ('blue', 'sky')]
print(dict(lst)) # {'yellow': 'flower', 'blue': 'sky'}
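If you want exactly the output shown in the question (one pair per line), you can unpack each tuple and print its two parts. A minimal sketch reusing the same pattern; the helper name parse_tags is hypothetical and only there to make the example self-contained:

```python
import re

def parse_tags(line):
    """Return a list of (key, value) tuples, one per /key=value pair in line."""
    return re.findall(r"/(\w+)=(\w+)", line)

for key, value in parse_tags("/yellow=flower/blue=sky"):
    print(key, value)
# yellow flower
# blue sky
```

Note that findall returns an empty list when the input contains no /key=value pair, so the loop simply prints nothing in that case.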