Say I have this dummy URL and I need to extract the plants and their colors as capture groups:
https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html
The regex below captures the elements I need as intended, but fails to capture anything when there are fewer than 4 plants in the URL. There is a link to a regex tester at the bottom with sample code and a URL that you can play with.
How do I modify this regex so that it works dynamically and captures whatever is available, without requiring a static URL structure? For now, assume I am capturing at most 4 plants (8 groups).
(flowers\.com)\/compare\._(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+))
The original pattern requires all four _name.color parts, so it cannot match a URL that has fewer. You could match the first plant and make the second, third and fourth ones optional by wrapping each in an optional non-capturing group, (?:...)?
Instead of using .* you could also match a dot and one or more digits with \.\d+ to prevent unnecessary backtracking.
(flowers\.com)\/compare\._([^.]+)\.([^.]+)\.\d+(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?
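As a quick check of the pattern above, here is a minimal Python sketch that runs it against a URL with only two plants. The helper name extract_groups and the shortened URL are made up for illustration, and the escaped \/ is written as a plain / because Python does not need the slash escaped. Groups belonging to missing plants come back as None.

```python
import re

# Pattern from the answer: the first plant is required, plants 2-4 are optional.
PATTERN = re.compile(
    r"(flowers\.com)/compare\._([^.]+)\.([^.]+)\.\d+"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
)


def extract_groups(url):
    """Return the tuple of capture groups, or None if the URL does not match.

    Hypothetical helper, not part of the original answer.
    """
    match = PATTERN.search(url)
    return match.groups() if match else None


# Shortened version of the sample URL, with two plants instead of four.
two_plants = "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"
print(extract_groups(two_plants))
```

Note that the tuple also contains the flowers.com group, so the plant names and colors start at the second element.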
Another option, if you already know it is a flowers.com URL, is to parse the URL and take the path. If the parts for the plants are all structured in the same way, you can then use a single part of the pattern, _([^.]+)\.([^.]+)\.\d+, with re.findall to collect every name and color pair.
For example:
from urllib.parse import urlparse
import re
pattern = r"_([^.]+)\.([^.]+)\.\d+"
o = urlparse('https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html')
print(re.findall(pattern, o.path))
Output:
[('plant1', 'green'), ('plant2', 'yellow'), ('plant3', 'red'), ('plant4', 'orange')]
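The same findall approach can be wrapped so that it also checks the host before extracting, which covers the "if you already know it is flowers.com" assumption. This is only a sketch: the function name plants_from_url and its host parameter are hypothetical, and it returns an empty list for any other host.

```python
import re
from urllib.parse import urlparse

# One plant part: _name.color.digits
PLANT = re.compile(r"_([^.]+)\.([^.]+)\.\d+")


def plants_from_url(url, host="flowers.com"):
    """Return (name, color) pairs from the URL path, or [] for another host.

    Hypothetical helper; works for any number of plants.
    """
    parsed = urlparse(url)
    if parsed.hostname != host:
        return []
    return PLANT.findall(parsed.path)


print(plants_from_url(
    "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"
))
```

Because findall returns one tuple per match, this works without changes if the URL later contains more than 4 plants.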