python, regex, regex-group

Regex Capturing Group


Say I have this dummy URL and I need to extract the plants and their colors as capture groups:

https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html

The regex below captures the elements I need, but it fails to capture anything when the URL contains fewer than 4 plants. There is a link to a regex tester at the bottom, with sample code and a URL that you can play with.

How do I modify this regex so that it captures whatever is available, without requiring a fixed URL structure? For now, assume I am capturing at most 4 plants (8 groups).

(flowers\.com)\/compare\._(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+))

https://regex101.com/r/prjAO7/2


Solution

  • You could match the first plant and make the second, third and fourth ones optional by wrapping each in an optional non-capturing group, (?:...)?

    Instead of .*, you could match a dot followed by one or more digits with \.\d+, which avoids unnecessary backtracking.

    (flowers\.com)\/compare\._([^.]+)\.([^.]+)\.\d+(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?
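
As a quick check of the optional-group pattern, here is a minimal sketch that runs it against the original four-plant URL and against a shortened two-plant URL. The two-plant URL and the helper name `plant_groups` are made up for illustration. Optional groups that do not take part in the match come back as `None`, and group 1 is the host name.

```python
import re

# Pattern from the answer: plant 1 is required, plants 2-4 are optional.
pattern = re.compile(
    r"(flowers\.com)\/compare\._([^.]+)\.([^.]+)\.\d+"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
)

# URL from the question (four plants).
four_plants = (
    "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228"
    "_plant3.red.403010_plant4.orange.399987.html"
)
# Hypothetical shorter URL (two plants), assumed to follow the same layout.
two_plants = "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"


def plant_groups(url):
    """Return the tuple of capture groups, or None if the URL does not match."""
    match = pattern.search(url)
    return match.groups() if match else None


print(plant_groups(four_plants))
print(plant_groups(two_plants))
```

With the two-plant URL, the last four groups are `None`, so any code that consumes the groups has to skip or filter those entries.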
    


    Another option, if you already know the URL belongs to flowers.com, is to parse the URL and work only on its path. If every plant segment has the same structure, you can use a single copy of the sub-pattern _([^.]+)\.([^.]+)\.\d+ with re.findall, which then matches however many plants are present.

    For example

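    from urllib.parse import urlparse
    import re
    
    # One plant segment: "_<name>.<color>.<digits>"
    pattern = r"_([^.]+)\.([^.]+)\.\d+"
    
    # Parse the URL so that only its path component is searched
    o = urlparse('https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html')
    # findall returns one (name, color) tuple per plant segment
    print(re.findall(pattern, o.path))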
    

    Output

    [('plant1', 'green'), ('plant2', 'yellow'), ('plant3', 'red'), ('plant4', 'orange')]
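
Because the single-segment pattern does not depend on how many plants the URL contains, the same idea can be wrapped in a small helper. The sketch below is one possible way to do that, not the only one. The function name `plant_colors`, the host check, and the shortened test URL are assumptions added for illustration.

```python
import re
from urllib.parse import urlparse

# One plant segment: "_<name>.<color>.<digits>"
PLANT = re.compile(r"_([^.]+)\.([^.]+)\.\d+")


def plant_colors(url, host="flowers.com"):
    """Return {plant: color} for each plant segment in the URL path.

    Returns an empty dict if the URL is not on the expected host
    (the host check is an assumption, not part of the original answer).
    """
    parsed = urlparse(url)
    if parsed.netloc != host:
        return {}
    # findall yields (name, color) tuples, which dict() accepts directly.
    return dict(PLANT.findall(parsed.path))


# Hypothetical two-plant URL, assumed to follow the same layout.
print(plant_colors(
    "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"
))
```

A dict keeps only the last color if a plant name repeats. If duplicates matter, keep the list of tuples returned by `findall` instead.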