python, regex, regex-group, capture-group

Multiple captures within a string


I haven't found a Q/A on SO that quite covers this situation, though I've adapted solutions from a few of them to get as far as I have.

I'm parsing the header (metadata) part of VCF files. Each line has the format:

##TAG=<key=val,key=val,...>

I have a regex that parses the multiple k-v pairs inside the <>, but I can't seem to add in the <> and have it still "work."

import re

s = 'a=1,b=two,c="three"'

pat = re.compile(r'''(?P<key>\w+)=(?P<value>[^,]*),?''')
match = pat.findall(s)
print(dict(match))
#{'a': '1', 'b': 'two', 'c': '"three"'}

Also,

import re

s = 'a=1,b=two,c="three"'

pat = re.compile(r'''(?:(?P<key>\w+)=(?P<value>[^,]*),?)''')
match = pat.findall(s)
print(match)
print(dict(match))
#[('a', '1'), ('b', 'two'), ('c', '"three"')]
#{'a': '1', 'b': 'two', 'c': '"three"'}

So, I thought I might be able to do:

import re

s = '<a=1,b=two,c="three">'

pat = re.compile(r'''<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>''')
match = pat.findall(s)
print(match)
print(dict(match))
#[]
#{}

If possible, I'd really like to do something like:

\#\#(?P<tag>)=<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>

and capture the TAG and all of the k-v pairs. And obviously, I'd like it to "work."

I realize the "proper" solution here is likely to use a parser rather than regex. But I'm a bioinformatics person, not a programmer. And the format is very consistent and laid out in a standardized specification that is (almost) always followed.


Solution

  • With the PyPI regex module (pip install regex):

    import regex
    s = '##TAG=<key=val,key2=val2>'
    pat = regex.compile(r'''##(?P<tag>\w+)=<(?:(?P<key>\w+)=(?P<value>[^,<>]*),?)*>''')
    match = pat.search(s)
    print([match.group("tag"), list(zip(match.captures("key"), match.captures("value")))])
    

    Regex explanation:

    --------------------------------------------------------------------------------
      ##                       '##'
    --------------------------------------------------------------------------------
      (?P<tag>                  group and capture to \k<tag>:
    --------------------------------------------------------------------------------
        \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                                 more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
      )                        end of \k<tag>
    --------------------------------------------------------------------------------
      =<                       '=<'
    --------------------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more times
                               (matching the most amount possible)):
    --------------------------------------------------------------------------------
        (?P<key>                 group and capture to \k<key>:
    --------------------------------------------------------------------------------
          \w+                      word characters (a-z, A-Z, 0-9, _) (1
                                   or more times (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
        )                        end of \k<key>
    --------------------------------------------------------------------------------
        =                        '='
    --------------------------------------------------------------------------------
        (?P<value>               group and capture to \k<value>:
    --------------------------------------------------------------------------------
          [^,<>]*                  any character except: ',', '<', '>' (0
                                   or more times (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
        )                        end of \k<value>
    --------------------------------------------------------------------------------
        ,?                       ',' (optional (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
      )*                       end of grouping
    --------------------------------------------------------------------------------
      >                        '>'
    

    Results: ['TAG', [('key', 'val'), ('key2', 'val2')]]
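
  • With only the standard library re module (a sketch, not tested against real VCF files):

    The stdlib re module has no captures(), and a repeated group keeps only its last repetition, so the single-pattern approach above does not carry over directly. One possible workaround is to match the line in two steps: first pull out the tag and everything between the <>, then run findall over that inner text. The names LINE_PAT, PAIR_PAT and parse_header_line are made up for this illustration. The value pattern additionally assumes that a double-quoted value may contain commas, which goes slightly beyond the pattern in the question.

```python
import re

# With stdlib re, a repeated group remembers only its final repetition.
last_only = re.search(r'<(?:(\w+)=([^,<>]*),?)*>', '<a=1,b=2>')
print(last_only.groups())  # only the last key/value pair survives

# Step 1 (hypothetical name): the tag plus the raw text between '<' and '>'.
LINE_PAT = re.compile(r'##(?P<tag>\w+)=<(?P<body>.*)>\s*$')

# Step 2 (hypothetical name): one key=value pair. A value is either a
# double-quoted string (assumed to allow commas inside) or a run of
# non-comma characters.
PAIR_PAT = re.compile(r'(?P<key>\w+)=(?P<value>"[^"]*"|[^,]*)')


def parse_header_line(line):
    """Return (tag, {key: value}), or None if the line does not match."""
    m = LINE_PAT.match(line)
    if m is None:
        return None
    # findall returns (key, value) tuples because the pattern has two groups.
    return m.group('tag'), dict(PAIR_PAT.findall(m.group('body')))


result = parse_header_line('##TAG=<a=1,b=two,c="three, four">')
print(result)
```

    Quoted values keep their surrounding quotes, as in the question's own output; strip them afterwards if they are not wanted. Lines that do not have the ##TAG=<...> shape return None, so they can be skipped or handled separately.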