Python Regex Key Value Matching


I am trying to parse a file that contains key-value pairs. Each key starts with a "-" followed by alphabetic characters, and its value follows it, as shown in the picture below.

When I parse the file with the regex pattern below, I can easily get the keys and values. But when a value contains multiple words, or quoted data that itself looks like key-value pairs, my pattern fails. I've tried multiple variations of the pattern without getting the desired output. I found a pattern that matches the quoted text, '"(.*?)"', but I can't use both patterns at the same time. Any help getting the desired output below is much appreciated.

[Image: Keys and Values]

My code (it gives the desired result for the first line only):

import re

mystring = '''-desc none -type used -cost med -color blue
-desc none -msg This is a message -name test
-desc "(-type old -cost high)" -color green'''

for line in mystring.splitlines():
    # Quoted text only; not yet combined with the key/value pattern.
    quoted = re.findall(r'"(.*?)"', line)
    # A key ("-", word characters, trailing whitespace), then one non-whitespace token.
    key_value = re.findall(r'(-\w+\s+)(\S+)', line)
    print(key_value)

### Output ###
[('-desc ', 'none'), ('-type ', 'used'), ('-cost ', 'med'), ('-color ', 'blue')]
[('-desc ', 'none'), ('-msg ', 'This'), ('-name ', 'test')]
[('-desc ', '"(-type'), ('-cost ', 'high)"'), ('-color ', 'green')]

### Desired Output ###
[('-desc ', 'none'), ('-type ', 'used'), ('-cost ', 'med'), ('-color ', 'blue')]
[('-desc ', 'none'), ('-msg ', 'This is a message'), ('-name ', 'test')]
[('-desc ', "(-type old -cost high)"), ('-color ', 'green')]

Solution

  • This regex matches both quoted values and unquoted multi-word values:

    regex raw:

    (?<!\S)-(\w+)\s+("[^"]*"|[^\s"-]+(?:\s+[^\s"-]+)*)(?!\S)
    

    python raw:

    r"(?<!\S)-(\w+)\s+(\"[^\"]*\"|[^\s\"-]+(?:\s+[^\s\"-]+)*)(?!\S)"
    

    https://regex101.com/r/7bYN1A/1

    Key = group 1
    Value = group 2

     (?<! \S )
     -
     ( \w+ )                       # (1)
     \s+ 
     (                             # (2 start)
          " [^"]* "
       |  [^\s"-]+ 
          (?: \s+ [^\s"-]+ )*
     )                             # (2 end)
     (?! \S )
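
The expanded layout above can also be kept in the source code as is. The sketch below assumes Python's `re.VERBOSE` flag, which ignores whitespace and `#` comments outside character classes, and checks that the expanded form finds the same pairs as the compact one. The names `VERBOSE_PATTERN`, `COMPACT_PATTERN`, and `sample` are illustrative only.

```python
import re

# Expanded form of the pattern, compiled in verbose mode.
VERBOSE_PATTERN = re.compile(r'''
    (?<! \S )                 # key must start the string or follow whitespace
    -
    ( \w+ )                   # (1) key name, without the dash
    \s+
    (                         # (2 start) value
        " [^"]* "             #   a double-quoted string, quotes included
      | [^\s"-]+              #   or a word containing no quote or hyphen
        (?: \s+ [^\s"-]+ )*   #   followed by further such words
    )                         # (2 end)
    (?! \S )                  # value must end the string or precede whitespace
''', re.VERBOSE)

# Compact form, for comparison.
COMPACT_PATTERN = re.compile(r'(?<!\S)-(\w+)\s+("[^"]*"|[^\s"-]+(?:\s+[^\s"-]+)*)(?!\S)')

sample = '-desc "(-type old -cost high)" -color green -msg This is a message'
verbose_pairs = VERBOSE_PATTERN.findall(sample)
compact_pairs = COMPACT_PATTERN.findall(sample)

print(verbose_pairs)
print(verbose_pairs == compact_pairs)
```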
    

    Benchmark

    Regex1:   (?<!\S)-(\w+)\s+("[^"]*"|[^\s"-]+(?:\s+[^\s"-]+)*)(?!\S)
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   10
    Elapsed Time:    1.66 s,   1660.05 ms,   1660048 µs
    Matches per sec:   301,196
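
To get a rough timing on your own machine, something like the following `timeit` sketch could be used. It is not the tool that produced the figures above, and the input is an assumption (the three lines from the question, which yield 9 matches rather than 10), so the numbers will differ.

```python
import re
import timeit

PATTERN = re.compile(r'(?<!\S)-(\w+)\s+("[^"]*"|[^\s"-]+(?:\s+[^\s"-]+)*)(?!\S)')

# Assumed benchmark input: the sample text from the question.
SAMPLE = '''-desc none -type used -cost med -color blue
-desc none -msg This is a message -name test
-desc "(-type old -cost high)" -color green'''


def count_matches():
    """Run the pattern over the whole sample and count the matches."""
    return len(PATTERN.findall(SAMPLE))


ITERATIONS = 50_000  # mirrors the 50 x 1000 iterations reported above
elapsed = timeit.timeit(count_matches, number=ITERATIONS)
matches = count_matches()

print(f"Matches found per iteration: {matches}")
print(f"Elapsed time: {elapsed:.2f} s for {ITERATIONS} iterations")
print(f"Matches per sec: {matches * ITERATIONS / elapsed:,.0f}")
```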