pyparsing recursive grammar: space-separated list inside a comma-separated list

I have the following string that I'd like to parse:

((K00134,K00150) K00927,K11389) (K00234,K00235)

Each step is separated by a space, and alternation is represented by a comma. I'm stuck on the first part of the string, where there is a space inside the brackets. The desired output is:

[[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']

What I've got so far is a basic setup for recursive parsing, but I'm stumped on how to fit a space-separated list into the bracket expression:

from pyparsing import Word, Literal, Combine, nums, \
    Suppress, delimitedList, Group, Forward, ZeroOrMore

ortholog = Combine(Literal('K') + Word(nums, exact=5))
exp = Forward()
ortholog_group = Suppress('(') + Group(delimitedList(ortholog)) + Suppress(')')
atom = ortholog | ortholog_group | Group(Suppress('(') + exp + Suppress(')'))
exp <<= atom + ZeroOrMore(exp)

Solution

You are on the right track, but I think you only need one place where you include grouping with ()'s, not two.

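import pyparsing as pp

# Parentheses delimit groups but are suppressed from the parsed output.
LPAR, RPAR = map(pp.Suppress, "()")
# An ortholog ID is 'K' followed by exactly five digits, returned as one token.
ortholog = pp.Combine('K' + pp.Word(pp.nums, exact=5))

# A group holds nested groups and/or comma-separated orthologs;
# only the parentheses introduce a level of nesting.
ortholog_group = pp.Forward()
ortholog_group <<= pp.Group(LPAR + pp.OneOrMore(ortholog_group | pp.delimitedList(ortholog)) + RPAR)
expr = pp.OneOrMore(ortholog_group)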

tests = """\
((K00134,K00150) K00927,K11389) (K00234,K00235)
"""
expr.runTests(tests)

gives:

((K00134,K00150) K00927,K11389) (K00234,K00235)
[[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
[0]:
  [['K00134', 'K00150'], 'K00927', 'K11389']
  [0]:
    ['K00134', 'K00150']
  [1]:
    K00927
  [2]:
    K11389
[1]:
  ['K00234', 'K00235']

This is not exactly what you said you were looking for:

you wanted: [[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']
I output  : [[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
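
If you want the result as plain nested lists in your own program rather than the runTests report, something like the following sketch should work. It reuses the grammar from this answer; the names text and steps are only illustrative.

```python
import pyparsing as pp

LPAR, RPAR = map(pp.Suppress, "()")
ortholog = pp.Combine("K" + pp.Word(pp.nums, exact=5))

ortholog_group = pp.Forward()
ortholog_group <<= pp.Group(
    LPAR + pp.OneOrMore(ortholog_group | pp.delimitedList(ortholog)) + RPAR
)
expr = pp.OneOrMore(ortholog_group)

# 'text' and 'steps' are illustrative names, not part of the question.
text = "((K00134,K00150) K00927,K11389) (K00234,K00235)"

# parseAll=True makes the parse fail if any trailing input is left unparsed.
steps = expr.parseString(text, parseAll=True).asList()

for step in steps:
    print(step)
```

Each element of steps is one top-level parenthesized group, so the space-separated steps of the input end up as separate list items.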

I'm not sure why there is grouping in your desired output around the space-separated part (K00134,K00150) K00927. Is this your intention or a typo? If it is intentional, you'll need to rework the definition of ortholog_group so that it parses a comma-delimited list of space-delimited groups, in addition to grouping at the parentheses. The closest I could get was this:

[[[[['K00134', 'K00150']], 'K00927'], ['K11389']], [['K00234', 'K00235']]]

which required some shenanigans to group on spaces, but not group bare orthologs when grouped with other groups. Here is what it looked like:

ortholog_group <<= pp.Group(LPAR + pp.delimitedList(pp.Group(ortholog_group*(1,) & ortholog*(0,))) + RPAR) | pp.delimitedList(ortholog)

The & operator (which builds a pp.Each expression, matching its operands in any order) in combination with the repetition operators gives the space-delimited grouping. *(1,) is equivalent to OneOrMore and *(0,) to ZeroOrMore, but the same notation also supports *(10,) for "10 or more", or *(3,5) for "at least 3 and no more than 5". This too is not exactly what you asked for, but it may get you closer if you really do need to group the space-delimited bits.
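
As a small illustration of the repetition operators on their own, here is a sketch using a throwaway item expression (a hypothetical name, not part of the grammar above):

```python
import pyparsing as pp

# 'item' is a stand-in expression used only for this illustration.
item = pp.Word(pp.alphas)

one_or_more = item * (1,)      # same as pp.OneOrMore(item)
zero_or_more = item * (0,)     # same as pp.ZeroOrMore(item)
three_to_five = item * (3, 5)  # at least 3 and no more than 5 items

print(one_or_more.parseString("a b c").asList())
print(zero_or_more.parseString("").asList())
# Without parseAll=True, parsing simply stops after the fifth item.
print(three_to_five.parseString("a b c d e f").asList())

# Fewer than three items does not satisfy the (3, 5) range.
try:
    three_to_five.parseString("a b")
    too_few_rejected = False
except pp.ParseException:
    too_few_rejected = True
print(too_few_rejected)
```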

But I must say that grouping on spaces is ambiguous - or at least confusing. Should "(A,B) C D" be [[A,B],C,D] or [[A,B],C],[D] or [[A,B],[C,D]]? I think, if possible, you should permit comma-delimited lists, and perhaps space-delimited also, but require the ()'s when items should be grouped.
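
To make that suggestion concrete, here is a rough sketch of one way it could look, assuming commas are treated as optional separators and only parentheses create a group. The names COMMA, item, group and parse are my own, and this is not the only way to write it.

```python
import pyparsing as pp

LPAR, RPAR = map(pp.Suppress, "()")
COMMA = pp.Suppress(",")
ortholog = pp.Combine("K" + pp.Word(pp.nums, exact=5))

# Assumption: commas and spaces are interchangeable separators;
# nesting in the output comes only from explicit parentheses.
group = pp.Forward()
item = ortholog | group
group <<= pp.Group(LPAR + item + pp.ZeroOrMore(pp.Optional(COMMA) + item) + RPAR)
expr = pp.OneOrMore(group)

def parse(text):
    """Parse text and return the result as plain nested lists."""
    return expr.parseString(text, parseAll=True).asList()

# The three readings of "(A,B) C D", each spelled out with parentheses:
flat = parse("((K00001,K00002) K00003 K00004)")
left_nested = parse("(((K00001,K00002) K00003) K00004)")
pair_of_groups = parse("((K00001,K00002) (K00003 K00004))")

print(flat)
print(left_nested)
print(pair_of_groups)
```

With this approach the writer of the input decides the grouping explicitly, so the parser never has to guess what a space means.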