
pyparsing recursive grammar space separated list inside a comma separated list


I have the following string that I'd like to parse:

((K00134,K00150) K00927,K11389) (K00234,K00235)

Each step is separated by a space, and alternation is represented by a comma. I'm stuck on the first part of the string, where there is a space inside the brackets. The desired output is:

[[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']

What I've got so far is a basic setup for recursive parsing, but I'm stumped on how to fit a space-separated list into the bracketed expression:

from pyparsing import Word, Literal, Combine, nums, \
    Suppress, delimitedList, Group, Forward, ZeroOrMore

ortholog = Combine(Literal('K') + Word(nums, exact=5))
exp = Forward()
ortholog_group = Suppress('(') + Group(delimitedList(ortholog)) + Suppress(')')
atom = ortholog | ortholog_group | Group(Suppress('(') + exp + Suppress(')'))
exp <<= atom + ZeroOrMore(exp)

Solution

  • You are on the right track, but I think you only need one place where you include grouping with ()'s, not two.

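    import pyparsing as pp

    # Suppressed parentheses: matched, but left out of the results
    LPAR, RPAR = map(pp.Suppress, "()")
    # An ortholog ID: 'K' immediately followed by exactly 5 digits
    ortholog = pp.Combine('K' + pp.Word(pp.nums, exact=5))

    # A parenthesized group holds nested groups and/or comma-delimited orthologs
    ortholog_group = pp.Forward()
    ortholog_group <<= pp.Group(LPAR + pp.OneOrMore(ortholog_group | pp.delimitedList(ortholog)) + RPAR)
    expr = pp.OneOrMore(ortholog_group)

    tests = """\
    ((K00134,K00150) K00927,K11389) (K00234,K00235)
    """
    expr.runTests(tests)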
    

    gives:

    ((K00134,K00150) K00927,K11389) (K00234,K00235)
    [[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
    [0]:
      [['K00134', 'K00150'], 'K00927', 'K11389']
      [0]:
        ['K00134', 'K00150']
      [1]:
        K00927
      [2]:
        K11389
    [1]:
      ['K00234', 'K00235']
    

    This is not exactly what you said you were looking for:

    you wanted: [[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']
    I output  : [[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
    

    I'm not sure why your desired output has grouping around the space-separated part (K00134,K00150) K00927. Is this your intention or a typo? If it is intentional, you'll need to rework the definition of ortholog_group so that it parses a delimited list of space-delimited groups, in addition to grouping at parentheses. The closest I could get was this:

    [[[[['K00134', 'K00150']], 'K00927'], ['K11389']], [['K00234', 'K00235']]]
    

    which required some shenanigans to group on spaces, but not group bare orthologs when grouped with other groups. Here is what it looked like:

    ortholog_group <<= pp.Group(LPAR + pp.delimitedList(pp.Group(ortholog_group*(1,) & ortholog*(0,))) + RPAR) | pp.delimitedList(ortholog)
    

    The & operator in combination with the repetition operators gives the space-delimited grouping. (*(1,) is equivalent to OneOrMore and *(0,) to ZeroOrMore; the same syntax also supports *(10,) for "10 or more" and *(3,5) for "at least 3 and no more than 5".) This is still not exactly what you asked for, but it may get you closer if you really do need to group the space-delimited bits.
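To see the repetition operators on their own, here is a minimal sketch. The names digit, zero_or_more, one_or_more and three_to_five are made up for illustration and are not part of the grammar above; the only pyparsing features assumed are the tuple form of * and parseString.

```python
import pyparsing as pp

# A single digit; pyparsing skips whitespace between tokens by default
digit = pp.Word(pp.nums, exact=1)

zero_or_more = digit * (0,)     # same as pp.ZeroOrMore(digit)
one_or_more = digit * (1,)      # same as pp.OneOrMore(digit)
three_to_five = digit * (3, 5)  # at least 3 and no more than 5

print(zero_or_more.parseString("").asList())                 # []
print(one_or_more.parseString("1 2 3").asList())             # ['1', '2', '3']
# parseString stops after the fifth digit and ignores the rest
print(three_to_five.parseString("1 2 3 4 5 6 7").asList())   # ['1', '2', '3', '4', '5']
```

With fewer than three digits in the input, three_to_five raises a ParseException.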

    But I must say that grouping on spaces is ambiguous - or at least confusing. Should "(A,B) C D" be [[A,B],C,D] or [[A,B],C],[D] or [[A,B],[C,D]]? I think, if possible, you should permit comma-delimited lists, and perhaps space-delimited also, but require the ()'s when items should be grouped.
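One way that suggestion might look, as a rough sketch rather than a definitive grammar: commas and spaces are both accepted as separators, and only parentheses create a group. The names COMMA, group and item are my own for this illustration, and treating the comma as purely optional is an assumption about your format.

```python
import pyparsing as pp

LPAR, RPAR, COMMA = map(pp.Suppress, "(),")
ortholog = pp.Combine("K" + pp.Word(pp.nums, exact=5))

group = pp.Forward()
item = ortholog | group

# Inside parentheses, items may be separated by a comma or just by whitespace;
# neither separator creates a group, only the parentheses do.
group <<= pp.Group(LPAR + item + pp.ZeroOrMore(pp.Optional(COMMA) + item) + RPAR)
expr = pp.OneOrMore(item)

# Original input: the space-separated part is not grouped
print(expr.parseString("((K00134,K00150) K00927,K11389) (K00234,K00235)").asList())
# [[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]

# Extra parentheses make the intended grouping explicit
print(expr.parseString("(((K00134,K00150) K00927),K11389) (K00234,K00235)").asList())
# [[[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']]
```

The second call gives the nesting from the question, at the cost of writing the extra ()'s in the input.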