python parsec parser-combinators

Simply using parsec in Python


I'm looking at this library, which has little documentation: https://pythonhosted.org/parsec/#examples

I understand there are alternatives, but I'd like to use this library.

I have the following string I'd like to parse:

mystr = """
<kv>
  key1: "string"
  key2: 1.00005
  key3: [1,2,3]
</kv>
<csv>
date,windspeed,direction
20190805,22,NNW
20190805,23,NW
20190805,20,NE
</csv>"""

While I'd like to parse the whole thing, I'd settle for just grabbing the <tags>. I have:

>>> import parsec
>>> tag_start = parsec.Parser(lambda x: x == "<")
>>> tag_end = parsec.Parser(lambda x: x == ">")
>>> tag_name = parsec.Parser(parsec.Parser.compose(parsec.many1, parsec.letter))
>>> tag_open = parsec.Parser(parsec.Parser.joint(tag_start, tag_name, tag_end))

OK, looks good. Now to use it:

>>> tag_open.parse(mystr)
Traceback (most recent call last):
...
TypeError: <lambda>() takes 1 positional argument but 2 were given

This fails. I don't understand why the error says my lambda expression was given two arguments when it clearly takes one. How can I proceed?

My ideal output, for all the bonus points, is:

[
{"type": "tag", 
 "name" : "kv",
 "values"  : [
    {"key1" : "string"},
    {"key2" : 1.00005},
    {"key3" : [1,2,3]}
  ]
},
{"type" : "tag",
"name" : "csv", 
"values" : [
    {"date" : 20190805, "windspeed" : 22, "direction": "NNW"},
    {"date" : 20190805, "windspeed" : 23, "direction": "NW"},
    {"date" : 20190805, "windspeed" : 20, "direction": "NE"}
  ]
}
]

The output I'd settle for in this question uses functions like the ones above for start and end tags to generate:

[
  {"tag": "kv"},
  {"tag" : "csv"}
]

I'd also like to be able to parse arbitrary XML-like tags out of the messy mixed text.


Solution

  • I encourage you to build your parser from the library's combinators rather than constructing Parser objects directly.

    If you do want to construct a Parser by wrapping a function, the documentation states that the function fn must accept two arguments: the text and the current position in it. That is why your one-argument lambda raises the TypeError. fn must also return a Value, created with Value.success or Value.failure, rather than a boolean. Grep for @Parser in parsec/__init__.py in the package to find more examples of how this works.

    For the input in your question, you could define the parser as follows:

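    import re
    
    from parsec import *
    
    # Optional whitespace (including newlines) and an identifier-style name.
    spaces = regex(r'\s*', re.MULTILINE)
    name = regex(r'[_a-zA-Z][_a-zA-Z0-9]*')
    
    # '<name>' and '</name>': each returns the tag name and skips
    # the whitespace around the tag.
    tag_start = spaces >> string('<') >> name << string('>') << spaces
    tag_stop = spaces >> string('</') >> name << string('>') << spaces
    
    @generate
    def header_kv():
        # One 'key: value' line; the value is the raw rest of the line.
        key = yield spaces >> name << spaces
        yield string(':')
        value = yield spaces >> regex(r'[^\n]+')
        return {key: value}
    
    @generate
    def header():
        # Opening tag, newline-separated key/value lines, closing tag.
        tag_name = yield tag_start
        values = yield sepBy(header_kv, string('\n'))
        tag_name_end = yield tag_stop
        # The closing tag must match the opening tag.
        assert tag_name == tag_name_end
        return {
            'type': 'tag',
            'name': tag_name,
            'values': values
        }
    
    @generate
    def body():
        # Opening tag, rows of comma-separated fields, closing tag.
        tag_name = yield tag_start
        values = yield sepBy(sepBy1(regex(r'[^\n<,]+'), string(',')), string('\n'))
        tag_name_end = yield tag_stop
        assert tag_name == tag_name_end
        return {
            'type': 'tag',
            'name': tag_name,
            'values': values
        }
    
    # '+' runs both parsers in sequence and returns their results as a tuple.
    parser = header + body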
    

    Running parser.parse(mystr) returns:

    ({'type': 'tag',
      'name': 'kv',
      'values': [{'key1': '"string"'},
                 {'key2': '1.00005'},
                 {'key3': '[1,2,3]'}]},
     {'type': 'tag',
      'name': 'csv',
      'values': [['date', 'windspeed', 'direction'],
                 ['20190805', '22', 'NNW'],
                 ['20190805', '23', 'NW'],
                 ['20190805', '20', 'NE']]}
    )
    

    You can refine the definition of values in the above code to get the result in the exact form you want.
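One possible refinement, done as a post-processing step on the parsed result rather than inside the parsers, is sketched below. It assumes the result has exactly the shape shown above and is hard-coded here so the snippet runs on its own. The helper names convert_scalar, refine_kv and refine_csv are hypothetical, and using ast.literal_eval to convert raw tokens is an assumption made for this sketch, not something parsec provides.

```python
import ast

# The tuple shown above, as returned by parser.parse(mystr).
parsed = (
    {'type': 'tag', 'name': 'kv',
     'values': [{'key1': '"string"'}, {'key2': '1.00005'}, {'key3': '[1,2,3]'}]},
    {'type': 'tag', 'name': 'csv',
     'values': [['date', 'windspeed', 'direction'],
                ['20190805', '22', 'NNW'],
                ['20190805', '23', 'NW'],
                ['20190805', '20', 'NE']]},
)


def convert_scalar(raw):
    """Convert a raw token to a Python value when it is a valid literal."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal (for example NNW): keep the original string.
        return raw


def refine_kv(values):
    """Convert the value of each single-entry {key: raw} dict."""
    return [{k: convert_scalar(v) for k, v in item.items()} for item in values]


def refine_csv(rows):
    """Treat the first row as the header and map each later row to a dict."""
    header, *records = rows
    return [dict(zip(header, map(convert_scalar, rec))) for rec in records]


kv_tag, csv_tag = parsed
result = [
    {**kv_tag, 'values': refine_kv(kv_tag['values'])},
    {**csv_tag, 'values': refine_csv(csv_tag['values'])},
]
print(result)
```

Keeping the conversion separate leaves the grammar simple; the same functions could instead be applied inside header_kv and body with parsecmap if a single pass is preferred.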