
Recursive parsing with Python's parsec.py library


I have a basic question about parsing using Python's parsec.py library.

I would like to extract a date that appears somewhere inside a text. For example:

Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?

or

Lorem ipsum dolor sit amet.
A number 42 is present here.

But here is a date 11/05/2017. Can you extract this?

In both cases I want the parser to return 11/05/2017.

I only want to use the parsec.py parsing library, not plain regular expressions. parsec's built-in regex function is okay.

I tried something like

from parsec import *

ss = "Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?"

date_parser = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

date = date_parser.parse(ss)

I get ParseError: expected [0-9]{2}/[0-9]{2}/[0-9]{4} at 0:0

Is there a way to ignore the text until the date_parser pattern is reached, without raising an error?


Solution

  • What you want is a parser that skips any unmatched characters and then parses the regex pattern that follows.

    The date pattern can be defined with the regex parser:

    date_pattern = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')
    

    We first define a parser that consumes one arbitrary character (this was due to be added to the library; edit: it has been included in v3.9):

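    # Note: this name shadows Python's built-in any() in the current module.
    def any():
        '''Parse a single arbitrary character.'''
        @Parser
        def any_parser(text, index=0):
            if index < len(text):
                # Consume one character and return it as the parsed value.
                return Value.success(index + 1, text[index])
            else:
                # End of input: nothing left to consume.
                return Value.failure(index, 'an arbitrary char')
        return any_parser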
    

    To express the idea of "skip any characters, then match a pattern", we need a recursive parser, something like

    date_parser = date_pattern ^ (any() >> date_parser)
    

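    But this does not work in Python as written, because the right-hand side refers to date_parser before it has been assigned (a NameError). We therefore put the recursive branch in a @generate parser, whose body only runs at parse time, after date_parser exists: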

    @generate
    def date_with_prefix():
        matched = yield(any() >> date_parser)
        return matched
    
    date_parser = date_pattern ^ date_with_prefix
    

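    (Here the combinator ^ means try_choice: if the left parser fails, the right one is tried from the same position. You can find it in the docs.)

    It then works as expected. Inputs without a date raise a ParseError, and an input containing a date returns it: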

    >>> date_parser.parse("Lorem ipsum dolor sit amet.")
    ---------------------------------------------------------------------------
    ParseError                                Traceback (most recent call last)
    ...
    
    ParseError: expected date_with_prefix at 0:27
    
    >>> date_parser.parse("A number 42 is present here.")
    ---------------------------------------------------------------------------
    ParseError                                Traceback (most recent call last)
    ...
    
    ParseError: expected date_with_prefix at 0:28
    
    >>> date_parser.parse("But here is a date 11/05/2017. Can you extract this?")
    '11/05/2017'
    

    To avoid the exception on invalid input and return None instead, you can make it an optional parser:

    date_parser = optional(date_pattern ^ date_with_prefix)