
Parsing colon delimited data


I have the following text chunk:

string = """
    apples: 20
    oranges: 30
    ripe: yes
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
            farmer bill
                   lives far
    selling: yes
    veggies:
            carrots
            potatoes
    """

I am trying to find a good regex that will let me parse out the key/value pairs. I can grab the single-line pairs with something like:

r'(.+?):\s(.+?)\n'

However, the problem comes when I hit farmers or veggies, whose values span multiple lines.

Using the re flags, I need to do something like:

re.findall(r'(.+?):\s(.+?)\n', string, re.S)

However, I am having a heck of a time grabbing all of the values associated with farmers.

There is a newline after each value, and one or more tabs before each value when the values span multiple lines.

My goal is to end up with something like:

{ 'apples': 20, 'farmers': ['elmer fudd', 'farmer ted'] }

etc.

Thank you in advance for your help.


Solution

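  • Here's a really dumb parser that takes into account your (apparent) indentation rules: after a bare "key:" line, the first value line sets the indentation for that key's values, and anything indented differently (such as "lives in tv") is skipped.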

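    def parse(lines):
        d = {}
        # State for the multi-line key currently being collected, if any.
        lastkey, lastval, lastindent = None, [], None
        for fullline in lines:
            line = fullline.strip()
            if not line:
                pass
            elif ':' not in line:
                # Value line: only meaningful under a bare "key:" line.
                if lastkey is not None:
                    indent = len(fullline) - len(fullline.lstrip())
                    if lastindent is None:
                        # The first value line sets the item indentation.
                        lastindent = indent
                    if lastindent == indent:
                        lastval.append(line)
                    # Anything indented differently (e.g. "lives in tv") is skipped.
            else:
                if lastkey:
                    d[lastkey] = lastval
                    lastkey = None
                key, _, value = line.partition(':')
                if line.endswith(':'):
                    # A bare "key:" starts a multi-line list of values.
                    lastkey, lastval, lastindent = key, [], None
                else:
                    d[key] = value.strip()
        if lastkey:
            d[lastkey] = lastval
        return d

    from pprint import pprint
    pprint(parse(string.splitlines()))


    The output is:

    {'apples': '20',
     'farmers': ['elmer fudd', 'farmer ted', 'farmer bill'],
     'oranges': '30',
     'ripe': 'yes',
     'selling': 'yes',
     'veggies': ['carrots', 'potatoes']}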
    

    I think this is already complicated enough that it would look cleaner as an explicit state machine, but I wanted to write this in terms that any novice could understand.
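
For comparison, here is a rough sketch of what the explicit state machine version might look like. It is one possible reading of the same indentation rules, not a tested replacement: the function name `parse_states` and the state names `"TOP"` and `"LIST"` are made up for illustration, and it decides "this key has a list" by checking for an empty value after the colon rather than with `endswith(':')`.

```python
def parse_states(text):
    """Parse 'key: value' lines and indented value lists into a dict."""
    result = {}
    state = "TOP"  # "TOP": between keys; "LIST": collecting values for `key`
    key, items, item_indent = None, [], None

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        indent = len(raw) - len(raw.lstrip())

        if ":" in line:
            # Any key line closes the list being collected, if there is one.
            if state == "LIST":
                result[key] = items
            key, _, value = line.partition(":")
            value = value.strip()
            if value:
                result[key] = value
                state = "TOP"
            else:
                # Bare "key:" -> start collecting a fresh list of values.
                items, item_indent = [], None
                state = "LIST"
        elif state == "LIST":
            if item_indent is None:
                # The first value line fixes the indentation for this list.
                item_indent = indent
            if indent == item_indent:
                items.append(line)
            # Lines indented differently (e.g. "lives in tv") are ignored.
        # A line without a colon in the "TOP" state is ignored.

    if state == "LIST":
        result[key] = items
    return result
```

Called as `parse_states(string)`, it takes the whole text rather than a list of lines. Like the parser above, it treats any line containing a colon as a key line, so a value that itself contains a colon would be misread.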