YAML list -> Python generator?


I was wondering whether there is an easy way to parse a YAML document consisting of a list of items as a Python generator using PyYAML.

For example, given the file

# foobar.yaml
---
- foo: ["bar", "baz", "bah"]
  something_else: blah
- bar: yet_another_thing

I'd like to be able to do something like

for item in yaml.load_as_generator(open('foobar.yaml')): # does not exist
    print(str(item))

I know there is yaml.load_all, which can achieve similar functionality, but then you need to treat each record as its own document. I'm asking because I have some really big files that I'd like to convert to YAML and then parse with a low memory footprint.

I took a look at the PyYAML Events API but it scared me =)


Solution

  • I can understand that the Events API scares you, and it would only get you so far. First of all, you would need to keep track of depth (because you have your top-level complex sequence items as well as nested ones such as "bar", "baz", etc.). And having cut the low-level sequence events correctly, you would have to feed them into the composer to create nodes (and eventually Python objects), which is not trivial either.

    But since YAML uses indentation, even for scalars spanning multiple lines, you can use a simple line-based parser that recognises where each sequence element starts, and feed each element's lines into the normal load() function one at a time:

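    #!/usr/bin/env python
    
    import ruamel.yaml
    
    # The module-level ruamel.yaml.load() is deprecated in recent releases;
    # a YAML(typ='safe') instance returns plain dicts and lists.
    yaml = ruamel.yaml.YAML(typ='safe')
    
    def list_elements(fp, depth=0):
        buffer = None
        in_header = True  # skip everything up to the '---' document marker
        list_element_match = ' ' * depth + '- '
        for line in fp:
            if line.startswith('---'):
                in_header = False
                continue
            if in_header:
                continue
            if line.startswith(list_element_match):
                if buffer is None:
                    buffer = line
                    continue
                # a new element starts, so load the one collected so far
                yield yaml.load(buffer)[0]
                buffer = line
                continue
            # ignore blank or comment lines that precede the first element
            if buffer is not None:
                buffer += line
        if buffer:
            yield yaml.load(buffer)[0]
    
    
    with open("foobar.yaml") as fp:
        for element in list_elements(fp):
            print(str(element))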
    

    resulting in (on Python 3.7+, where dicts keep insertion order):

    {'foo': ['bar', 'baz', 'bah'], 'something_else': 'blah'}
    {'bar': 'yet_another_thing'}
    

    I used the enhanced version of PyYAML, ruamel.yaml here (of which I am the author), but PyYAML should work in the same way.
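
The line-based approach above can also be split into two steps, which makes the chunking testable without any YAML library. This is only a sketch under stated assumptions: the names `iter_chunks` and `iter_items` are hypothetical, and unlike the code above it does not require a `---` marker but simply skips every line before the first list element. It shares the limitation that an element must start with `- ` at exactly the given indentation.

```python
import io


def iter_chunks(lines, depth=0):
    """Yield the raw text of each sequence element indented by `depth` spaces."""
    marker = " " * depth + "- "
    buffer = None
    for line in lines:
        if line.startswith(marker):
            # A new element starts: emit the previous one, if any.
            if buffer is not None:
                yield buffer
            buffer = line
        elif buffer is not None:
            # Continuation line of the current element.
            buffer += line
        # Lines before the first element ('---', comments) are skipped.
    if buffer is not None:
        yield buffer


def iter_items(lines, load, depth=0):
    """Parse each chunk with `load` and unwrap the one-element list it returns.

    `load` is any callable that turns YAML text into Python objects,
    e.g. yaml.safe_load from PyYAML (assumed to be installed by the caller).
    """
    for chunk in iter_chunks(lines, depth):
        yield load(chunk)[0]


# The example file from the question, held in memory for illustration.
sample = io.StringIO(
    "# foobar.yaml\n"
    "---\n"
    '- foo: ["bar", "baz", "bah"]\n'
    "  something_else: blah\n"
    "- bar: yet_another_thing\n"
)
chunks = list(iter_chunks(sample))
```

With PyYAML available, `iter_items(open("foobar.yaml"), yaml.safe_load)` would then yield one parsed element at a time, so only a single element's text is held in memory at once.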