How to use yaml.load_all with fileinput.input?


Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() for easy streaming of multiple documents from multiple sources?

I'm looking for something like the following (non-working example):

# example.py
import fileinput

import yaml

for doc in yaml.load_all(fileinput.input()):
    print(doc)

Expected output:

$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc

Of course, yaml.load_all expects either a string, bytes, or a file-like object and fileinput.input() is none of those things, so the above example does not work.

Actual output:

$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'
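
The missing method can be confirmed without PyYAML. The following is a minimal check using only the standard library; `pre.yaml` is just a placeholder name, and constructing the `FileInput` object does not open the file:

```python
import fileinput

# Constructing FileInput does not open any file yet, so this runs even if
# "pre.yaml" (a placeholder name) does not exist.
fi = fileinput.FileInput(files=["pre.yaml"])

has_read = hasattr(fi, "read")          # what yaml.load_all looks for
has_readline = hasattr(fi, "readline")  # what FileInput offers instead
is_iterable = hasattr(fi, "__iter__")   # FileInput yields lines when iterated

fi.close()
```

So the object yields lines and supports `readline()`, but has no `read()`, which is the gap an adapter would need to fill.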

You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.

We might rephrase the question as: "Is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings?" However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.

Ideally I'm looking for the minimal adapter that would support something like this:

for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)

Solution

  • Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object that load_all can use. load_all accepts either a string, which would require concatenating the input, or an object that has a read() method.

    Since your minimal_adapter needs to preserve some state, I find it clearest to implement it as an instance of a class with a __call__ method that stores its argument for later use and returns the instance. The class also needs a read() method, because load_all calls it once it has been handed the instance:

    import fileinput
    import ruamel.yaml
    
    
    class MinimalAdapter:
        def __init__(self):
            self._fip = None
            self._buf = None  # storage of read but unused material, maximum one line
    
        def __call__(self, fip):
            self._fip = fip  # store for future use
            self._buf = ""
            return self
    
        def read(self, size):
            if len(self._buf) >= size:
                # enough in buffer from last read, just cut it off and return
                tmp, self._buf = self._buf[:size], self._buf[size:]
                return tmp
            for line in self._fip:
                self._buf += line
                if len(self._buf) >= size:  # enough buffered to satisfy this request
                    break
            else:
                # ran out of lines, return what we have
                tmp, self._buf = self._buf, ''
                return tmp
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
    
    
    minimal_adapter = MinimalAdapter()
    
    for doc in ruamel.yaml.load_all(minimal_adapter(fileinput.input())):
        print(doc)
    

    With this, running your example invocation gives exactly the output that you want.
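
The chunking logic can also be exercised on its own, without YAML or fileinput. Below is a sketch of a variant that wraps any iterable of strings; the name `IterStream` and the default `size=-1` are my own assumptions for illustration, not part of either library:

```python
class IterStream:
    """Hypothetical adapter: exposes read() over any iterable of strings."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._buf = ""  # read but not yet returned text

    def read(self, size=-1):
        if size is None or size < 0:
            # Unbounded read: drain everything that is left.
            data, self._buf = self._buf + "".join(self._lines), ""
            return data
        # Pull lines only until the request can be satisfied.
        while len(self._buf) < size:
            try:
                self._buf += next(self._lines)
            except StopIteration:
                break  # source exhausted, return what we have
        data, self._buf = self._buf[:size], self._buf[size:]
        return data


# Read two small "documents" in 5-character chunks.
stream = IterStream(["--- a\n", "--- b\n"])
chunks = []
while True:
    chunk = stream.read(5)
    if not chunk:
        break
    chunks.append(chunk)
```

Joining `chunks` gives back the original text, and at most one line beyond the requested size is ever buffered. Passing `IterStream(fileinput.input())` to load_all should behave like the class above, assuming the loader only ever calls `read(size)` on its stream.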

    This is probably only more memory efficient for larger files. load_all tries to read 1024-byte blocks at a time (easy to check by putting a print statement in MinimalAdapter.read()), and fileinput does some buffering as well (use strace if you're interested in how it behaves).
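
As an alternative to a print statement, the requested sizes can be recorded by a small wrapper. This is only a sketch: `LoggingReader` and `observed_read_sizes` are hypothetical names, the helper assumes PyYAML is installed, and the block size you observe may differ between library versions:

```python
import io


class LoggingReader:
    """Hypothetical wrapper that records the size of every read() request."""

    def __init__(self, stream):
        self._stream = stream
        self.requested = []  # sizes passed to read(), in call order

    def read(self, size=-1):
        self.requested.append(size)
        return self._stream.read(size)


def observed_read_sizes(text):
    """Parse `text` with PyYAML and return the read sizes it asked for.

    Assumes PyYAML is installed; it is imported lazily so that the
    wrapper itself has no third-party dependency.
    """
    import yaml

    reader = LoggingReader(io.StringIO(text))
    for _ in yaml.safe_load_all(reader):
        pass  # consume every document so the whole stream is read
    return reader.requested


# Stdlib-only check of the wrapper itself.
demo = LoggingReader(io.StringIO("--- hello\n"))
first = demo.read(4)
```

Calling `observed_read_sizes("--- hello\n")` then shows which sizes the loader actually requests on your installation.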


    This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author. It should work for PyYAML as well, since ruamel.yaml is a derived superset of it.