How to use yaml.load_all with fileinput.input?

Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() for easy streaming of multiple documents from multiple sources?

I'm looking for something like the following (non-working example):

# example.py
import fileinput

import yaml

for doc in yaml.load_all(fileinput.input()):
    print(doc)

Expected output:

$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc

Of course, yaml.load_all expects either a string, bytes, or a file-like object and fileinput.input() is none of those things, so the above example does not work.

Actual output:

$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'

You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.

We might rephrase the question as Is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings? However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.

Ideally I'm looking for the minimal adapter that would support something like this:

for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)

Solution

Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object which load_all can use. load_all either takes as an argument a string, but that would require concatenating the input, or it expects the argument to have a read() method.

Since your minimal_adapter needs to preserve some state, I find it clearest/easiest to implement it as an instance of a class that has a __call__ method, and have that method return the instance and store its argument for future use. Implemented that way, the class should also have a read() method, as this will be called after handing the instance to load_all:

import fileinput
import ruamel.yaml


class MinimalAdapter:
    def __init__(self):
        self._fip = None
        self._buf = None  # storage of read but unused material, maximum one line

    def __call__(self, fip):
        self._fip = fip  # store for future use
        self._buf = ""
        return self

    def read(self, size):
        if len(self._buf) >= size:
            # enough in buffer from last read, just cut it off and return
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
        for line in self._fip:
            self._buf += line
            if len(self._buf) > size:
                break
        else:
            # ran out of lines, return what we have
            tmp, self._buf = self._buf, ''
            return tmp
        tmp, self._buf = self._buf[:size], self._buf[size:]
        return tmp


minimal_adapter = MinimalAdapter()

for doc in ruamel.yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)

With this, running your example invocation exactly gives the output that you want.

This is probably only more memory efficient for larger files. The load_all tries to read 1024 byte blocks at a time (easily found out by putting a print statement in MinimalAdapter.read()) and fileinput does some buffering as well (use strace if your interested to find out how it behaves).

_{This was done using ruamel.yaml a YAML 1.2 parser, of which I am the author. This should work for PyYAML, of which ruamel.yaml is a derived superset, as well.}