Without resorting to ''.join, is there a Pythonic way to use PyYAML's yaml.load_all with fileinput.input() for easy streaming of multiple documents from multiple sources?
I'm looking for something like the following (non-working example):
# example.py
import fileinput
import yaml
for doc in yaml.load_all(fileinput.input()):
    print(doc)
Expected output:
$ cat >pre.yaml <<<'--- prefix-doc'
$ cat >post.yaml <<<'--- postfix-doc'
$ python example.py pre.yaml - post.yaml <<<'--- hello'
prefix-doc
hello
postfix-doc
Of course, yaml.load_all expects either a string, bytes, or a file-like object, and fileinput.input() is none of those things, so the above example does not work.
Actual output:
$ python example.py pre.yaml - post.yaml <<<'--- hello'
...
AttributeError: FileInput instance has no attribute 'read'
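A quick check using only the standard library confirms what the traceback says: FileInput supports iteration and readline(), but it has no read() method. The variable names below are only for illustration.

```python
import fileinput

# FileInput is an iterator over lines, not a file-like object.
has_read = hasattr(fileinput.FileInput, "read")
has_readline = hasattr(fileinput.FileInput, "readline")
has_iter = hasattr(fileinput.FileInput, "__iter__")

print(has_read, has_readline, has_iter)
```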
You can make the example work with ''.join, but that's cheating. I'm looking for a way that does not read the entire stream into memory at once.
We might rephrase the question as: is there some way to emulate a string, bytes, or file-like object that proxies to an underlying iterator of strings? However, I doubt that yaml.load_all actually needs the entire file-like interface, so that phrasing would ask for more than is strictly necessary.
Ideally I'm looking for the minimal adapter that would support something like this:
for doc in yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
Your minimal_adapter should take a fileinput.FileInput as a parameter and return an object that load_all can use. load_all accepts either a string, which would require concatenating the input, or an object that has a read() method.
Since your minimal_adapter needs to preserve some state, I find it clearest to implement it as an instance of a class with a __call__ method that stores its argument for later use and returns the instance. The class also needs a read() method, because load_all calls it once it has been handed the instance:
import fileinput
import ruamel.yaml


class MinimalAdapter:
    def __init__(self):
        self._fip = None
        self._buf = None  # storage of read but unused material, maximum one line

    def __call__(self, fip):
        self._fip = fip  # store for future use
        self._buf = ""
        return self

    def read(self, size):
        if len(self._buf) >= size:
            # enough in buffer from last read, just cut it off and return
            tmp, self._buf = self._buf[:size], self._buf[size:]
            return tmp
        for line in self._fip:
            self._buf += line
            if len(self._buf) > size:
                break
        else:
            # ran out of lines, return what we have
            tmp, self._buf = self._buf, ''
            return tmp
        tmp, self._buf = self._buf[:size], self._buf[size:]
        return tmp


minimal_adapter = MinimalAdapter()

for doc in ruamel.yaml.load_all(minimal_adapter(fileinput.input())):
    print(doc)
With this, running your example invocation gives exactly the output you want.
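The same idea can be written as a class that takes any iterable of strings in its constructor. This is only a sketch: the name IterStream is made up for illustration, and the only assumption is that the consumer needs nothing beyond read(size) returning a string.

```python
class IterStream:
    """Hypothetical name: exposes read() on top of any iterable of strings."""

    def __init__(self, lines):
        self._it = iter(lines)
        self._buf = ""  # material pulled from the iterator but not yet returned

    def read(self, size=-1):
        if size is None or size < 0:
            # No limit requested: drain everything that is left.
            data, self._buf = self._buf + "".join(self._it), ""
            return data
        # Pull lines until the request can be satisfied or the input runs out.
        while len(self._buf) < size:
            try:
                self._buf += next(self._it)
            except StopIteration:
                break
        data, self._buf = self._buf[:size], self._buf[size:]
        return data


# Demonstration with plain strings, read in deliberately small chunks.
lines = ["--- a\n", "--- b\n", "--- c\n"]
stream = IterStream(lines)
chunks = []
while True:
    chunk = stream.read(4)
    if not chunk:
        break
    chunks.append(chunk)
print(chunks)
```

Since fileinput.input() is itself an iterable of strings, something like yaml.safe_load_all(IterStream(fileinput.input())) should work the same way as the adapter above, and at most one line beyond the requested size is held in the buffer.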
This is probably only more memory efficient for larger files. load_all tries to read 1024-byte blocks at a time (easily found out by putting a print statement in MinimalAdapter.read()), and fileinput does some buffering as well (use strace if you're interested in finding out how it behaves).
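If you would rather not edit the adapter to add a print statement, a thin wrapper can record the requested sizes instead. This is a sketch: LoggingReader is a hypothetical helper, and the loop below is only a stand-in consumer that asks for 1024 characters at a time. The block size a YAML loader really requests may differ between libraries and versions.

```python
import io


class LoggingReader:
    """Hypothetical helper: records the size of every read() request."""

    def __init__(self, stream):
        self._stream = stream
        self.sizes = []

    def read(self, size=-1):
        self.sizes.append(size)
        return self._stream.read(size)


# Stand-in consumer; hand the wrapper to load_all instead to see
# the block size that the loader actually asks for.
reader = LoggingReader(io.StringIO("--- hello\n" * 300))
while reader.read(1024):
    pass
print(reader.sizes)
```

Wrapping the adapter instance in LoggingReader before passing it to load_all, and printing sizes afterwards, gives the same information as the print statement.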
This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author. It should work for PyYAML as well, since ruamel.yaml is a derived superset of it.