Using PyYAML, with CLoader as the YAML parser, I am attempting to load a YAML file, parse it, and then write it to a separate file. For testing purposes, I am using a very large YAML file, larger than 1 GB.
I am trying to display a progress bar in the command line to show that my Python script is running and to estimate how long it will take.
Here is my current code:
```python
import yaml
import argparse
from tqdm import tqdm
from yaml import CLoader as Loader

def main():
    parser = argparse.ArgumentParser(description='Takes in YAML files and uploads straight to Neo4J database')
    parser.add_argument('-f', '--files', nargs='+', metavar='', required=True,
                        help='<Required> One or more YAML files to upload')
    args = parser.parse_args()

    for file_name in args.files:
        with open(file_name, 'r') as stream:
            print("Reading input file...")
            with open('test2.txt', 'w') as wf:
                print("Writing to output file...")
                try:
                    for data in tqdm(yaml.load(stream, Loader=Loader)):
                        wf.write(data.get('primaryName') + '\n')
                        wf.write('++++++++++\n')
                except yaml.YAMLError as exc:
                    print(exc)

if __name__ == "__main__":
    main()
```
What happens now is that a tqdm progress bar is displayed for the data-writing loop, but not for the yaml.load() call, which is the step that takes the most time. That is, for a long time no progress bar is shown at all, until the YAML file has been fully loaded.
I am hoping to find a solution that lets me wrap a progress bar around a function whose internals I have no access to, in this case yaml.load().
Am I doing something wrong? Any advice will be great and appreciated.
No, there's no way to wrap a progress bar around code that you have no access to.
Also, you can only use the iterable-based interface to tqdm when you're looping over an iterable, which you aren't here. So you have to use the update-based interface:
```python
with tqdm(total=100) as pbar:
    for i in range(10):
        pbar.update(10)
```
The question is, how do you get PyYAML to call that pbar.update?
Ideally, you want to find a place to hook the loading process where you can call pbar.update. If that isn't possible, you'll have to do something ugly (like fork PyYAML and add to its API, or do the same thing at runtime by monkeypatching it), or switch to a different library. But it ought to be possible.
The obvious option is to create your own subclass of PyYAML's Loader. The docs for PyYAML explain the API for this class, so you can override any of the methods there to emit some progress and then call super() to defer to the base class.
But unfortunately none of them look all that promising. Sure, you can get called once per token, once per event, or once per node, but without knowing how many tokens, events, or nodes there are, this doesn't let you show how far into the file you are. If you want an indeterminate progress spinner, that's fine, but if you can get the actual progress, with an estimate of how long there is to go, and so on, that's always better.
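For example, here's a minimal sketch of what the indeterminate version would look like, driving a tqdm counter from PyYAML's event API (using a tiny in-memory document here, not your 1GB file):

```python
import io
import yaml
from tqdm import tqdm

# yaml.parse() yields one event at a time, so tqdm can at least show a
# running event count. With no total given, this is an indeterminate
# counter rather than a percentage bar.
doc = io.StringIO("- a\n- b\n- c\n")
events = [event for event in tqdm(yaml.parse(doc), desc="events", unit="event")]
print(len(events))
```

This at least proves the script is alive, but it can't tell you how far along you are.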
One thing you could do is have your Loader subclass call tell on its stream to figure out how many bytes you've read so far.
I don't have PyYAML on this computer, and the docs are pretty confusing, so you'll probably need to experiment a bit, but it should be something like this:
```python
class ProgressLoader(yaml.CLoader):
    def __init__(self, stream, callback):
        super().__init__(stream)
        # __ because who knows what names the base class is using?
        self.__stream = stream
        self.__pos = 0
        self.__callback = callback

    def get_token(self):
        result = super().get_token()
        pos = self.__stream.tell()
        self.__callback(pos - self.__pos)
        self.__pos = pos
        return result
```
But then I'm not sure how to get PyYAML to pass your callback into the ProgressLoader constructor, so you'd have to do something like this:
```python
import os

with open(file_name, 'r') as stream:
    size = os.stat(stream.fileno()).st_size
    with tqdm(total=size) as progress:
        factory = lambda stream: ProgressLoader(stream, progress.update)
        data = yaml.load(stream, Loader=factory)
```
But since we're going to the file anyway, it's probably easier not to mess around with the confusingly documented loader types and instead just write a file wrapper.
The docs for file objects are pretty dense, but at least they're clear, and the actual work is pretty simple:
```python
import io

class ProgressFileWrapper(io.TextIOBase):
    def __init__(self, file, callback):
        self.file = file
        self.callback = callback

    def read(self, size=-1):
        buf = self.file.read(size)
        if buf:
            self.callback(len(buf))
        return buf

    def readline(self, size=-1):
        buf = self.file.readline(size)
        if buf:
            self.callback(len(buf))
        return buf
```
Now:
```python
import os

with open(file_name, 'r') as stream:
    size = os.stat(stream.fileno()).st_size
    with tqdm(total=size) as progress:
        wrapper = ProgressFileWrapper(stream, progress.update)
        data = yaml.load(wrapper, Loader=Loader)
```
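As a self-contained sanity check of the wrapper idea (restated here with a hypothetical CountingWrapper class and an in-memory stream, and SafeLoader so it runs even without libyaml), you can verify that the bytes reported through the callback add up to the size of the document:

```python
import io
import yaml

# Hypothetical, self-contained restatement of the wrapper idea: count the
# bytes as PyYAML reads them through the wrapper. In real use the callback
# would be tqdm's progress.update.
class CountingWrapper(io.TextIOBase):
    def __init__(self, file, callback):
        self.file = file
        self.callback = callback

    def read(self, size=-1):
        buf = self.file.read(size)
        if buf:
            self.callback(len(buf))
        return buf

    def readline(self, size=-1):
        buf = self.file.readline(size)
        if buf:
            self.callback(len(buf))
        return buf

text = "- primaryName: Alice\n- primaryName: Bob\n"
counted = 0
def bump(n):
    global counted
    counted += n

data = yaml.load(CountingWrapper(io.StringIO(text), bump), Loader=yaml.SafeLoader)
print(data, counted)
```

If the counted total equals the document length, the progress bar will reach exactly 100% when loading finishes.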
Of course this isn't perfect. We're assuming here that all of the work is reading the file from disk, not parsing it. That's probably close enough to true that we get away with it, but if it isn't, you'll have one of those progress bars that zips along to almost 100% and then just uselessly stays there for a long time.[1]
[1] Not only is that horribly annoying, it's also so firmly associated with Windows and other Microsoft products that they could probably sue you for stealing their look and feel. :)