I have previously managed to set up a progress bar with tqdm for a simple for-loop, but am now trying to do something slightly different:
I have an xml-file with several items in it that I am passing to a function to extract specific information, which I then convert to a dataframe. So I have a function that looks roughly like this:
def parse_record(xml):
    ns = {"marc": "http://www.loc.gov/MARC21/slim"}
    # ID:
    id = xml.findall("marc:controlfield[@tag = '001']", namespaces=ns)
    try:
        id = id[0].text
    except IndexError:
        # No controlfield 001 in this record
        id = 'fail'
    # Creator:
    creator = xml.findall("marc:datafield[@tag = '100']/marc:subfield[@code = 'a']",
                          namespaces=ns)
    if creator:
        creator = creator[0].text
    else:
        creator = "fail"
    gathered = {'ID': id, 'Creator': creator}
    return gathered
I then call this function looping through all the single items in the main xml-file and convert it to a dataframe:
result = [parse_record(item) for item in records]
df = pd.DataFrame(result)
df
This all works fine, but I am not sure how to include a progress bar, since the for-loop sits inside a list comprehension rather than standing on its own.
If I add the tqdm bit to the function, it obviously only ever counts to 1, but does this hundreds of times (once per item in the xml-file). I haven't managed to add it to the parsing step.
Any help would be much appreciated!
You pretty much just need to break up your list comprehension. I'll use Enlighten here but you can accomplish the same thing with tqdm.
import enlighten
records: list = ...
manager = enlighten.get_manager()
pbar = manager.counter(total=len(records), desc='Parsing records', unit='records')
result = []
for item in records:
    result.append(parse_record(item))
    pbar.update()
df = pd.DataFrame(result)
If records is a generator rather than a list or other sized sequence, you'll need to wrap it with list() or tuple() first so you can get the length.
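For comparison, here is a rough sketch of what the tqdm version might look like. With tqdm you may not even need to break up the comprehension, since it can wrap the iterable directly. The sample XML, the trimmed-down parse_record, and the lazy_records name are made up for illustration, and the try/except around the import is only there so the snippet still runs where tqdm isn't installed.

```python
import xml.etree.ElementTree as ET

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no bar, same results.
    def tqdm(iterable, **kwargs):
        return iterable

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample data standing in for the real xml-file.
SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:controlfield tag="001">rec1</marc:controlfield>
    <marc:datafield tag="100"><marc:subfield code="a">Doe, Jane</marc:subfield></marc:datafield>
  </marc:record>
  <marc:record>
    <marc:controlfield tag="001">rec2</marc:controlfield>
  </marc:record>
</marc:collection>"""


def parse_record(xml):
    """Trimmed-down stand-in for the function in the question."""
    id_field = xml.find("marc:controlfield[@tag='001']", namespaces=NS)
    creator = xml.find("marc:datafield[@tag='100']/marc:subfield[@code='a']", namespaces=NS)
    return {
        "ID": id_field.text if id_field is not None else "fail",
        "Creator": creator.text if creator is not None else "fail",
    }


root = ET.fromstring(SAMPLE)
# findall returns a list, so tqdm can read its length for the total.
records = root.findall("marc:record", namespaces=NS)

# Wrap the iterable inside the comprehension; the bar advances once per item.
result = [parse_record(item) for item in tqdm(records, desc="Parsing records", unit="records")]

# An iterator has no len(); materialize it first (or pass total= to tqdm).
lazy_records = iter(records)
lazy_result = [parse_record(item) for item in tqdm(list(lazy_records), desc="Parsing records")]
```

From there, pd.DataFrame(result) works exactly as in the question. Keeping the bar outside parse_record is the important part: the function handles a single record, so a bar created inside it can only ever count to 1.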