python, for-loop, parsing, progress-bar

Python: Progress bar in parse function?


I have previously managed to set up a progress bar with tqdm for a simple for-loop, but am now trying to do something slightly different:

I have an XML file with several items in it that I parse with a function to extract specific information, which I then convert to a dataframe. So I have a function that looks roughly like this:

def parse_record(xml):
      
    ns = {"marc":"http://www.loc.gov/MARC21/slim"}

    #ID:      
    id = xml.findall("marc:controlfield[@tag = '001']", namespaces=ns)
    try:
        id = id[0].text
    except IndexError:
        id = 'fail'
        
    #Creator: 
    creator = xml.findall("marc:datafield[@tag = '100']/marc:subfield[@code = 'a']", 
         namespaces=ns)

    if creator:
        creator = creator[0].text
    else:
        creator = "fail"

    gathered = {'ID':id, 'Creator':creator}
    
    return gathered

I then call this function on every item in the main XML file and convert the results to a dataframe:

result = [parse_record(item) for item in records]
df = pd.DataFrame(result)
df

This all works fine, but I am not sure how to add a progress bar to it, since the loop sits inside a list comprehension rather than in a standalone for-loop.

If I add the tqdm call inside the function, it obviously only ever counts to 1, and does so hundreds of times (depending on how many items the XML file includes). I haven't managed to attach it to the parsing loop itself.

Any help would be much appreciated!


Solution

  • You pretty much just need to break up your list comprehension. I'll use Enlighten here but you can accomplish the same thing with tqdm.

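    import enlighten
    import pandas as pd
    
    records: list = ...  # the XML records from the question
    
    # Create a progress bar sized to the number of records
    manager = enlighten.get_manager()
    pbar = manager.counter(total=len(records), desc='Parsing records', unit='records')
    
    # Same work as the list comprehension, with one tick per parsed record
    result = []
    for item in records:
        result.append(parse_record(item))
        pbar.update()
    
    df = pd.DataFrame(result)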
    

    If records is a generator rather than a sequence (so it has no len()), you'll need to wrap it with list() or tuple() first so you can get the length.
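
For the tqdm route mentioned above, here is a minimal sketch. It relies on tqdm wrapping any iterable, so the list comprehension can stay as it is. The sample XML, the `NS` constant, and the simplified `parse_record` are stand-ins invented for illustration, not the asker's real data, and the `ImportError` fallback is only there so the snippet still runs where tqdm is not installed.

```python
import xml.etree.ElementTree as ET

try:
    from tqdm import tqdm
except ImportError:
    # Fallback (assumption for this sketch): no bar, just pass the iterable through.
    def tqdm(iterable, **kwargs):
        return iterable

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical two-record sample standing in for the real XML file.
SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:controlfield tag="001">42</marc:controlfield>
    <marc:datafield tag="100">
      <marc:subfield code="a">Doe, Jane</marc:subfield>
    </marc:datafield>
  </marc:record>
  <marc:record>
    <marc:controlfield tag="001">43</marc:controlfield>
  </marc:record>
</marc:collection>"""


def parse_record(xml):
    """Simplified stand-in for the parse_record function in the question."""
    ident = xml.find("marc:controlfield[@tag='001']", namespaces=NS)
    creator = xml.find(
        "marc:datafield[@tag='100']/marc:subfield[@code='a']", namespaces=NS
    )
    return {
        "ID": ident.text if ident is not None else "fail",
        "Creator": creator.text if creator is not None else "fail",
    }


records = ET.fromstring(SAMPLE).findall("marc:record", namespaces=NS)

# Wrap the iterable, not the function: the bar advances once per record.
result = [
    parse_record(item)
    for item in tqdm(records, desc="Parsing records", unit="records")
]
```

If `records` is a generator, tqdm should still work without a known length. It then shows a running count instead of a percentage, unless you pass `total=` yourself.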