python pandas tqdm

Python Pandas/tqdm show progress for extract


I have a huge pandas Series of 1007 million string rows. I run a regex extract on it (the task is independent of row order, so it can be run in parallel), which takes a few hours and looks like this:

df["big_string_column"].str.extract(r"Name: (.*), Value: (.*)") 

or

df["big_string_column"].str.extractall(r"Name: (.*), Value: (.*)")

This returns a new DataFrame with the two capture groups as columns.

Is there a way to use tqdm or something else to show progress for this? :)

Can this be refactored into dataframe.progress_apply in a way that retains the regex capture groups without a major performance hit (since pd.Series.str.extract optimizes the regex), or is there a completely different approach?


Solution

  • I'm not aware of any progress-reporting functionality in .str.extract. Changing it into an .apply so that you can use .progress_apply might come with a significant performance penalty.

    It's neither pretty nor a one-liner, but if the work being done is row-independent (no grouping), you can always split the data into chunks, do the work independently on each chunk, and concatenate the results at the end. You can then track the progress chunk by chunk with tqdm.

    Something like this:

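    import pandas as pd
    from tqdm import tqdm

    # About 1000 chunks as an example; you may need to adapt this to your problem
    col = df["big_string_column"]  # .str is a Series accessor, so chunk the column
    step = max(1, -(-len(col) // 1000))  # rows per chunk (ceiling division)
    chunks = [col.iloc[i:i + step] for i in range(0, len(col), step)]

    processed = []
    for chunk in tqdm(chunks):
        processed.append(chunk.str.extract(r"Name: (.*), Value: (.*)"))

    # Each chunk keeps its original index, so concat restores the original row order
    out = pd.concat(processed)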
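
For anyone who wants to try the chunking idea on a small input first, here is a minimal self-contained sketch. The helper name `extract_with_progress`, the `n_chunks` parameter and the sample data are made up for illustration and are not part of pandas or tqdm. It advances the bar by rows rather than by chunks, which is just one possible choice.

```python
import pandas as pd
from tqdm import tqdm

PATTERN = r"Name: (.*), Value: (.*)"


def extract_with_progress(series, pattern, n_chunks=1000):
    """Run Series.str.extract chunk by chunk while updating a tqdm bar.

    Hypothetical helper for illustration; n_chunks is an upper bound on
    the number of chunks and should be tuned to the data.
    """
    if series.empty:
        return series.str.extract(pattern)

    # Rows per chunk, rounded up so that no more than n_chunks pieces are made.
    step = max(1, -(-len(series) // n_chunks))

    parts = []
    with tqdm(total=len(series), unit="rows") as bar:
        for start in range(0, len(series), step):
            chunk = series.iloc[start:start + step]
            parts.append(chunk.str.extract(pattern))
            bar.update(len(chunk))

    # Chunks keep their original index labels, so the row order is preserved.
    return pd.concat(parts)


# Tiny made-up sample: ten matching rows plus one row that does not match.
sample = pd.Series(
    [f"Name: item{i}, Value: {i}" for i in range(10)] + ["no match"]
)
result = extract_with_progress(sample, PATTERN, n_chunks=4)
```

Rows that do not match the pattern come back as missing values, the same as with a single `.str.extract` call. Since the chunks are independent, the same split could probably also be handed to a process pool, but whether that pays off for your data would need measuring.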