I have a huge pandas Series of 1007 million string rows. I run a regex extract on it (the task is row-order independent, so it can be run in parallel), which takes a few hours and looks like this:
df["big_string_column"].str.extract(r"Name: (.*), Value: (.*)")
or
df["big_string_column"].str.extractall(r"Name: (.*), Value: (.*)")
This returns a new DataFrame with the two capture groups as columns.
Is there a way to use tqdm or something else to show progress for this? :)
Can this be refactored into dataframe.progress_apply while retaining the capture groups from the regex, without a major performance hit (since pd.Series.str.extract optimizes the regex)? Or is there a completely different approach?
I'm not aware of any progress-report functionality in .str.extract. Changing it into an .apply so you can use .progress_apply may come with a significant performance penalty.
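To illustrate what that apply route would look like (and why it's slower: it compiles less efficiently into per-row Python calls and builds a small Series for every row), here's a minimal sketch. The helper name `extract_groups` and the tiny sample Series are my own illustration, not from the question:

```python
import re
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers .progress_apply on pandas objects

pattern = re.compile(r"Name: (.*), Value: (.*)")

# tiny stand-in for the real 1007-million-row Series
s = pd.Series(["Name: a, Value: 1", "Name: b, Value: 2"])

def extract_groups(text):
    m = pattern.search(text)
    # return both capture groups; a NaN row when there is no match
    return pd.Series(m.groups() if m else (None, None))

# shows a progress bar, but creates one Python-level Series per row
out = s.progress_apply(extract_groups)
```

This keeps the capture groups (each group becomes a column of `out`), but the per-row overhead is exactly the penalty mentioned above.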
It's neither pretty nor a one-liner, but if the work being done is row-independent (no grouping), you can always split the df up into chunks, do the work independently on each chunk, and concatenate the results at the end. You can then track progress per chunk with tqdm.
Something like this:
import numpy as np
import pandas as pd
from tqdm import tqdm

# 1000 sections as an example; adapt to your problem
chunks = np.array_split(df["big_string_column"], 1000)
processed = []
for chunk in tqdm(chunks):
    processed.append(chunk.str.extract(r"Name: (.*), Value: (.*)"))
out = pd.concat(processed)
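Because pd.concat preserves the original index, the result lines up row-for-row with the source frame (assuming a unique index), so you can attach the extracted columns back afterwards. A small sketch, where the named groups `name`/`value` and the sample data are my own illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"big_string_column": ["Name: a, Value: 1", "Name: b, Value: 2"]})

# named capture groups become the column names of the extract result
chunks = np.array_split(df["big_string_column"], 2)
out = pd.concat(c.str.extract(r"Name: (?P<name>.*), Value: (?P<value>.*)") for c in chunks)

# join aligns on the (preserved) index
df = df.join(out)
```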