By default, PySpark (v3.3.2) prints progress logs to my Jupyter notebook. They overwrite the TQDM progress bar that I would like to see so I can track the estimated computation time.
import pandas as pd
from tqdm import tqdm
import pyspark.sql.functions as F
start_date = "2010-01-01"
end_date = "2010-12-01"
weeks = pd.date_range(start_date,end_date,freq='W-MON')
weeks = [str(i)[:10] for i in weeks]
for week in tqdm(weeks):
    df = spark.read.parquet(some_file_path)
    df = df.groupBy([col1, col2]).agg(F.sum(col1)).toPandas()
I am currently seeing this Spark progress bar:
[Stage 701:=======================================> (7 + 2) / 10]
which has replaced the TQDM progress bar I would like to see:
9%|▉ | 12/127 [15:47<2:06:02, 65.76s/it]
Is there a way around this? Thanks in advance.
Capturing the output of each loop iteration with IPython's io.capture_output fixed the issue for me.
import pandas as pd
from tqdm import tqdm
import pyspark.sql.functions as F
from IPython.utils import io
start_date = "2010-01-01"
end_date = "2010-12-01"
weeks = pd.date_range(start_date,end_date,freq='W-MON')
weeks = [str(i)[:10] for i in weeks]
for week in tqdm(weeks, total=len(weeks)):
    # Swallow whatever Spark prints during this iteration so the TQDM bar is left alone
    with io.capture_output() as captured:
        df = spark.read.parquet(some_file_path)
        df = df.groupBy([col1, col2]).agg(F.sum(col1)).toPandas()
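An alternative that may be worth trying is to switch off Spark's console progress bar altogether, so there is nothing to capture. The sketch below is not what I used above. The setting `spark.ui.showConsoleProgress` is a real Spark configuration key, but `QUIET_CONF`, `build_quiet_session` and the app name are hypothetical names made up for illustration. As far as I know the setting is read when the SparkContext starts, so it probably has no effect on a session that is already running.

```python
# Hypothetical helper names; only the configuration key is an actual Spark setting.
QUIET_CONF = {"spark.ui.showConsoleProgress": "false"}


def build_quiet_session(app_name="tqdm-friendly", conf=QUIET_CONF):
    """Create a SparkSession with the console progress bar disabled.

    Assumption: no SparkContext is running yet, otherwise getOrCreate()
    returns the existing session and the setting is likely ignored.
    """
    # Imported lazily so the helper can be defined without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session created this way (`spark = build_quiet_session()`), the original loop from the question should be able to run without the `io.capture_output()` wrapper.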