Tags: python, pyspark, jupyter-lab, tqdm

Is there a way to see TQDM progress bars while using PySpark?


By default, PySpark (v3.3.2) prints its stage progress logs to my Jupyter notebook. They overwrite the TQDM progress bar that I would like to see in order to keep track of the estimated computation time.

import pandas as pd
from tqdm import tqdm
import pyspark.sql.functions as F

start_date = "2010-01-01"
end_date = "2010-12-01"

weeks = pd.date_range(start_date,end_date,freq='W-MON')
weeks = [str(i)[:10] for i in weeks]

for week in tqdm(weeks):
    df = spark.read.parquet(some_file_path)
    df = df.groupBy([col1, col2]).agg(F.sum(col1)).toPandas()

I am currently seeing this progress bar:

[Stage 701:=======================================> (7 + 2) / 10]

which has replaced the progress bar I would like to see:

9%|▉ | 12/127 [15:47<2:06:02, 65.76s/it]

Is there a way around this? Thanks in advance.


Solution

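  • Capturing the output of each loop iteration with IPython's io.capture_output fixed the issue for me. Output produced inside the with block is captured, while tqdm refreshes between iterations, outside the block, so its bar stays visible.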

    import pandas as pd
    from tqdm import tqdm
    import pyspark.sql.functions as F
    from IPython.utils import io
    
    start_date = "2010-01-01"
    end_date = "2010-12-01"
    
    weeks = pd.date_range(start_date,end_date,freq='W-MON')
    weeks = [str(i)[:10] for i in weeks]
    
    for week in tqdm(weeks, total=len(weeks)):
        with io.capture_output() as captured:
            df = spark.read.parquet(some_file_path)
            df = df.groupBy([col1, col2]).agg(F.sum(col1)).toPandas()
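
The same capture-per-iteration idea can be sketched outside IPython with the standard library's contextlib.redirect_stdout and contextlib.redirect_stderr. This is only an illustration under assumptions: noisy_step and run_quietly are hypothetical names, and noisy_step stands in for the Spark job. It captures Python-level writes only, so it may not catch output that another process writes directly to the terminal.

```python
import contextlib
import io
import sys


def noisy_step(week):
    """Hypothetical stand-in for the Spark job: emits progress noise, returns a result."""
    print(f"[Stage for {week}] ======> (7 + 2) / 10", file=sys.stderr)
    print(f"finished {week}")
    return len(week)


def run_quietly(weeks):
    """Run noisy_step for each week, capturing its Python-level stdout and stderr."""
    results, captured = [], []
    for week in weeks:
        out, err = io.StringIO(), io.StringIO()
        # Only writes made inside this block are redirected; anything the
        # loop prints outside it still reaches the real streams.
        with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
            results.append(noisy_step(week))
        captured.append((out.getvalue(), err.getvalue()))
    return results, captured


results, captured = run_quietly(["2010-01-04", "2010-01-11"])
```

Wrapping the list in tqdm(...) in the for statement should leave the bar unaffected, since tqdm writes its updates between iterations rather than inside the redirected block. The captured text is kept in case it is needed for debugging.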