Tags: pandas, apache-spark, databricks, etl, delta-lake

Unreasonable memory consumption for delta lake on pandas


I am working on Databricks using Delta Lake.

I have a dataset that is approximately 1.9 GB in size (parquet format). I am trying to convert this dataset to a Delta table, and I was able to create such a table with Spark without any problems.

This dataset is used in a pretty standard ETL pipeline.

However, my specific use case requires pandas for some processing (due to some legacy code that only works in pandas), so what I wanted to do is convert my Delta table from Spark to pandas using my_pandas_df = my_delta_table.select("*").toPandas().

The code works fine, but I noticed that this command consumes an unreasonable amount of memory.

If I read the original parquet dataset with pandas.read_parquet(), I get this memory footprint on my Databricks cluster: [screenshot: pandas read_parquet memory footprint]

However, if I read the Delta table with Spark and then convert it to pandas, I get a very different memory footprint: [screenshot: Delta table memory footprint]

As you might notice, the memory consumption is MUCH higher for the same dataset. Given that Delta uses parquet files under the hood, I can't understand why there is such a huge difference in memory consumption for identical data.
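One way to investigate this kind of gap is to measure the DataFrame's actual footprint with memory_usage(deep=True). A likely contributor is dtype: toPandas() typically materializes string columns as Python objects, which cost far more per row than fixed-width dtypes. A minimal sketch (the column names and data here are made up for illustration, not from the dataset in question):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "as_int64": np.arange(n, dtype="int64"),     # fixed-width: 8 bytes per row
    "as_object": [str(i) for i in range(n)],     # one Python str object per row
})

# deep=True counts the bytes held by the Python objects themselves,
# not just the 8-byte pointers stored in the object column.
per_column = df.memory_usage(deep=True)
print(per_column)
```

On a typical build, the object column is several times larger than the int64 column even though both hold "the same" numbers, so checking the dtypes of the DataFrame produced by each read path is a good first diagnostic.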

Any ideas?


Solution

  • In the end I solved the problem by relying on the delta-rs library for Python.

    I ran the following code:

    !pip install deltalake # Just to report which library I installed and how
    
    from deltalake import DeltaTable
    
    dt = DeltaTable("path_to_my_table/")
    pandas_table = dt.to_pandas()
    

    This way I was able to convert a Delta table to pandas without runaway memory consumption and in reasonable time (8 million rows x 400 columns loaded in 5 minutes).

    NB: Be extra careful about how the underlying parquet files in the Delta table are partitioned. It strongly affects read performance!