Tags: pandas, apache-spark, databricks, etl, delta-lake

Unreasonable memory consumption for delta lake on pandas


I am working on Databricks using Delta Lake.

I have a dataset that is approximately 1.9 GB in size (parquet format). I am trying to convert this dataset to a Delta table, and I was able to create such a table with Spark without any problems.

This dataset is used in a pretty standard ETL pipeline.

However, my specific use case requires pandas for some processing (due to some legacy code that only works in pandas), so what I wanted to do is convert my Delta table from Spark to pandas using my_pandas_df = my_delta_table.select("*").toPandas().

The code works fine, but I noticed that this command consumes an unreasonable amount of memory.

If I read the original parquet dataset with pandas.read_parquet(), I get this memory footprint on my Databricks cluster: [screenshot: pandas read_parquet memory footprint]

However, if I read the Delta table with Spark and then convert it to pandas, I get a very different memory footprint: [screenshot: Delta table memory footprint]

As you might notice, the memory consumption is MUCH higher for the same dataset. Given that Delta uses parquet files under the hood, I can't understand why there is such a huge difference in memory consumption for identical data.
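One way to investigate this kind of gap is to measure the DataFrame's actual footprint with memory_usage(deep=True). A likely contributor is dtype: toPandas() typically materializes string columns as Python objects, which cost far more per row than fixed-width dtypes. A minimal sketch (the column names and data here are made up for illustration, not from the dataset in question):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "as_int64": np.arange(n, dtype="int64"),     # fixed-width: 8 bytes per row
    "as_object": [str(i) for i in range(n)],     # one Python str object per row
})

# deep=True counts the bytes held by the Python objects themselves,
# not just the 8-byte pointers stored in the object column.
per_column = df.memory_usage(deep=True)
print(per_column)
```

On a typical build, the object column is several times larger than the int64 column even though both hold "the same" numbers, so checking the dtypes of the DataFrame produced by each read path is a good first diagnostic.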

Any ideas?


Solution

  • In the end I solved the problem by relying on the delta-rs library for Python.

    I ran the following code:

    !pip install deltalake # Just to report which library I installed and how
    
    from deltalake import DeltaTable
    
    dt = DeltaTable("path_to_my_table/")
    pandas_table = dt.to_pandas()
    

    This way I was able to convert a Delta table to pandas without runaway memory consumption and in reasonable time (8 million rows x 400 columns loaded in 5 minutes).

    NB: Be extra careful about how the underlying parquet files in the Delta table are partitioned. It strongly affects read performance!