I am working on Databricks using Delta Lake.
I have a dataset that is approximately 1.9 GB in size (Parquet format). I am trying to convert this dataset to a Delta table, and I was able to create the table with Spark without any problems.
This dataset is used in a pretty standard ETL pipeline.
However, in my specific use case some processing requires pandas (due to legacy code that only works in pandas), so what I wanted to do is convert my Delta table from Spark to pandas using my_pandas_df = my_delta_table.select("*").toPandas().
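For context, here is a minimal sketch of the conversion path in question (the path is a placeholder, not my actual one, and on Databricks the Spark session is already available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` already exists

# Load the Delta table (placeholder path) and pull everything to the driver as a pandas DataFrame
my_delta_table = spark.read.format("delta").load("/mnt/some/path/my_delta_table")
my_pandas_df = my_delta_table.select("*").toPandas()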
The code works, but I noticed an unreasonably high memory consumption caused by this command.
If I read the original Parquet dataset with pandas.read_parquet(), I see this memory footprint on my Databricks cluster:
[cluster memory usage screenshot]
However, if I read the Delta table with Spark and then convert it to pandas, I get a very different memory footprint:
[cluster memory usage screenshot]
As you can see, the memory consumption is MUCH higher for the exact same dataset. Considering that Delta uses Parquet files under the hood, I can't understand why there is such a huge difference in memory consumption for the same data.
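To make the comparison concrete, this is the kind of check one could run (the memory_usage-based measurement and the path are illustrative assumptions; the actual figures above come from the cluster's memory metrics):

import pandas as pd

# Direct pandas read of the original Parquet dataset (placeholder path)
pdf_direct = pd.read_parquet("/dbfs/mnt/some/path/my_dataset/")
print("direct read:", pdf_direct.memory_usage(deep=True).sum() / 1e9, "GB")

# DataFrame obtained via Spark + toPandas() in the sketch above
print("via toPandas():", my_pandas_df.memory_usage(deep=True).sum() / 1e9, "GB")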
Any ideas?
In the end I solved the problem by relying on the delta-rs library for Python.
I ran the following code:
!pip install deltalake  # Just to report which library I installed and how

from deltalake import DeltaTable

# Point delta-rs at the root directory of the Delta table and load it straight into pandas
dt = DeltaTable("path_to_my_table/")
pandas_table = dt.to_pandas()
This way I was able to convert the Delta table to pandas without the excessive memory consumption and in a reasonable time (8 million rows x 400 columns loaded in about 5 minutes).
NB: Be extra careful about the partitioning of the underlying Parquet files in your Delta table. It has a big impact on read performance!
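If your table is partitioned, you can also push a partition filter (and a column selection) down into delta-rs so it only reads the files it needs. A minimal sketch, assuming a hypothetical partition column event_date and hypothetical columns col_a and col_b (check that your deltalake version supports these arguments):

from deltalake import DeltaTable

dt = DeltaTable("path_to_my_table/")

# Read only one partition and only the columns that are actually needed
pandas_subset = dt.to_pandas(
    partitions=[("event_date", "=", "2023-01-01")],
    columns=["col_a", "col_b"],
)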