pyspark · azure-databricks · apache-iceberg

Is there a way to write a PySpark DataFrame in Iceberg format outside of the Hive metastore?


I am working on Azure Databricks and I need to save a PySpark DataFrame to a mount point which is not configured as a Hive metastore. This location is then synced to another cloud provider.

I have tried several approaches, but they keep failing for me.

dataframe.write.format("iceberg").option("path","/mnt/<path>/<table-name>").save()

This gives me an error: Cannot connect to Hive metastore. Is there a way to force it not to use the Hive metastore?

FYI, the following approach did save to the same mount point, but it keeps the data in plain Parquet, not Iceberg.

dataframe.write.mode("overwrite").format("iceberg").parquet(path="/mnt/<path>/<table-name>")

I am new to data formats, so I am not entirely clear on the difference between Parquet and Iceberg.


Solution

  • Let's first understand the difference between Parquet and Iceberg.

    Parquet: is a file format, like txt, csv, or ORC. Unlike csv, Parquet is not human readable, because it is optimized for big data and can only be read by a Parquet reader. It is considered one of the best formats for storing data in a columnar layout, i.e. organizing data by column rather than by row as csv files or traditional OLTP databases like Oracle and SQL Server do. In reality it stores data in a hybrid layout: the data is first partitioned horizontally into row groups, and within each row group the values of each column are stored together in column chunks and pages. Parquet also stores metadata for each row group and column chunk, such as min and max values, which execution engines can use to filter out data they do not need to read. The data is highly compressed and can be processed in parallel by Spark and other execution engines. More about Parquet here
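    For example, here is a minimal sketch (assuming the pyarrow package is available; the file path is illustrative, Spark actually writes files named like part-00000-<uuid>.snappy.parquet under the directory) that prints the row-group structure and the per-column min/max statistics described above:

    import pyarrow.parquet as pq

    # Illustrative path: a single Parquet file written by Spark under the mount point,
    # accessed through the local /dbfs mirror of the mount.
    pf = pq.ParquetFile("/dbfs/mnt/<path>/<table-name>/part-00000.snappy.parquet")
    meta = pf.metadata
    print("row groups:", meta.num_row_groups, "columns:", meta.num_columns)

    # Per-row-group, per-column-chunk metadata (min/max) that engines use to skip data
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        if col.statistics is not None:
            print(col.path_in_schema, col.statistics.min, col.statistics.max)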

    Iceberg: unlike Parquet, Iceberg is a table format, similar in spirit to Hive, i.e. it does not actually store any data; it stores metadata about the actual data. The data itself can be in various file formats such as Parquet or ORC. This metadata, kept in essentially three kinds of files (metadata.json, manifest list, and manifest files), provides an open table format that engines such as Spark, Trino, Flink, and Dremio can use to read the actual data stored in a data lake/lakehouse in any of the file formats Iceberg supports. Read more about Iceberg here
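    To make the file layout concrete, here is a small sketch (the table location is an assumption matching the catalog configuration shown at the end of this answer, not something from the question) that simply lists what an Iceberg table directory typically contains:

    import os

    table_dir = "/dbfs/mnt/<path>/db/table_name"   # assumed Iceberg table location
    for root, _, files in os.walk(table_dir):
        for name in files:
            print(os.path.join(root, name))

    # Typical contents:
    #   metadata/vN.metadata.json  - table schema, partition spec, current snapshot pointer
    #   metadata/snap-*.avro       - manifest lists, one per snapshot
    #   metadata/*-m0.avro         - manifest files pointing at data files with column stats
    #   data/*.parquet             - the actual data files (Parquet by default)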

    Now, coming to your original question: you need to configure Iceberg (not sure exactly how this is done on Azure, as I have always used HDFS and Spark on prem), which is basically a set of jars plus a few Spark settings, and then provide the location where the files should be saved, which will be written by the execution engine, in your case Databricks.
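    As a starting point, here is a hedged sketch of how this is typically wired up in PySpark with an Iceberg "hadoop" catalog, which keeps all table metadata on the file system and does not need a Hive metastore. The catalog name (lake), the namespace and table name (db.table_name), and the iceberg-spark-runtime version are assumptions you will have to adapt; on Databricks the jar and the spark.sql.* settings are normally set in the cluster configuration rather than in code:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Iceberg runtime jar; pick the build matching your Spark/Scala version
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        # File-system ("hadoop") catalog: metadata lives next to the data, no Hive metastore
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "/mnt/<path>")   # your mount point
        .getOrCreate()
    )

    # Write the dataframe as an Iceberg table under /mnt/<path>/db/table_name
    dataframe.writeTo("lake.db.table_name").using("iceberg").createOrReplace()

    # Read it back through the same catalog; any Iceberg-aware engine pointed at the
    # same warehouse path can do the same.
    spark.table("lake.db.table_name").show()

    Once this writes successfully, the directory under /mnt/<path> will contain the metadata and data files described above, and the synced copy on the other cloud can be read by any engine that understands Iceberg's hadoop (file-system) tables.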