I read a Parquet file with:
df = spark.read.parquet(file_name)
And get the columns with:
df.columns
This returns a list of column names: ['col1', 'col2', 'col3']
I read that the Parquet format is able to store some metadata in the file.
Is there a way to store and read extra metadata, for example, to attach a human-readable description of each column?
Thanks.
As of Spark 3, Spark automatically reads and writes per-column metadata in Parquet files, and since Spark 3.3 you can set it conveniently with DataFrame.withMetadata.
Here is a minimal PySpark example demonstrating it. (The commented lines show the output printed by the program.)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
user_df = spark.sql("SELECT 'John' as first_name, 'Doe' as last_name")
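# Attach a "comment" entry to each column's schema metadata
# (withMetadata was added in PySpark 3.3)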
user_df = user_df.withMetadata("first_name", {"comment": "The user's first name"})
user_df = user_df.withMetadata("last_name", {"comment": "The user's last name"})
for field in user_df.schema.fields:
    print(field.name, field.metadata)
# first_name {'comment': "The user's first name"}
# last_name {'comment': "The user's last name"}
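# Writing to Parquet persists the column metadata in the file footer,
# so it survives the round trip below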
user_df.write.mode("overwrite").parquet("user")
user_df_2 = spark.read.parquet("user")
for field in user_df_2.schema.fields:
    print(field.name, field.metadata)
# first_name {'comment': "The user's first name"}
# last_name {'comment': "The user's last name"}
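If you want to see where this metadata actually lives on disk, you can inspect the Parquet footer directly. Below is a small sketch using pyarrow (assuming it is installed; the glob pattern is illustrative, since Spark writes the output as a directory of part files). Spark embeds its full schema, including the column comments, as JSON under a well-known key in the footer's key/value metadata:
import glob
import pyarrow.parquet as pq

# Spark writes "user" as a directory of part files; pick the first one
part_file = glob.glob("user/part-*.parquet")[0]

# Spark stores its schema JSON (including the "comment" metadata) under
# this key in the Parquet footer's key/value metadata
footer_metadata = pq.read_schema(part_file).metadata
print(footer_metadata[b"org.apache.spark.sql.parquet.row.metadata"])
# prints Spark's schema JSON, in which each field carries its "comment" metadata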