I read a Parquet file with:
df = spark.read.parquet(file_name)
And get the columns with:
df.columns
This returns a list of column names: ['col1', 'col2', 'col3']
I read that the Parquet format is able to store some metadata in the file.
Is there a way to store and read extra metadata, for example, to attach a human-readable description of each column?
Thanks.
As of Spark 3, Spark automatically reads and writes per-column metadata in Parquet files, and since Spark 3.3 you can set it conveniently with DataFrame.withMetadata.
Here is a minimal PySpark example demonstrating it. (The commented lines show the output printed by the program.)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
user_df = spark.sql("SELECT 'John' as first_name, 'Doe' as last_name")
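# Attach a "comment" entry to each column's schema metadata
# (withMetadata was added in PySpark 3.3)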
user_df = user_df.withMetadata("first_name", {"comment": "The user's first name"})
user_df = user_df.withMetadata("last_name", {"comment": "The user's last name"})
for field in user_df.schema.fields:
    print(field.name, field.metadata)
# first_name {'comment': "The user's first name"}
# last_name {'comment': "The user's last name"}
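# Writing to Parquet persists the column metadata in the file footer,
# so it survives the round trip below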
user_df.write.mode("overwrite").parquet("user")
user_df_2 = spark.read.parquet("user")
for field in user_df_2.schema.fields:
    print(field.name, field.metadata)
# first_name {'comment': "The user's first name"}
# last_name {'comment': "The user's last name"}
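If you want to see where this metadata actually lives on disk, you can inspect the Parquet footer directly. Below is a small sketch using pyarrow (assuming it is installed; the glob pattern is illustrative, since Spark writes the output as a directory of part files). Spark embeds its full schema, including the column comments, as JSON under a well-known key in the footer's key/value metadata:
import glob
import pyarrow.parquet as pq

# Spark writes "user" as a directory of part files; pick the first one
part_file = glob.glob("user/part-*.parquet")[0]

# Spark stores its schema JSON (including the "comment" metadata) under
# this key in the Parquet footer's key/value metadata
footer_metadata = pq.read_schema(part_file).metadata
print(footer_metadata[b"org.apache.spark.sql.parquet.row.metadata"])
# prints Spark's schema JSON, in which each field carries its "comment" metadata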