Tags: pyspark, hdfs, schema, parquet

How do I create a metadata file in HDFS when writing a Parquet file as output from a DataFrame in PySpark?


I have a Spark transformation program that reads two Parquet files and produces one final DataFrame, which is then written as a Parquet file to another directory in HDFS.
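For context, a minimal sketch of that kind of pipeline; the paths, the column name, and the join are hypothetical stand-ins for the actual transformation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform").getOrCreate()

# Hypothetical input paths for the two source Parquet files.
df1 = spark.read.parquet("hdfs:///data/input/source_a.parquet")
df2 = spark.read.parquet("hdfs:///data/input/source_b.parquet")

# A join on a hypothetical "id" column stands in for the real transformation.
final_df = df1.join(df2, on="id", how="inner")

# The final DataFrame is written as Parquet to another HDFS directory.
final_df.write.mode("overwrite").parquet("hdfs:///data/output/final.parquet")
```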

Is there a way to also write a metadata/schema file for that Parquet output into the same HDFS directory?

We need this metadata/schema file for downstream processing.


Solution

  • Assuming the consumer of the metadata file is not also a consumer of the Parquet file (in which case a separate metadata file would be redundant, since the schema is already embedded in the Parquet format), you can use the schema property of the DataFrame and write its serialized form to a file as a string, as shown in the sketch below.

    Note that you cannot write this metadata file to the same path as the Parquet data, as you will get an error when you try to read the Parquet directory back; you can, however, write it to the parent directory.
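A minimal sketch of that approach, assuming an HDFS-backed cluster and hypothetical paths. `df.schema.json()` serializes the schema to a JSON string, which is then written out through the JVM Hadoop FileSystem API (note that `spark.sparkContext._jvm` and `_jsc` are Spark internals, not public PySpark API):

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("write-schema-file").getOrCreate()

# Stand-in for the final DataFrame produced by the transformation.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Hypothetical layout: Parquet data in a subdirectory, schema file
# beside it in the parent directory (not inside the Parquet path).
parquet_path = "hdfs:///data/output/final.parquet"
schema_path = "hdfs:///data/output/final.schema.json"

df.write.mode("overwrite").parquet(parquet_path)

# Serialize the schema to a JSON string.
schema_json = df.schema.json()

# Write the string to HDFS through the JVM Hadoop FileSystem API.
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
out = fs.create(jvm.org.apache.hadoop.fs.Path(schema_path), True)  # True = overwrite
try:
    out.write(bytearray(schema_json, "utf-8"))
finally:
    out.close()

# A consumer can later reconstruct the schema from the JSON:
restored = StructType.fromJson(json.loads(schema_json))
assert restored == df.schema
```

Writing through the Hadoop FileSystem API produces a single plain file, whereas writing the string via a one-row DataFrame with `df.write.text(...)` would create a directory of part files; a consumer can also simply treat the JSON file as human-readable documentation of the schema.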