
Documentation for Spark options


This one is probably easy to answer, but for the life of me I can't seem to find it.

Can someone please point me to documentation for the various key-value pair options that you can use with Spark?

Example of such an option (in pyspark):

some_spark_table.write.format("parquet").option("parquet.block.size", 1234).save("/path/to/output")

So if I'm interested in what the unit is for the parquet.block.size option, where do I find that?

I found this link, which helpfully states: "To find more detailed information about the extra ORC/Parquet options, visit the official Apache ORC/Parquet websites." But I still can't find it.


Solution

  • As the doc says, you can visit the official Apache Parquet website. I think by "official website" they mean the Parquet git repo :)

    Citing from there:

    Property: parquet.block.size
    Description: The block size in bytes. This property depends on the file system:

    • If the file system (FS) in use supports blocks, as HDFS does, the block size will be the maximum of the FS's default block size and this property, and the row group size will equal this property:

      • block_size = max(default_fs_block_size, parquet.block.size)
      • row_group_size = parquet.block.size
    • If the file system used doesn't support blocks, then this property will define the row group size.

    Note that larger row group sizes improve IO when reading but consume more memory when writing.
    Default value: 134217728 (128 MB)
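
    To see the option in action end to end, here is a minimal PySpark sketch; the app name, toy data, and output path are placeholders of mine, not from the question:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("parquet-block-size-demo").getOrCreate()
        df = spark.range(1_000_000)  # toy data just for illustration

        # parquet.block.size is given in bytes; 134217728 bytes = 128 MB,
        # i.e. the Parquet default quoted above.
        (df.write
            .format("parquet")
            .option("parquet.block.size", 134217728)
            .mode("overwrite")
            .save("/tmp/parquet_block_size_demo"))

    On a block-based FS such as HDFS, the effective file block size would then be max(default_fs_block_size, 134217728), and each row group would be 134217728 bytes.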

    Unlike Parquet, Spark's own configuration settings are pretty well documented (the ones they want you to know about) on its website, as pointed out in another answer.
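
    For contrast, a quick sketch of that split (reusing the spark and df placeholders from above): Spark's own settings go through spark.conf and are listed in the Spark docs, while writer options like parquet.block.size are passed straight through to the Parquet library and documented there instead:

        # Spark's own documented setting (see the Spark SQL docs):
        spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

        # Pass-through Parquet option, documented in the Parquet repo instead:
        df.write.option("parquet.block.size", 134217728).parquet("/tmp/out")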