
Spark 2.3+ use of parquet.enable.dictionary?


I am looking for documentation on how parquet.enable.dictionary is meant to be used in Spark (2.3.1, the latest at the time of writing). It can be set to "true" or "false" when creating a SparkSession.

I googled for any documentation on this feature and found nothing, or at least nothing recent.

Specifically, these are my questions:

Is parquet.enable.dictionary = true or = false by default in Spark 2.3.1?

Is this a feature to enable (set to true) before I write Parquet files, so that Spark's Parquet library computes and writes the dictionary information to disk?

Is this setting ignored when Spark reads Parquet files, or do I still need to set it to true for reading Parquet as well as writing?

When should I use this feature (set it to true)? What are the pros and cons?

While googling for parquet.enable.dictionary, I also saw references to spark.hadoop.parquet.enable.dictionary. Is this related? Which should I use?

Are there any other Spark + Parquet settings I need to be aware of?

Many thanks!


Solution

  • These are the Spark Parquet configs that default to false (a sketch of overriding one follows the list):

    spark.sql.parquet.mergeSchema
    spark.sql.parquet.respectSummaryFiles
    spark.sql.parquet.binaryAsString
    spark.sql.parquet.int96TimestampConversion
    spark.sql.parquet.int64AsTimestampMillis
    spark.sql.parquet.writeLegacyFormat
    spark.sql.parquet.recordLevelFilter.enabled
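
    Any of these can be overridden when building the SparkSession, or changed at runtime via spark.conf.set. A minimal sketch, where the app name and the chosen settings are purely illustrative:

    import org.apache.spark.sql.SparkSession

    // Sketch: flip one of the false-by-default settings at session build time.
    val spark = SparkSession.builder()
      .appName("parquet-conf-demo") // illustrative name
      .config("spark.sql.parquet.mergeSchema", "true")
      .getOrCreate()

    // The same settings can also be changed and read back at runtime:
    spark.conf.set("spark.sql.parquet.respectSummaryFiles", "true")
    println(spark.conf.get("spark.sql.parquet.mergeSchema")) // prints "true"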
    

    These are set to true by default (a pushdown example follows the list):

    spark.sql.parquet.int96AsTimestamp
    spark.sql.parquet.filterPushdown
    spark.sql.parquet.filterPushdown.date
    spark.sql.parquet.filterPushdown.timestamp
    spark.sql.parquet.filterPushdown.decimal
    spark.sql.parquet.filterPushdown.string.startsWith
    spark.sql.parquet.enableVectorizedReader
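
    One of these true-by-default settings, spark.sql.parquet.filterPushdown, can be observed directly in the physical plan: pushed predicates appear under PushedFilters. A sketch, assuming spark is an existing SparkSession and the path and column name are illustrative:

    val df = spark.read.parquet("/tmp/events.parquet")

    // With spark.sql.parquet.filterPushdown=true (the default), explain()
    // lists the predicate under PushedFilters in the FileScan node,
    // e.g. PushedFilters: [IsNotNull(id), GreaterThan(id,5)].
    df.filter("id > 5").explain()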
    

    These properties take a value; they are listed here with their defaults (an example of setting one follows the list):

    spark.sql.parquet.outputTimestampType = INT96
    spark.sql.parquet.compression.codec = snappy
    spark.sql.parquet.pushdown.inFilterThreshold = 10
    spark.sql.parquet.output.committer.class = org.apache.parquet.hadoop.ParquetOutputCommitter
    spark.sql.parquet.columnarReaderBatchSize = 4096
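
    For example, to switch the compression codec from the default snappy to gzip (df and the output paths are illustrative):

    // Session-wide, via the config listed above:
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
    df.write.parquet("/tmp/events_gzip.parquet")

    // Or per write, via the DataFrameWriter's "compression" option:
    df.write.option("compression", "gzip").parquet("/tmp/events_gzip2.parquet")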
    

    Regarding parquet.enable.dictionary, Spark does not yet expose a native spark.sql.* option for it, but it can be set through the SQLContext:

    sqlContext.setConf("parquet.enable.dictionary", "false")
    

    The default value of this property is true in Parquet itself, so it is effectively true whenever the Parquet code is invoked from Spark.
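
    On a SparkSession, the equivalent call is spark.conf.set("parquet.enable.dictionary", "false"). If you want dictionary encoding off only for a particular write, one sketch is to pass the raw Parquet property as a write option; this relies on Spark forwarding unrecognized options into the Hadoop configuration used by the Parquet writer, so verify it on your version (df and the path are illustrative):

    // Sketch: disable dictionary encoding for a single write by forwarding
    // the raw Parquet property (assumes Spark passes it through to Parquet).
    df.write
      .option("parquet.enable.dictionary", "false")
      .parquet("/tmp/events_nodict.parquet")

    You can check the effect by inspecting the file's column chunk encodings (for example with parquet-tools meta); dictionary-encoded columns show encodings such as PLAIN_DICTIONARY.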