I am looking for documentation on how parquet.enable.dictionary should be used in Spark (the latest version, 2.3.1). It can be set to "true" or "false" when creating a SparkSession.
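For reference, this is roughly how I create the session; the app name and master are just placeholders for my real setup:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-dictionary-test")            // placeholder app name
      .master("local[*]")                            // placeholder master
      .config("parquet.enable.dictionary", "false")  // the setting in question
      .getOrCreate()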
I googled for any documentation on this feature and found nothing, or at least nothing recent.
Specifically these are my questions:
Is parquet.enable.dictionary set to true or to false by default in Spark 2.3.1?
Is this a feature to enable (set to true) before I write to Parquet files, so that the Parquet library used by Spark computes and writes the dictionary information to disk?
Is this setting ignored when Spark reads Parquet files, or do I still need to set it to true for reading Parquet as well as for writing?
When should I use this feature (set to true)? What are the pros and cons?
I also see references to spark.hadoop.parquet.enable.dictionary when I google for parquet.enable.dictionary. Are the two related? Which one should I use?
Are there any other Spark + Parquet settings I need to be aware of?
Many thanks!
These are the Spark Parquet configs that are set to false by default:
spark.sql.parquet.mergeSchema
spark.sql.parquet.respectSummaryFiles
spark.sql.parquet.binaryAsString
spark.sql.parquet.int96TimestampConversion
spark.sql.parquet.int64AsTimestampMillis
spark.sql.parquet.writeLegacyFormat
spark.sql.parquet.recordLevelFilter.enabled
And these are set to true by default (see the snippet after this list for a way to verify these defaults):
spark.sql.parquet.int96AsTimestamp
spark.sql.parquet.filterPushdown
spark.sql.parquet.filterPushdown.date
spark.sql.parquet.filterPushdown.timestamp
spark.sql.parquet.filterPushdown.decimal
spark.sql.parquet.filterPushdown.string.startsWith
spark.sql.parquet.enableVectorizedReader
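If you want to confirm the defaults on your own build, SET -v lists every SQL conf together with its current value and description. A quick sketch for spark-shell:

    // List all Parquet-related SQL confs with their values and descriptions.
    // "SET -v" returns the columns key, value, and meaning.
    spark.sql("SET -v")
      .filter("key LIKE '%parquet%'")
      .show(100, truncate = false)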
These properties take a value; here they are listed with their defaults (a sketch showing how to set them follows this list):
spark.sql.parquet.outputTimestampType = INT96
spark.sql.parquet.compression.codec = snappy
spark.sql.parquet.pushdown.inFilterThreshold = 10
spark.sql.parquet.output.committer.class = org.apache.parquet.hadoop.ParquetOutputCommitter
spark.sql.parquet.columnarReaderBatchSize = 4096
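These can be changed at the session level, and the compression codec can also be overridden per write. A minimal sketch, with a stand-in DataFrame and a placeholder output path:

    // A stand-in DataFrame; replace with your real data.
    val df = spark.range(10).toDF("id")

    // Session level: applies to every Parquet write in this session.
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

    // Per write: the data source "compression" option overrides the session codec.
    df.write
      .option("compression", "snappy")
      .parquet("/tmp/parquet_compression_demo")  // placeholder path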
Regarding parquet.enable.dictionary, Spark does not yet expose it as a native spark.sql.* setting, but it can be set through the sqlContext:
sqlContext.setConf("parquet.enable.dictionary", "false")
The default value of this property is true in Parquet, so it should also be true when the Parquet code is invoked from Spark.
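Pulling this together, here is a sketch of three ways to hand the property to the Parquet writer; the spark.hadoop. prefix you found is option 1. Option 3 rests on my assumption that Spark forwards untyped data source options into the Hadoop conf, so verify that on 2.3.1 before relying on it:

    import org.apache.spark.sql.SparkSession

    // 1) At session creation: the spark.hadoop. prefix copies the entry into
    //    the Hadoop configuration that the Parquet writer reads.
    val spark = SparkSession.builder()
      .config("spark.hadoop.parquet.enable.dictionary", "false")
      .getOrCreate()

    // 2) At runtime: SQL confs (even keys Spark does not know about) are merged
    //    into the per-query Hadoop conf; this is why the setConf call above works.
    spark.conf.set("parquet.enable.dictionary", "false")

    // 3) Per write: data source options are also forwarded to the Hadoop conf
    //    (my assumption; verify on your version).
    val df = spark.range(10).toDF("id")  // stand-in data
    df.write
      .option("parquet.enable.dictionary", "false")
      .parquet("/tmp/no_dictionary_demo")  // placeholder path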