
Spark 2.3+ use of parquet.enable.dictionary?


I am looking for documentation on how parquet.enable.dictionary is meant to be used in Spark (2.3.1, the latest at the time of writing). It can be set to "true" or "false" when creating a SparkSession.

I googled for any documentation on this feature and found nothing, or at least nothing recent.

Specifically, these are my questions:

Is parquet.enable.dictionary = true or = false by default in Spark 2.3.1?

Is this a feature to enable (set to true) before I write Parquet files, so that Spark's Parquet library computes and writes the dictionary information to disk?

Is this setting ignored when Spark reads Parquet files, or do I still need to set it to true for reading Parquet as well as writing?

When should I use this feature (set it to true)? What are the pros and cons?

While googling for parquet.enable.dictionary, I also saw references to spark.hadoop.parquet.enable.dictionary. Is this related? Which should I use?

Are there any other Spark + Parquet settings I need to be aware of?

Many thanks!


Solution

  • These are the Spark Parquet configs that default to false (a sketch of overriding one follows the list):

    spark.sql.parquet.mergeSchema
    spark.sql.parquet.respectSummaryFiles
    spark.sql.parquet.binaryAsString
    spark.sql.parquet.int96TimestampConversion
    spark.sql.parquet.int64AsTimestampMillis
    spark.sql.parquet.writeLegacyFormat
    spark.sql.parquet.recordLevelFilter.enabled
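
    Any of these can be overridden when building the SparkSession, or changed at runtime via spark.conf.set. A minimal sketch, where the app name and the chosen settings are purely illustrative:

    import org.apache.spark.sql.SparkSession

    // Sketch: flip one of the false-by-default settings at session build time.
    val spark = SparkSession.builder()
      .appName("parquet-conf-demo") // illustrative name
      .config("spark.sql.parquet.mergeSchema", "true")
      .getOrCreate()

    // The same settings can also be changed and read back at runtime:
    spark.conf.set("spark.sql.parquet.respectSummaryFiles", "true")
    println(spark.conf.get("spark.sql.parquet.mergeSchema")) // prints "true"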
    

    These are set to true by default (a pushdown example follows the list):

    spark.sql.parquet.int96AsTimestamp
    spark.sql.parquet.filterPushdown
    spark.sql.parquet.filterPushdown.date
    spark.sql.parquet.filterPushdown.timestamp
    spark.sql.parquet.filterPushdown.decimal
    spark.sql.parquet.filterPushdown.string.startsWith
    spark.sql.parquet.enableVectorizedReader
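
    One of these true-by-default settings, spark.sql.parquet.filterPushdown, can be observed directly in the physical plan: pushed predicates appear under PushedFilters. A sketch, assuming spark is an existing SparkSession and the path and column name are illustrative:

    val df = spark.read.parquet("/tmp/events.parquet")

    // With spark.sql.parquet.filterPushdown=true (the default), explain()
    // lists the predicate under PushedFilters in the FileScan node,
    // e.g. PushedFilters: [IsNotNull(id), GreaterThan(id,5)].
    df.filter("id > 5").explain()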
    

    These properties take a value; they are listed here with their defaults (an example of setting one follows the list):

    spark.sql.parquet.outputTimestampType = INT96
    spark.sql.parquet.compression.codec = snappy
    spark.sql.parquet.pushdown.inFilterThreshold = 10
    spark.sql.parquet.output.committer.class = org.apache.parquet.hadoop.ParquetOutputCommitter
    spark.sql.parquet.columnarReaderBatchSize = 4096
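
    For example, to switch the compression codec from the default snappy to gzip (df and the output paths are illustrative):

    // Session-wide, via the config listed above:
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
    df.write.parquet("/tmp/events_gzip.parquet")

    // Or per write, via the DataFrameWriter's "compression" option:
    df.write.option("compression", "gzip").parquet("/tmp/events_gzip2.parquet")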
    

    Regarding parquet.enable.dictionary, Spark does not yet expose a native spark.sql.* option for it, but it can be set through the SQLContext:

    sqlContext.setConf("parquet.enable.dictionary", "false")
    

    The default value of this property is true in Parquet itself, so it is effectively true whenever the Parquet code is invoked from Spark.
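
    On a SparkSession, the equivalent call is spark.conf.set("parquet.enable.dictionary", "false"). If you want dictionary encoding off only for a particular write, one sketch is to pass the raw Parquet property as a write option; this relies on Spark forwarding unrecognized options into the Hadoop configuration used by the Parquet writer, so verify it on your version (df and the path are illustrative):

    // Sketch: disable dictionary encoding for a single write by forwarding
    // the raw Parquet property (assumes Spark passes it through to Parquet).
    df.write
      .option("parquet.enable.dictionary", "false")
      .parquet("/tmp/events_nodict.parquet")

    You can check the effect by inspecting the file's column chunk encodings (for example with parquet-tools meta); dictionary-encoded columns show encodings such as PLAIN_DICTIONARY.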