Search code examples
apache-sparkparquet

spark parquet enable dictionary


I am running a spark job to write to parquet. I want to enable dictionary encoding for the files written. When I check the files, I see they are 'plain dictionary'. However, I do not see any stats for these columns

Let me know if I am missing anything

val ip = spark.read.parquet("/home/hadoop/work/cube/data/date=2020-02-01")
val ip1 = ip.groupBy("asn","country_code").agg(sum("total_hits").as("total_hits")).sort("asn")
ip1.write.parquet("/home/hadoop/work/cube/test_parquet_dictionary/att3")

Here is what I see in meta

parquet-tools meta hdfs://<cluster>:50001//home/hadoop/work/cube/test_parquet_dictionary/att3/part-00190-522feb80-6fd7-4147-87f4-781b4e2c3599-c000.snappy.parquet

creator:           parquet-mr version 1.5.0-cdh5.13.3 (build ${buildNumber}) 
extra:             org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"asn","type":"string","nullable":true,"metadata":{}},{"name":"country_code","type":"string","nullable":true,"metadata":{}},{"name":"total_ [more]...

file schema:       spark_schema 
---------------------------------------------------------------------------------------------------------
asn:          OPTIONAL BINARY O:UTF8 R:0 D:1
country_code: OPTIONAL BINARY O:UTF8 R:0 D:1
total_hits:   OPTIONAL INT64 R:0 D:1

row group 1:       RC:149 TS:2750 
---------------------------------------------------------------------------------------------------------
asn:           BINARY SNAPPY DO:0 FPO:4 SZ:665/1361/2.05 VC:149 ENC:BIT_PACKED,PLAIN,RLE
country_code:  BINARY SNAPPY DO:0 FPO:669 SZ:300/331/1.10 VC:149 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
total_hits:    INT64 SNAPPY DO:0 FPO:969 SZ:668/1058/1.58 VC:149 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY

Here is the footer info as well

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
asn:           BINARY SNAPPY DO:0 FPO:4 SZ:665/1361/2.05 VC:149 ENC:RLE,PLAIN,BIT_PACKED
country_code:  BINARY SNAPPY DO:0 FPO:669 SZ:300/331/1.10 VC:149 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
total_hits: INT64 SNAPPY DO:0 FPO:969 SZ:668/1058/1.58 VC:149 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED

asn TV=149 RL=0 DL=1
-----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1324 VC:149

country_code TV=149 RL=0 DL=1 DS: 30 DE:PLAIN_DICTIONARY
---------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:104 VC:149

total_hits TV=149 RL=0 DL=1 DS:        107 DE:PLAIN_DICTIONARY
---------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:142 VC:149

Spark version ==>

sc.version
res10: String = 2.4.0.cloudera2

Solution

  • Got the answer. The parquet tools version I was using was 1.6. Upgrading to 1.10 solved the issue