Tags: apache-spark, encryption, parquet

Parquet Encryption: How to encrypt array of structs?


Since Spark 3.2, there is an interesting feature inherited from Parquet: Parquet Columnar Encryption.

The documentation is pretty clear on how to specify which key to use for a specific column in the dataframe schema. For example:

squaresDF.write.option("parquet.encryption.column.keys", "keyA:square")

if we want to encrypt a column called square with the key identified by the keyA tag in our KMS.

The problem is: how do I specify the column name if my column is an array of a struct type?

For example:

myDF.printSchema

root
|-- int_column: integer (nullable = false)
|-- square_int_column: double (nullable = false)
|-- more: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- name: string (nullable = true)
|    |    |-- description: string (nullable = true)

How can I specify the key for the column more, or for the nested field more.name? Is this supported? I cannot find anything about it in the Parquet or Spark documentation.


Solution

  • After some research, I decided to inspect a generated Parquet file with parquet-tools in order to understand how arrays of structs are laid out in the file.

    So, after creating a Parquet file with the needed schema, I opened it with:

     java -jar ~/parquet-tools-1.11.0.jar meta <my-parquet-file-path> | less
    

    Checking the column metadata, I found this:

    [...cut...]
    more:                               OPTIONAL F:1
    .list:                              REPEATED F:1
    ..element:                          OPTIONAL F:8
    ...name:                            OPTIONAL BINARY L:STRING R:1 D:4
    [...cut...]
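
    The path shown above follows the list layout that Spark uses by default when writing Parquet (a repeated group named list wrapping a field named element), so the leaf paths can also be derived from the Spark schema without opening a file. The following is a minimal sketch, not part of any Spark or Parquet API: parquetLeafPaths is a hypothetical helper, it assumes spark.sql.parquet.writeLegacyFormat is left at its default of false, and it treats user-defined types as plain leaves. Checking a real file with parquet-tools remains the authoritative test.

    ```scala
    import org.apache.spark.sql.types._

    // Hypothetical helper: walk a Spark schema and return the dot-separated
    // Parquet path of every leaf column, assuming the default (non-legacy) layout.
    def parquetLeafPaths(dataType: DataType, prefix: Seq[String] = Nil): Seq[String] =
      dataType match {
        case StructType(fields) =>
          // A struct contributes one path segment per field name.
          fields.toSeq.flatMap(f => parquetLeafPaths(f.dataType, prefix :+ f.name))
        case ArrayType(elementType, _) =>
          // Arrays are written as <name>.list.element
          parquetLeafPaths(elementType, prefix ++ Seq("list", "element"))
        case MapType(keyType, valueType, _) =>
          // Maps are written as <name>.key_value.key and <name>.key_value.value
          parquetLeafPaths(keyType, prefix ++ Seq("key_value", "key")) ++
            parquetLeafPaths(valueType, prefix ++ Seq("key_value", "value"))
        case _ =>
          // Anything else is treated as a leaf column.
          Seq(prefix.mkString("."))
      }

    // The schema from the question, rebuilt by hand.
    val schema = StructType(Seq(
      StructField("int_column", IntegerType, nullable = false),
      StructField("square_int_column", DoubleType, nullable = false),
      StructField("more", ArrayType(StructType(Seq(
        StructField("name", StringType),
        StructField("description", StringType)
      ))))
    ))

    parquetLeafPaths(schema).foreach(println)
    // int_column
    // square_int_column
    // more.list.element.name
    // more.list.element.description
    ```

    In a spark-shell session, myDF.schema could be passed in place of the hand-built schema.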
    

    So, to encrypt the name field, we need to specify the column by its full path inside the Parquet file:

    squaresDF.write
       .option("parquet.encryption.column.keys", "keyA:more.list.element.name")
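
    As for encrypting the whole more column: encryption keys are assigned per leaf column, so the approach that follows from the above is to list every leaf under more. The sketch below assumes the parquet.encryption.column.keys format of comma-separated columns per key (with semicolons between keys), and follows the Spark documentation example in also setting a footer key. The names myDF, keyB, and the output path are placeholders, and the KMS client class and crypto factory are assumed to be configured already, as described in the Spark documentation.

    ```scala
    // Assumes: myDF has the schema from the question, and the Hadoop properties
    // parquet.crypto.factory.class and parquet.encryption.kms.client.class
    // have already been set for this session.
    myDF.write
      // Both leaves of the "more" array are protected with master key keyA.
      .option("parquet.encryption.column.keys",
        "keyA:more.list.element.name,more.list.element.description")
      // keyB is a placeholder master key ID for the file footer.
      .option("parquet.encryption.footer.key", "keyB")
      // Placeholder output location.
      .parquet("/path/to/table.parquet.encrypted")
    ```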