Since Spark 3.2, there is an interesting feature from Parquet: Parquet Columnar Encryption.
The documentation is pretty clear on how to specify which key to use for a specific column in the DataFrame schema. For example:
squaresDF.write.option("parquet.encryption.column.keys", "keyA:square")
if we want to encrypt a column called square with the key identified by the keyA tag in our KMS.
The problem is: how do I specify the column name if my column is an array of structs?
For example:
myDF.printSchema
root
 |-- int_column: integer (nullable = false)
 |-- square_int_column: double (nullable = false)
 |-- more: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- description: string (nullable = true)
How can I specify the key for the column more, or for the nested column more.name? Is this supported? I cannot find anything about it in the Parquet or Spark documentation.
After some research, I decided to explore a generated Parquet file with parquet-tools in order to understand how arrays of structs are organised in the file.
So, after creating a Parquet file with the needed schema, I opened it with:
java -jar ~/parquet-tools-1.11.0.jar meta <my-parquet-file-path> | less
Checking the column metadata, I found this:
[...cut...]
more: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL F:8
...name: OPTIONAL BINARY L:STRING R:1 D:4
[...cut...]
So, to encrypt the name field, we need to specify the column by its full Parquet path, including the intermediate list and element levels:
squaresDF.write
  .option("parquet.encryption.column.keys", "keyA:more.list.element.name")
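If you would rather not read the paths off parquet-tools by hand, the mapping can be sketched in a few lines of Scala. This is only a sketch: ParquetPaths and leafPaths are hypothetical names (not part of Spark or Parquet), and it assumes Spark's default, non-legacy Parquet writer, where arrays become list.element and maps become key_value.key / key_value.value. With other writer settings the layout may differ, so it is worth confirming the result against the parquet-tools output. It is meant to be pasted into spark-shell, and it rebuilds the schema from the question by hand so that it does not depend on myDF.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of Spark or Parquet): lists the Parquet leaf
// column paths for a Spark schema, assuming the default (non-legacy) layout.
object ParquetPaths {
  def leafPaths(dt: DataType, prefix: String): Seq[String] = dt match {
    case s: StructType =>
      // Struct fields are appended to the parent path with a dot.
      s.fields.toSeq.flatMap { f =>
        val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
        leafPaths(f.dataType, path)
      }
    case a: ArrayType =>
      // Assumed 3-level list layout: <column>.list.element
      leafPaths(a.elementType, s"$prefix.list.element")
    case m: MapType =>
      // Assumed map layout: <column>.key_value.key and <column>.key_value.value
      leafPaths(m.keyType, s"$prefix.key_value.key") ++
        leafPaths(m.valueType, s"$prefix.key_value.value")
    case _ =>
      // Any other type is treated as a leaf column.
      Seq(prefix)
  }
}

// Same schema as in the question, rebuilt by hand for illustration.
val schema = StructType(Seq(
  StructField("int_column", IntegerType, nullable = false),
  StructField("square_int_column", DoubleType, nullable = false),
  StructField("more", ArrayType(StructType(Seq(
    StructField("name", StringType),
    StructField("description", StringType)))))
))

// Parquet encrypts leaf columns, so covering all of "more" means listing
// every leaf path under it, comma-separated after the key id.
val moreLeaves = ParquetPaths.leafPaths(schema, "").filter(_.startsWith("more."))
val columnKeys = s"keyA:${moreLeaves.mkString(",")}"
println(columnKeys)
```

The resulting columnKeys string can then be passed as the value of the parquet.encryption.column.keys option. With a real DataFrame, ParquetPaths.leafPaths(myDF.schema, "") should give the same list of paths.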