Tags: amazon-web-services, apache-spark, amazon-s3, pyspark, amazon-kms

PySpark DataFrame: read from one bucket and write to another bucket with different KMS keys in the same job


I need a little help finding a better solution for my use case below.

I have an S3 bucket that contains input data, and it is encrypted with KMS KEY 1,

so I am able to set KMS KEY 1 on my Spark session using "spark.hadoop.fs.s3.serverSideEncryption.kms.keyId"

and I am able to read the data.
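For context, the read side looks roughly like this (a minimal sketch; the bucket path, file format, and key ARN are placeholders, and the property name is the EMRFS one mentioned above):

```python
from pyspark.sql import SparkSession

# Sketch only: bucket path and KMS key ARN below are placeholders.
spark = (
    SparkSession.builder
    .appName("read-with-kms-key-1")
    # EMRFS/Glue property from the question; the value is the KMS KEY 1 key ID or ARN
    .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId",
            "arn:aws:kms:us-east-1:111111111111:key/KEY-1-ID")
    .getOrCreate()
)

input_df = spark.read.parquet("s3://input-bucket-kms-key-1/input/")  # Parquet assumed
```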

Now I want to write the data to another S3 bucket, but it is encrypted with KMS KEY 2.

So what I am currently doing is: creating a Spark session with KEY 1, reading the data into a DataFrame, converting it to a pandas DataFrame, stopping the Spark session, recreating the Spark session within the same AWS Glue job with KMS KEY 2, converting the pandas data from the previous step back into a Spark DataFrame, and writing it to the output S3 bucket.
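Roughly, that workaround looks like this (a sketch with placeholder bucket names and key ARNs; the pandas round trip in the middle is where the type drift creeps in):

```python
from pyspark.sql import SparkSession

KEY_1_ARN = "arn:aws:kms:us-east-1:111111111111:key/KEY-1-ID"  # placeholder
KEY_2_ARN = "arn:aws:kms:us-east-1:111111111111:key/KEY-2-ID"  # placeholder

# Session 1: configured with KMS KEY 1; read the input and pull it to the driver as pandas.
spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", KEY_1_ARN)
         .getOrCreate())
pdf = spark.read.parquet("s3://input-bucket-kms-key-1/input/").toPandas()
spark.stop()

# Session 2: recreated with KMS KEY 2; rebuild a Spark DataFrame and write it out.
# createDataFrame() re-infers the schema from pandas dtypes, which is where types can drift.
spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", KEY_2_ARN)
         .getOrCreate())
spark.createDataFrame(pdf).write.parquet("s3://output-bucket-kms-key-2/output/")
```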

But this approach sometimes causes datatype issues. Is there a better alternative for handling this use case?

Thanks in advance; your help is greatly appreciated.


Solution

  • You don't need to declare which key to use to decrypt data encrypted with S3-KMS; the key ID to use is attached as an attribute of the file. AWS S3 reads the encryption settings, sees the key ID, and sends the KMS-encrypted symmetric key off to AWS KMS, asking for it to be decrypted on behalf of the user/IAM role making the request. If that user/role has the right permission, S3 gets the unencrypted key back, decrypts the file, and returns it.

    To read data from the bucket encrypted with KMS KEY 1, you should be able to set the key to the KEY 2 value (or no key at all) and still get the data back; see the sketch after this answer.

    Disclaimer: I haven't tested this with the EMR S3 connector, just the Apache Hadoop S3A one, but since S3-KMS works the same everywhere, I'd expect this to hold. Encryption with client-supplied keys (S3-CSE) is a different story: there you do need the clients correctly configured, which is why S3A supports per-bucket configuration.
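A minimal sketch of the single-session approach this suggests, assuming the EMRFS property from the question and placeholder bucket names and key ARNs; the configured key only affects newly written objects, while reads are decrypted with whatever key each object already carries:

```python
from pyspark.sql import SparkSession

# One session for the whole job. The configured key ID is only used when writing;
# reads are decrypted with whatever key each object was encrypted with, provided
# the job's IAM role is allowed to use that key.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId",
            "arn:aws:kms:us-east-1:111111111111:key/KEY-2-ID")  # write key (placeholder)
    .getOrCreate()
)

df = spark.read.parquet("s3://input-bucket-kms-key-1/input/")  # decrypted with KEY 1 automatically
df.write.parquet("s3://output-bucket-kms-key-2/output/")       # new objects encrypted with KEY 2
```

If the job runs on the S3A connector instead, the per-bucket configuration mentioned above can pin SSE-KMS and a key to the output bucket only. Property names vary by Hadoop version (newer releases rename them to fs.s3a.encryption.algorithm / fs.s3a.encryption.key), so treat this as an illustrative sketch rather than a definitive reference:

```python
# S3A per-bucket encryption settings (sketch; "output-bucket-kms-key-2" is a placeholder).
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.bucket.output-bucket-kms-key-2.server-side-encryption-algorithm",
            "SSE-KMS")
    .config("spark.hadoop.fs.s3a.bucket.output-bucket-kms-key-2.server-side-encryption.key",
            "arn:aws:kms:us-east-1:111111111111:key/KEY-2-ID")
    .getOrCreate()
)
```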