python, pandas, dataframe, encryption, seal

How to encrypt data frame column through PySeal library


I am doing research on fully homomorphic encryption (FHE). FHE allows computation to be performed directly on encrypted data, and this mechanism is provided by the PySEAL library, a Python fork of the Microsoft SEAL library. I have 3 columns in my data frame, and I want to encrypt every value in each column using PySEAL so that I can do computation on those values.

df

| SNP  | ID     | Effect|
|:---- |:------:| -----:|
| 21515| 1      | 0.5   |
| 21256| 2      | 0.7   |
| 21286| 3      | 1.7   |

Related PySEAL documentation: https://github.com/Lab41/PySEAL/blob/master/SEALPythonExamples/examples.py


Solution

  • Interesting question. I can help you use the library with pandas, but not with choosing secure encryption parameters such as the moduli.

    First let's do some imports:

    import pandas
    import seal
    from seal import Ciphertext, \
        Decryptor, \
        Encryptor, \
        EncryptionParameters, \
        Evaluator, \
        IntegerEncoder, \
        FractionalEncoder, \
        KeyGenerator, \
        Plaintext, \
        SEALContext
    

    Now we set the encryption parameters. I do not know enough to advise you on how to set these correctly, but getting the values correct is important to achieve proper security. A quote from the documentation:

    It is critical to understand how these different parameters behave, how they affect the encryption scheme, performance, and the security level... due to the complexity of this topic, we highly recommend the user to directly consult an expert in homomorphic encryption and RLWE-based encryption schemes to determine the security of their parameter choices.

    parms = EncryptionParameters()
    parms.set_poly_modulus("1x^2048 + 1")
    parms.set_coeff_modulus(seal.coeff_modulus_128(2048))
    parms.set_plain_modulus(1 << 8)
    context = SEALContext(parms)
    

    Next we'll set up the keys, encoders, encryptor and decryptor.

    iEncoder = IntegerEncoder(context.plain_modulus())
    fEncoder = FractionalEncoder(
        context.plain_modulus(), context.poly_modulus(), 64, 32, 3)
    
    keygen = KeyGenerator(context)
    public_key = keygen.public_key()
    secret_key = keygen.secret_key()
    encryptor = Encryptor(context, public_key)
    evaluator = Evaluator(context)
    decryptor = Decryptor(context, secret_key)
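
    As a rough intuition for why a plain modulus as small as 1 << 8 can still cope with values like 21515: as far as I understand, IntegerEncoder by default writes an integer in base 2 and stores one bit per polynomial coefficient, so the plain modulus bounds the coefficients rather than the integer itself. The pure-Python sketch below mimics that idea for non-negative integers only. The helper names are made up for illustration, and this is an assumption-based model, not the library's implementation.

```python
def encode_base2(value):
    """Coefficient list, lowest degree first, one bit per coefficient."""
    coeffs = []
    while value:
        coeffs.append(value & 1)
        value >>= 1
    return coeffs or [0]


def poly_mul(a, b, plain_modulus):
    """Schoolbook polynomial product with coefficients reduced mod plain_modulus."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % plain_modulus
    return out


def decode_base2(coeffs):
    """Evaluate the polynomial at x = 2 (non-negative coefficients only)."""
    return sum(c << i for i, c in enumerate(coeffs))


PLAIN_MODULUS = 1 << 8  # same value as in the parameters above
product = poly_mul(encode_base2(21515), encode_base2(2), PLAIN_MODULUS)
decoded = decode_base2(product)
print(decoded)
```

    Multiplying by 2 only shifts the coefficients up by one position here, so no coefficient gets anywhere near the modulus. Products of two large values make the coefficients grow, which is where the choice of plain modulus starts to matter.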
    

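    Let's set up some handy functions we will use with DataFrames to encrypt and decrypt.

    def iencrypt(ivalue):
        iplain = iEncoder.encode(ivalue)
        out = Ciphertext()
        encryptor.encrypt(iplain, out)
        return out
    
    def fencrypt(fvalue):
        fplain = fEncoder.encode(fvalue)
        out = Ciphertext()
        encryptor.encrypt(fplain, out)
        return out
    
    def idecrypt(ivalue):
        # decrypt into a Plaintext, then decode back to a Python integer
        iplain = Plaintext()
        decryptor.decrypt(ivalue, iplain)
        return iEncoder.decode_int32(iplain)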
    

    Finally we'll define a multiplication operation on integers that we can use with pandas. To keep this answer short we won't demonstrate an operation on floating point numbers but it shouldn't be hard to make one.

    def i_multiplied(multiplier):
        m_plain = iEncoder.encode(multiplier)
        out = Ciphertext()
        encryptor.encrypt(m_plain, out)
        def aux(enc_value):
            # this is an in-place operation, so there is nothing to return
            evaluator.multiply(enc_value, out)
        return aux
    

    Note that Evaluator.multiply is an in-place operation, so when we use it with a DataFrame it will mutate the values inside!

    Now let's put it all to work:

    df = pandas.DataFrame(dict(
        SNP=[21515, 21256, 21286],
        ID=[1, 2, 3],
        Effect=[0.5, 0.7, 1.7])
    )
    print("Input/Plaintext Values:")
    print(df.head())
    

    This prints your example:

    Input/Plaintext Values:
         SNP  ID  Effect
    0  21515   1     0.5
    1  21256   2     0.7
    2  21286   3     1.7
    

    Now let's make an encrypted dataframe:

    enc_df = pandas.DataFrame(dict(
        xSNP=df['SNP'].apply(iencrypt),
        xID=df['ID'].apply(iencrypt),
        xEffect=df['Effect'].apply(fencrypt))
    )
    
    print("Encrypted Values:")
    print(enc_df.head())
    

    Prints:

    Encrypted Values:

    _  xSNP                           
    0  <seal.Ciphertext object at 0x7efcccfc2df8>  <seal.Ciphertext object a
    1  <seal.Ciphertext object at 0x7efcccfc2d88>  <seal.Ciphertext object a
    2  <seal.Ciphertext object at 0x7efcccfc2dc0>  <seal.Ciphertext object a
    

    Which is just a bunch of Ciphertext objects in a DataFrame.

    Now let's do an operation.

    # multiply in place
    enc_df[['xSNP','xID']].applymap(i_multiplied(2))
    
    print("Encrypted Post-Op Values:")
    print(enc_df.head())
    

    You won't notice a difference in values printed at this point because all we did was mutate the objects in the dataframe, so it will just print the same memory references.
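
    The same effect can be reproduced without PySEAL. In the minimal sketch below, FakeCiphertext is a hypothetical stand-in that just holds a number, and a plain list plays the role of the DataFrame column. It shows that the mapped function returns nothing, the objects keep their identity, and only their contents change.

```python
class FakeCiphertext:
    """Hypothetical stand-in for seal.Ciphertext; it only holds a number."""

    def __init__(self, value):
        self.value = value


def multiplied_in_place(multiplier):
    def aux(obj):
        # Mutates the object and, like the helper above, returns nothing.
        obj.value *= multiplier
    return aux


column = [FakeCiphertext(v) for v in (21515, 21256, 21286)]
ids_before = [id(c) for c in column]

# Equivalent in spirit to applying the function to every cell.
returned = [multiplied_in_place(2)(c) for c in column]

ids_after = [id(c) for c in column]
values_after = [c.value for c in column]

print(returned)                 # [None, None, None]
print(ids_before == ids_after)  # True
print(values_after)             # [43030, 42512, 42572]
```

    This is also why the result of applymap is discarded above: it would only contain None values.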

    Now let's decrypt to see the results:

    enc_df[['xSNP','xID']]=enc_df[['xSNP','xID']].applymap(idecrypt)
    
    print("Decrypted Post-Op Values:")
    print(enc_df[['xSNP','xID']].head())
    

    This prints:

    Decrypted Post-Op Values:
        xSNP  xID
    0  43030    2
    1  42512    4
    2  42572    6
    

    Which is the result you'd expect from multiplying the integer columns by two.

    To use this in practice, you would have to serialise the encrypted dataframe, send it to the other party to be worked on, and have it returned to you for decryption. The library forces you to use pickle for this, which is unfortunate from a security point of view, since you should never unpickle untrusted data. Can the server trust the client not to put anything nasty in the pickle serialisation, and can the client trust that the server won't do the same when it returns the answer? In general the answer to both is no, all the more so here, since the client already doesn't trust the server; otherwise it would not be using homomorphic encryption! Clearly these Python bindings are more of a tech demonstrator, but I thought it was worth pointing out this limitation.
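
    To make the pickle concern concrete, here is a minimal sketch using only the standard library (nothing in it involves PySEAL, and the Payload class is hypothetical). An object can tell pickle to call an arbitrary function when it is loaded. The call here is harmless, but it could just as well be anything else.

```python
import pickle


class Payload:
    # pickle consults __reduce__ when serialising. It may return any callable
    # plus arguments, and pickle.loads calls that callable while deserialising.
    def __reduce__(self):
        return (eval, ("6 * 7",))  # harmless here, but could be any code


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # runs eval("6 * 7") during unpickling
print(type(result), result)  # <class 'int'> 42
```

    Note that the receiver never gets a Payload object back at all. Simply calling pickle.loads on the bytes is enough to run the sender's chosen function.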

    There are batch operations in the library, which I have not demonstrated. These may make more sense to use in the context of DataFrames, since they should have better performance for operations over many values.