java scala encryption bouncycastle

How to encrypt data with Bouncy Castle while ensuring the result is deterministic


Problem

We want to encrypt personally identifiable information so that it is not readable. However, because the results will also be used for machine learning, each time a value (say "ABC") gets encrypted, the resulting data should be the same.

Most cipher modes use an initialization vector, which goes against what we need. To be clear, the data is supposed to be encrypted, but the encryption doesn't need to be bulletproof. The data is never transferred outside of the organization, and this is done simply to adhere to GDPR.

Context

We have decided to use Bouncy Castle because it supports a large number of encryption modes, including the (apparently fast ECC). Since we are talking about encrypting several TB a day, it would be nice to have good performance.

Solution issues

Although the Bouncy Castle library is well written, it seems difficult to find good documentation and usage examples for it. I am struggling to find my entry point. Do I have to look at the org.bouncycastle.crypto package, the org.bouncycastle.crypto.engines package, or crypto.ec? I found the ZeroBytePadding class, which I believe should point me to an engine that does what I want, but I cannot find what I am looking for.

Goal

A class that has a set of methods similar to this:

object Anonomyzer {
  // Desired signatures only; ??? marks bodies that still have to be implemented
  def initialize(publicKey: String, privateKey: String): Unit = ???
  def encode(data: Array[Byte]): Array[Byte] = ???
  def decode(data: Array[Byte]): Array[Byte] = ???
}

The last line of the following code should evaluate to true:

Anonomyzer.initialize("PUBLIC", "PRIVATE")
val once = Anonomyzer.encode(data)
val twice = Anonomyzer.encode(data)
Arrays.equals(once, twice)

Edit: I've read more on this and found that what I am looking for is called the Electronic Codebook (ECB) mode of operation. Although it is not perfectly secure, it is the best we can hope for AFAIK.


Solution

  • However, because the results will also be used for machine learning, each time a value (say "ABC") gets encrypted, the resulting data should be the same

    You may have more options than that. It is still safer to properly encrypt data wherever it needs to be encrypted, and you can keep different datasets for different purposes.

    Just suggestions:

    • you may anonymize the learning dataset by stripping the PII from the data and aggregating it to a reasonable level that is still valuable for ML. I'd prefer this option because it is clean and doesn't risk breaching any rules or leaking protected information
    • you may hash the PII (or categorical data), which gives a unique mapping that cannot be directly reversed (though a mapping from the original values to their hashes will always exist)
    • for quantitative data you may look up "order preserving encryption", which may not be trivial to do properly (that's one of the reasons why I'd go for the 1st option)
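
To illustrate the hashing suggestion, here is a minimal sketch using only the standard Java crypto API. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, which is my own assumption: a plain hash of low-entropy values such as names can be reversed by simply hashing guesses, while a keyed hash requires the secret. The class name `Pseudonymizer` and the demo secret are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: deterministic, non-reversible tokens via HMAC-SHA256. */
final class Pseudonymizer {
    private final SecretKeySpec key;

    Pseudonymizer(byte[] secret) {
        // Keep the secret private: whoever holds it can test guesses against the tokens.
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Returns the same Base64 token every time for the same input and secret. */
    String pseudonymize(String value) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] tag = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(tag);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Demo-only secret; a real one would come from a key store.
        Pseudonymizer p = new Pseudonymizer("demo-secret".getBytes(StandardCharsets.UTF_8));
        String once = p.pseudonymize("ABC");
        String twice = p.pseudonymize("ABC");
        System.out.println(once);
        System.out.println("same token both times: " + once.equals(twice));
    }
}
```

Equal inputs give equal tokens, so the tokens can still serve as categorical features for ML, but there is no `decode` counterpart. If you need to get the original value back, this option does not fit.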

    Taking shortcuts (using ECB or a static IV) may in some cases completely break the security of the encrypted data. So unless you really know what you are doing, you may shoot yourself in the foot.
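
To make the ECB trade-off concrete, here is a small sketch with the default JDK provider (no Bouncy Castle needed). It shows both sides: ECB gives the deterministic output asked for, and it also reveals where the plaintext repeats. The class name `EcbLeakDemo` and the hard-coded key are for demonstration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

final class EcbLeakDemo {
    /** Encrypts with AES in ECB mode; deterministic because no IV is involved. */
    static byte[] encryptEcb(byte[] key, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Demo-only 128-bit key; never hard-code a real key.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

        // Three 16-byte blocks: the first two are identical, the third differs.
        byte[] data = new byte[48];
        Arrays.fill(data, 32, 48, (byte) 1);

        byte[] ct = encryptEcb(key, data);
        boolean deterministic = Arrays.equals(ct, encryptEcb(key, data));
        boolean repeatVisible = Arrays.equals(
                Arrays.copyOfRange(ct, 0, 16), Arrays.copyOfRange(ct, 16, 32));

        System.out.println("deterministic: " + deterministic);
        System.out.println("repeated plaintext block visible: " + repeatVisible);
    }
}
```

Both lines print `true`. The second one is the problem: anyone looking at the ciphertext can see which blocks (and therefore which values) are equal, without knowing the key.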

    We have decided to use Bouncy Castle because it supports a large number of encryption modes, including the (apparently fast ECC)

    I'd say you don't need the BC library. It is a very well written library, but in your case I don't see any specific need for it.

    (apparently fast ECC). Since we are talking about encrypting several TB a day, it would be nice to have good performance

    ECC is still asymmetric encryption, which is usually used as part of hybrid encryption (encrypting a symmetric data encryption key). So if you aim for speed, you may check that your JVM and VM allow native AES-NI support, or use a fast stream cipher (Salsa20, ..). Encryption is usually not the performance bottleneck if done properly.
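
A rough sketch of what hybrid encryption looks like, under stated assumptions: RSA-OAEP stands in for the ECC step so that the example runs on a stock JDK without extra providers, and AES-GCM is my choice for the bulk cipher. The names `HybridSketch` and `Envelope` are hypothetical. Only the small data key goes through the slow asymmetric cipher; the data itself is encrypted with AES.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class HybridSketch {
    private static final String WRAP = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";
    private static final String DATA = "AES/GCM/NoPadding";

    /** Wrapped data key, IV and ciphertext travel together. */
    static final class Envelope {
        final byte[] wrappedKey, iv, ciphertext;
        Envelope(byte[] wrappedKey, byte[] iv, byte[] ciphertext) {
            this.wrappedKey = wrappedKey; this.iv = iv; this.ciphertext = ciphertext;
        }
    }

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static Envelope encrypt(PublicKey recipient, byte[] plaintext) {
        try {
            // 1. Fresh symmetric data key for the bulk data.
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            SecretKey dataKey = kg.generateKey();
            // 2. Fast symmetric encryption of the data with a random IV.
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher aes = Cipher.getInstance(DATA);
            aes.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
            byte[] ciphertext = aes.doFinal(plaintext);
            // 3. Slow asymmetric step, applied to the short data key only.
            Cipher rsa = Cipher.getInstance(WRAP);
            rsa.init(Cipher.WRAP_MODE, recipient);
            return new Envelope(rsa.wrap(dataKey), iv, ciphertext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] decrypt(PrivateKey recipient, Envelope env) {
        try {
            Cipher rsa = Cipher.getInstance(WRAP);
            rsa.init(Cipher.UNWRAP_MODE, recipient);
            SecretKey dataKey = (SecretKey) rsa.unwrap(env.wrappedKey, "AES", Cipher.SECRET_KEY);
            Cipher aes = Cipher.getInstance(DATA);
            aes.init(Cipher.DECRYPT_MODE, dataKey, new GCMParameterSpec(128, env.iv));
            return aes.doFinal(env.ciphertext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair pair = newKeyPair();
        Envelope env = encrypt(pair.getPublic(), "ABC".getBytes());
        System.out.println(new String(decrypt(pair.getPrivate(), env)));
    }
}
```

Note that this is randomized by design (fresh key and IV per message), so it does not give the deterministic output the question asks for. It is only meant to show where asymmetric crypto usually sits.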

    I am struggling to find my entrypoint.

    In most cases you can use the default Java crypto API with a specified provider:

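    // BouncyCastleProvider comes from org.bouncycastle.jce.provider; register it once
    Security.addProvider(new BouncyCastleProvider());
    ...
    // OFB is a stream mode, so no padding is needed
    Cipher cipher = Cipher.getInstance("AES/OFB/NoPadding", "BC");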
    

    or

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding", "BC");
    

    Edit: fixed padding combinations
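
For completeness, a minimal round-trip sketch with the standard API and "AES/CBC/PKCS5Padding". I left out the "BC" argument so that it runs on the default JDK provider; after registering the provider as in the snippets above, passing "BC" as the second argument to `Cipher.getInstance` should select Bouncy Castle instead. The class name `CbcRoundTrip`, the choice to prepend the IV to the ciphertext, and the demo key are my assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class CbcRoundTrip {
    private static final String TRANSFORMATION = "AES/CBC/PKCS5Padding";
    private static final int IV_LEN = 16; // AES block size in bytes

    /** Returns IV || ciphertext, with a fresh random IV for every call. */
    static byte[] encrypt(byte[] key, byte[] plaintext) {
        try {
            byte[] iv = new byte[IV_LEN];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(plaintext);
            byte[] out = Arrays.copyOf(iv, IV_LEN + ct.length);
            System.arraycopy(ct, 0, out, IV_LEN, ct.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the IV from the first 16 bytes and decrypts the rest. */
    static byte[] decrypt(byte[] key, byte[] message) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(message, 0, IV_LEN));
            return cipher.doFinal(message, IV_LEN, message.length - IV_LEN);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Demo-only 128-bit key; load a real key from a key store instead.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        byte[] once = encrypt(key, "ABC".getBytes(StandardCharsets.UTF_8));
        byte[] twice = encrypt(key, "ABC".getBytes(StandardCharsets.UTF_8));
        System.out.println("ciphertexts equal: " + Arrays.equals(once, twice));
        System.out.println(new String(decrypt(key, once), StandardCharsets.UTF_8));
    }
}
```

Because of the random IV, two encryptions of "ABC" differ, which is exactly the property that conflicts with the determinism requirement, and the reason I'd look at the anonymization or hashing options first.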