Problem
We want to encrypt personally identifiable information so that it is not readable. However, because the results will also be used for machine learning, each time a value (say "ABC") is encrypted, the resulting ciphertext should be the same.
Most cipher modes of operation use an initialization vector, which goes against what we need. To be clear, the data is supposed to be encrypted, but this doesn't need to be bulletproof. The data is never transferred outside of the organization, and this is simply done to adhere to GDPR.
Context
We have decided to use Bouncy Castle because it supports a large number of encryption modes, including the (apparently fast) ECC. Since we are talking about encrypting several TB a day, it would be nice to have good performance.
Solution issues
Although the Bouncy Castle library is well written, it seems difficult to find good documentation and usage examples for it. I am struggling to find my entry point. Do I have to look at the org.bouncycastle.crypto package, or org.bouncycastle.crypto.engines, or crypto.ec? I found the ZeroBytePadding class, which I believe should point me to a potential engine that does what I want, but I cannot find what I am looking for.
Goal
A class that has a set of methods similar to this:
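object Anonomyzer {
  // Desired interface only; the bodies are intentionally left unimplemented (???).
  def initialize(publicKey: String, privateKey: String): Unit = ???
  def encode(data: Array[Byte]): Array[Byte] = ???
  def decode(data: Array[Byte]): Array[Byte] = ???
}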
The last line of the following code should evaluate to true:
Anonomyzer.initialize("PUBLIC", "PRIVATE")
val once = Anonomyzer.encode(data)
val twice = Anonomyzer.encode(data)
Arrays.equals(once, twice)
Edit: I've read more on this and found that what I am looking for is called the Electronic Codebook (ECB) mode of operation. Although this is not perfectly secure, it is the best we can hope for AFAIK.
However, because the results will also be used for machine learning, each time a value (say "ABC") gets encrypted, the resulting data should be the same
You may have more options than that. It is still safer to properly encrypt data where it needs to be encrypted. You may have different datasets for different purposes.
Just suggestions:
Taking shortcuts (using ECB or a static IV) may in some cases completely break the security of the encrypted data. So unless you really know what you are doing, you may shoot yourself in the foot.
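To make that warning concrete, here is a minimal sketch of what ECB leaks, assuming the JDK's default AES provider. The class name EcbLeakDemo, its helper methods, and the demo key are made up for illustration. It shows that identical 16-byte plaintext blocks come out as identical ciphertext blocks, so anyone looking at the ciphertext can see where the input repeats.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
final class EcbLeakDemo {

    // Encrypts with AES in ECB mode; rawKey must be 16, 24, or 32 bytes long.
    static byte[] encryptEcb(byte[] rawKey, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(rawKey, "AES"));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES/ECB encryption failed", e);
        }
    }

    // Returns true if the first two 16-byte ciphertext blocks are identical.
    static boolean firstTwoBlocksEqual(byte[] ciphertext) {
        byte[] first = Arrays.copyOfRange(ciphertext, 0, 16);
        byte[] second = Arrays.copyOfRange(ciphertext, 16, 32);
        return Arrays.equals(first, second);
    }

    public static void main(String[] args) {
        // Placeholder key for the demo only; never hard-code a real key.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

        // 32 bytes of 'A': two identical plaintext blocks.
        byte[] repeated = new byte[32];
        Arrays.fill(repeated, (byte) 'A');

        // 32 distinct bytes: two different plaintext blocks.
        byte[] distinct = new byte[32];
        for (int i = 0; i < distinct.length; i++) {
            distinct[i] = (byte) i;
        }

        System.out.println(firstTwoBlocksEqual(encryptEcb(key, repeated))); // prints true
        System.out.println(firstTwoBlocksEqual(encryptEcb(key, distinct))); // prints false
    }
}
```

For short, field-sized values this repetition is exactly the property the question asks for, but it also means equal inputs are visibly equal to anyone holding the ciphertext.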
We have decided to use bouncy castle because it supports a large number of encryption modes, including the (apparently fast ECC)
I'd say you don't need the BC library. It is a very well-written library, but in your case I don't see any specific need for it.
apparently fast ECC). Since we are talking about encrypting several TB a day, it would be nice to have good performance
ECC is still asymmetric encryption, usually used for hybrid encryption (encrypting a symmetric data encryption key). So if you aim for speed, you may check that your JVM and VM allow native AES-NI support, or use some fast cipher (Salsa20, ...). Encryption is usually not the performance bottleneck if done properly.
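To illustrate the hybrid idea, here is a rough sketch in which an asymmetric key pair only wraps a small AES data key, and the bulk data would then be encrypted with that AES key. Note the swap: this uses RSA with OAEP rather than ECC, because as far as I know the default JDK providers do not ship an ECIES cipher (Bouncy Castle does). The class name HybridKeyWrapDemo and its methods are hypothetical.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.util.Arrays;

// Hypothetical demo class: RSA stands in for the asymmetric part of a hybrid scheme.
final class HybridKeyWrapDemo {
    private static final String WRAP_TRANSFORMATION = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("RSA key generation failed", e);
        }
    }

    // Generates the symmetric key that would encrypt the actual data.
    static SecretKey newDataKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES key generation failed", e);
        }
    }

    // Encrypts (wraps) the AES data key with the public key.
    static byte[] wrap(PublicKey publicKey, SecretKey dataKey) {
        try {
            Cipher cipher = Cipher.getInstance(WRAP_TRANSFORMATION);
            cipher.init(Cipher.WRAP_MODE, publicKey);
            return cipher.wrap(dataKey);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Key wrapping failed", e);
        }
    }

    // Recovers the AES data key with the private key.
    static SecretKey unwrap(PrivateKey privateKey, byte[] wrappedKey) {
        try {
            Cipher cipher = Cipher.getInstance(WRAP_TRANSFORMATION);
            cipher.init(Cipher.UNWRAP_MODE, privateKey);
            return (SecretKey) cipher.unwrap(wrappedKey, "AES", Cipher.SECRET_KEY);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Key unwrapping failed", e);
        }
    }

    public static void main(String[] args) {
        KeyPair keyPair = newKeyPair();
        SecretKey dataKey = newDataKey();

        byte[] wrapped = wrap(keyPair.getPublic(), dataKey);
        SecretKey recovered = unwrap(keyPair.getPrivate(), wrapped);

        // The recovered key is byte-for-byte the original data key.
        System.out.println(Arrays.equals(dataKey.getEncoded(), recovered.getEncoded())); // prints true
    }
}
```

The asymmetric operation runs once per data key, not once per record, which is why the choice of asymmetric algorithm has little effect on throughput when encrypting terabytes.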
I am struggling to find my entrypoint.
In most cases you can use the default Java crypto API with a specified provider:
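import java.security.Security;
import javax.crypto.Cipher;
import org.bouncycastle.jce.provider.BouncyCastleProvider;

// Register Bouncy Castle as a JCA provider under the name "BC"
Security.addProvider(new BouncyCastleProvider());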
...
Cipher cipher = Cipher.getInstance("AES/OFB/NoPadding", "BC");
or
Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding", "BC");
Edit: fixed padding combinations
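To tie this back to the interface in the question, here is a minimal sketch using the same Cipher API. It assumes a single symmetric AES key instead of the public/private pair, and it uses the JDK's default provider so that it runs without extra dependencies; passing "BC" as the second argument to Cipher.getInstance should select Bouncy Castle once the provider is registered. The class name DeterministicAes and the demo key are made up for illustration. ECB is used only because the question asks for repeatable output, so the warning about shortcuts applies in full.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical class: the same input and key always give the same ciphertext.
final class DeterministicAes {
    private static final String TRANSFORMATION = "AES/ECB/PKCS5Padding";

    private final SecretKeySpec key;

    // rawKey must be 16, 24, or 32 bytes long for AES.
    DeterministicAes(byte[] rawKey) {
        this.key = new SecretKeySpec(rawKey, "AES");
    }

    byte[] encode(byte[] data) {
        return run(Cipher.ENCRYPT_MODE, data);
    }

    byte[] decode(byte[] data) {
        return run(Cipher.DECRYPT_MODE, data);
    }

    // Cipher instances are not thread-safe, so a fresh one is created per call.
    private byte[] run(int mode, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES operation failed", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder key for the demo only; load real keys from a key store.
        byte[] demoKey = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        DeterministicAes anonymizer = new DeterministicAes(demoKey);

        byte[] data = "ABC".getBytes(StandardCharsets.UTF_8);
        byte[] once = anonymizer.encode(data);
        byte[] twice = anonymizer.encode(data);

        System.out.println(Arrays.equals(once, twice)); // prints true
        System.out.println(new String(anonymizer.decode(once), StandardCharsets.UTF_8)); // prints ABC
    }
}
```

Creating a Cipher per call keeps the sketch simple; at several TB a day it would probably be worth reusing one initialized Cipher per thread and measuring the difference.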