I have a big text file (100 MB or more) and I want to encrypt its contents with the AES algorithm using Hadoop and Java (map/reduce functions), but as I am new to Hadoop, I am not really sure how to start. I found JCE (a Java library) where AES is already implemented, but it expects a 16-byte block of plaintext along with a key and produces a 16-byte block of ciphertext. My question is how to use this JCE/AES method to get my purpose done. How should I split my big input text file, and what should I pass to the map method of the Mapper class? What should the key and value be? What should be passed to the reduce method? Any kind of starting point or code example would be greatly appreciated. (P.S. I am new to Hadoop; I have only run the wordcount example on my machine so far.)
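For reference, this is roughly how I understand the single-block JCE call works (the key and plaintext below are just placeholders I made up):

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class SingleBlockAes {
    public static void main(String[] args) throws Exception {
        byte[] key   = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII); // 16-byte key (placeholder)
        byte[] block = "exactly16bytes!!".getBytes(StandardCharsets.US_ASCII); // one 16-byte plaintext block

        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding"); // raw single-block AES
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] ciphertext = cipher.doFinal(block);               // 16-byte ciphertext
        System.out.println(ciphertext.length);                   // prints 16
    }
}
```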
EDIT 1:
Actually, I have to do the following things:
My question now is: how do I parallelize this using Hadoop's map and reduce methods? What should the key be, and how do I accumulate the output ciphertexts into the output file?
Encrypting a large stream with a block cipher requires you to resolve a fundamental issue, entirely independent of how you actually split the work (M/R or whatever). The problem is cipher-block chaining: because each block depends on the output of the previous block, you cannot encrypt (or decrypt) block N without first encrypting (or decrypting) block N-1. This implies that you can only encrypt the file one block at a time, starting with block 1, then block 2, then block 3, and so on.
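To make the dependency concrete, here is a minimal sketch of the CBC chaining relation C[i] = AES_K(P[i] XOR C[i-1]), using a raw single-block cipher (AES/ECB/NoPadding) as the primitive; notice that block i cannot be computed before block i-1:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class CbcChaining {
    // blocks: the plaintext split into 16-byte blocks; iv: 16-byte initialization vector.
    static byte[] encryptCbc(byte[] key, byte[] iv, byte[][] blocks) throws Exception {
        Cipher rawAes = Cipher.getInstance("AES/ECB/NoPadding"); // single-block primitive
        rawAes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));

        byte[] prev = iv;                                  // chaining value, C[-1] = IV
        byte[] out = new byte[blocks.length * 16];
        for (int i = 0; i < blocks.length; i++) {
            byte[] xored = new byte[16];
            for (int j = 0; j < 16; j++) {
                xored[j] = (byte) (blocks[i][j] ^ prev[j]); // mix in previous ciphertext block
            }
            prev = rawAes.doFinal(xored);                  // block i needs block i-1's output
            System.arraycopy(prev, 0, out, i * 16, 16);
        }
        return out;
    }
}
```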
To work around the problem, all encryption solutions do the same thing: they split the stream into chunks of adequate size (the right size is always a trade-off) and use some out-of-band storage to associate each chunk with a startup nonce (initialization vector). This way chunks can be encrypted and decrypted independently.
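A minimal sketch of that per-chunk scheme (the random IV per chunk and the IV-prepended output layout are illustrative choices on my part, not a standard):

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class ChunkedEncryption {
    private static final SecureRandom RNG = new SecureRandom();

    // Encrypts one chunk under its own fresh IV and returns IV || ciphertext,
    // so the chunk can later be decrypted without touching any other chunk.
    static byte[] encryptChunk(SecretKeySpec key, byte[] chunk) throws Exception {
        byte[] iv = new byte[16];
        RNG.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding"); // padding handles odd-length chunks
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = cipher.doFinal(chunk);

        byte[] out = new byte[16 + ct.length];
        System.arraycopy(iv, 0, out, 0, 16);
        System.arraycopy(ct, 0, out, 16, ct.length);
        return out;
    }
}
```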
HDFS has a natural chunk (the block), and the access patterns on blocks are single-threaded and sequential, which makes the block the natural choice for an encryption chunk. Adding the extra per-block nonce metadata on the namenode is relatively straightforward. If you are doing this for your own education, it is a fun project to tackle. Key management is a separate issue, and of course, as with any encryption scheme, key management is the genuinely important part; implementing the cipher is the trivial part.
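For the educational MapReduce version of the same idea, a mapper might look roughly like this. I am assuming FixedLengthInputFormat so that each input record is a fixed-size chunk of the file keyed by its byte offset; the hard-coded key is deliberately naive:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each chunk is encrypted independently under a fresh IV;
// the byte-offset key lets the output be written back in the original order.
public class EncryptMapper
        extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {

    // Example only: a hard-coded key is insecure; key management is the hard part.
    private final SecretKeySpec key =
            new SecretKeySpec("0123456789abcdef".getBytes(StandardCharsets.US_ASCII), "AES");
    private final SecureRandom rng = new SecureRandom();

    @Override
    protected void map(LongWritable offset, BytesWritable chunk, Context ctx)
            throws IOException, InterruptedException {
        try {
            byte[] iv = new byte[16];
            rng.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(chunk.copyBytes());

            byte[] out = new byte[16 + ct.length];          // IV || ciphertext
            System.arraycopy(iv, 0, out, 0, 16);
            System.arraycopy(ct, 0, out, 16, ct.length);
            ctx.write(offset, new BytesWritable(out));
        } catch (GeneralSecurityException e) {
            throw new IOException(e);
        }
    }
}
```

A single-reducer job (or any sort by offset) can then concatenate the (IV, ciphertext) records in order; decryption reads the 16-byte IV prefix of each chunk and reverses the process chunk by chunk.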
If you are considering this for real-world use, stop right now and use an off-the-shelf encryption solution for Hadoop, of which there are several.