Tags: hadoop, apache-spark, mapr, bigdata

Hadoop for Large Image Processing


I have a 50 TB set of ~1 GB TIFF images that I need to run the same algorithm on. Currently, I have the rectification process written in C++, and it works well, but it would take forever to run over all of these images sequentially. I understand that a MapReduce/Spark implementation could work, but I can't figure out how to handle image input/output.

Every tutorial/example that I've seen uses plain-text input. Ideally, I would like to use Amazon Web Services as well. If anyone has some direction for me, that would be great. I'm obviously not looking for a full solution, but maybe someone has successfully implemented something close to this? Thanks in advance.


Solution

  • Is your data in HDFS? What exactly do you expect to gain from Hadoop/Spark? It seems to me that all you need is a queue of filenames and a bunch of machines to work through it.

    You can pack your app into AWS Lambda (see Running Arbitrary Executables in AWS Lambda) and trigger an event for each file. Or you can package your app into a Docker container, start up a fleet of them in ECS, and let them loose on a queue of filenames (or URLs, or S3 object keys) to process; see the queue-worker sketch at the end of this answer.

    I think Hadoop/Spark is overkill here, especially since they're quite bad at handling 1 GB splits as input, and your processing is not really a MapReduce job (there are no key-value pairs for reducers to merge). If you must, you could adapt your C++ app to read from stdin and use Hadoop Streaming; see the mapper sketch at the end of this answer.

    Ultimately, the question is: where are the 50 TB of data stored, and in what format? The solution depends a lot on the answer, as you want to bring the compute to where the data is and avoid transferring 50 TB into AWS, or even uploading it into HDFS.
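
For the queue-based ECS approach, here is a minimal sketch of a worker you could run inside a Docker container. It assumes each SQS message body is the path or S3 URL of one image, and that the existing rectification program can be invoked as `./rectify <image>`; the queue URL, the `rectify` name, and how the image bytes actually get fetched are all placeholder assumptions, not details from the question.

```cpp
// Sketch of an SQS-driven worker (AWS SDK for C++, aws-cpp-sdk-sqs).
// Assumptions: one image path/URL per message; a local "rectify" binary
// that knows how to read that path; no output upload shown.
#include <aws/core/Aws.h>
#include <aws/sqs/SQSClient.h>
#include <aws/sqs/model/ReceiveMessageRequest.h>
#include <aws/sqs/model/DeleteMessageRequest.h>
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
    // Hypothetical queue URL; in practice pass it in via environment/config.
    const Aws::String queueUrl =
        "https://sqs.us-east-1.amazonaws.com/123456789012/tiff-jobs";

    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::SQS::SQSClient sqs;
        int emptyPolls = 0;

        while (emptyPolls < 3) {          // stop once the queue looks drained
            // Long-poll for one filename/URL at a time.
            Aws::SQS::Model::ReceiveMessageRequest recv;
            recv.SetQueueUrl(queueUrl);
            recv.SetMaxNumberOfMessages(1);
            recv.SetWaitTimeSeconds(20);

            auto outcome = sqs.ReceiveMessage(recv);
            if (!outcome.IsSuccess()) continue;

            const auto& messages = outcome.GetResult().GetMessages();
            if (messages.empty()) { ++emptyPolls; continue; }
            emptyPolls = 0;

            for (const auto& msg : messages) {
                const std::string image = msg.GetBody().c_str();

                // Hand the image to the existing rectification executable.
                const std::string cmd = "./rectify '" + image + "'";
                const int rc = std::system(cmd.c_str());

                if (rc == 0) {
                    // Delete only after success, so failed images are
                    // redelivered and retried by another worker.
                    Aws::SQS::Model::DeleteMessageRequest del;
                    del.SetQueueUrl(queueUrl);
                    del.SetReceiptHandle(msg.GetReceiptHandle());
                    sqs.DeleteMessage(del);
                } else {
                    std::cerr << "rectification failed for " << image << "\n";
                }
            }
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```

Scaling is then just a matter of running more copies of this container against the same queue.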
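
If you do go the Hadoop Streaming route, a common trick is to make the job's input a plain-text list of image paths (one per line), so the splits stay tiny, and have the mapper fetch and process each image itself. A rough sketch, again assuming a hypothetical `rectify` executable shipped to the nodes alongside the mapper:

```cpp
// Sketch of a Hadoop Streaming mapper. Assumptions: the job input is a text
// file of image paths/URLs (one per line), and "rectify" is the existing C++
// binary distributed with the job.
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        // With the default text input, the mapper sees one path per line.
        // Strip a trailing carriage return if present and skip blanks.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;

        // Run the existing rectification executable on this one image.
        const std::string cmd = "./rectify '" + line + "'";
        const int rc = std::system(cmd.c_str());

        // Emit "path<TAB>status" so failures are easy to grep out of the
        // job output; there is nothing to reduce.
        std::cout << line << '\t' << (rc == 0 ? "ok" : "failed") << '\n';
    }
    return 0;
}
```

A job like this would run map-only (reducers disabled, e.g. with the streaming `-numReduceTasks 0` option), since there are no key-value pairs to merge.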