apache-spark, rdd, amazon-emr

RDD to in.file to external process to out.file to RDD


I need to call an external process from my EMR Spark job. I see that rdd.pipe would allow me to pipe an RDD to a process. (As an aside, is that one process per RDD, or one per element?)

However, my external process requires a filename as input and generates a file as output.

How can I invoke this external process and subsequently load the output file as an RDD?


Solution

  • is that one process per RDD, or one per element?

    Neither. It is a process per partition.

  • process requires a filename as input and generates a file as output. How can ...

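    The simplest solution is to write a small wrapper that writes its stdin to a randomly generated path, invokes your program on that file, then reads the output file and writes it to stdout. That is pretty much all pipe is about: it feeds each partition to the command's stdin and builds the resulting RDD from the command's stdout. Unless your program writes to a distributed file system, you wouldn't be able to retrieve the output otherwise, because the file exists only on the local disk of the worker that ran the process.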
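
A minimal sketch of such a wrapper, in Python. It assumes the external program takes the input path and the output path as its last two arguments; the names `run_via_files`, `in.file`, and `out.file` are placeholders for illustration only, and your tool's real command line may differ.

```python
import os
import subprocess
import sys
import tempfile


def run_via_files(lines, command):
    """Write lines to a temp file, run command + [in_path, out_path],
    and return the lines of the file the command produced."""
    # A fresh temporary directory gives every invocation (one per partition)
    # its own randomly named paths, removed again when the block exits.
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "in.file")
        out_path = os.path.join(workdir, "out.file")
        with open(in_path, "w") as f:
            for line in lines:
                f.write(line.rstrip("\n") + "\n")
        # Assumption: the tool is called as `<command...> <input> <output>`.
        subprocess.run(list(command) + [in_path, out_path], check=True)
        with open(out_path) as f:
            return [line.rstrip("\n") for line in f]


def main(argv):
    # rdd.pipe sends the partition's elements on stdin, one per line, and
    # turns every line printed on stdout into an element of the new RDD.
    for line in run_via_files(sys.stdin, argv[1:]):
        print(line)


# Only act as a wrapper when a command to run was actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

If this were saved as `wrapper.py` (a hypothetical name) and made available on every worker, the call would look something like `rdd.pipe("python wrapper.py /path/to/your/program")`, and the returned RDD would hold the lines of each partition's output file.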