Tags: hadoop, apache-spark, mapr

Hadoop/Yarn/Spark: can I call the command line?


Short version of my question: We need to call the command line from within a Spark job. Is this feasible? The cluster support group indicated this could cause problems with memory.

Long version: I have a job that I need to run on a Hadoop/MapR cluster, processing packet data captured with tshark/wireshark. The data is binary packet data, one file per minute of capture. We need to extract certain fields from this packet data, such as IP addresses. We have investigated options such as jNetPcap, but this library is rather limited, so it looks like we need to call the tshark command line from within the job and process the response. We can't do this directly during capture because we need the capture to be as efficient as possible to avoid dropping packets. Converting the binary data to text outside the cluster is possible, but that conversion is 95% of the work, so we may as well run the entire job as a non-distributed job on a single server, which limits the number of cores we can use.

The command line to decode a capture is either:

    tshark -V -r somefile.pcap
    tshark -T pdml -r somefile.pcap
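
For illustration, calling this from Scala inside a task might look something like the following minimal sketch. The file path is a placeholder, and tshark is assumed to be installed on the node running the code:

    import scala.sys.process._

    // Placeholder path; any pcap file readable on the local node would do.
    val pcapPath = "somefile.pcap"

    // Run tshark and capture its stdout as one String.
    // !! throws an exception if tshark exits with a non-zero status.
    val pdml: String = Seq("tshark", "-T", "pdml", "-r", pcapPath).!!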


Solution

  • Well, it is not impossible. Spark provides a pipe method that can be used to pipe data to an external process and read its output. The general structure could be, for example, something like this:

    import org.apache.spark.rdd.RDD

    val files: RDD[String] = ??? // RDD of input file paths
    val processed: RDD[String] = files.pipe("some Unix pipe") // each element is fed to the command's stdin; its stdout lines form the result
    

    Still, from your description it looks like GNU Parallel could be a much better choice than Spark here. A slightly more concrete sketch of the pipe approach for tshark is given below.
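
For the tshark use case specifically, a minimal sketch of the pipe approach could look like the following. The script name decode_pcap.sh, the file paths, and the SparkContext value sc are assumptions for illustration: the wrapper script would need to exist on every worker node, read one pcap path per line from stdin, run tshark on it (e.g. tshark -T pdml -r "$path"), and write the decoded text to stdout.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc: SparkContext = ??? // existing SparkContext

    // Hypothetical list of per-minute capture files; in practice this would be
    // built by listing the capture directory.
    val pcapPaths: RDD[String] =
      sc.parallelize(Seq("/data/pcap/min-0001.pcap", "/data/pcap/min-0002.pcap"))

    // pipe() writes each RDD element to the command's stdin (one per line) and
    // returns the command's stdout lines as a new RDD.
    val decoded: RDD[String] = pcapPaths.pipe("/usr/local/bin/decode_pcap.sh")

    decoded.saveAsTextFile("/data/pdml-out")

Note that tshark itself would still have to be installed on every node that can run tasks.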