Tags: apache-spark, aggregate, analytics, bigdata

Is Spark Appropriate for Analyzing (Without Redistributing) Logs from Many Machines?


I've got a number of logs spread across a number of machines, and I'd like to collect / aggregate some information about them. Maybe first I want to count the number of lines which contain the string "Message", then later I'll add up the numbers in the fifth column of all of them.

Ideally I'd like to have each individual machine perform whatever operation I tell it to on its own set of logs, then return the results somewhere centralized for aggregation. I (shakily) gather that this is analogous to the Reduce operation of the MapReduce paradigm.

My problem seems to be with Map. My gut tells me that Hadoop isn't a good fit, because in order to distribute the work, each worker node needs a common view of all of the underlying data -- the function fulfilled by HDFS. I don't want to pull all the existing data together just so I can distribute operations across it; I want each machine to analyze the data that it (and only it) has.

I can't tell if Apache Spark would allow me to do this. The impression I got from the quick-start guide is that I could have one master node push out an arbitrary compiled JAR, and that each worker would run it, in this case over just the data identified by logic within that JAR, and return its results to the master node for me to do with as I please. But this from their FAQ makes me hesitant:

Do I need Hadoop to run Spark?

No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

So my question is: Is Apache Spark appropriate for having an existing set of machines analyze data that they already have and aggregating the results?

If it is, could you please reiterate at a high level how Spark would process and aggregate pre-distributed, independent sets of data?

If not, are there any similar frameworks that allow one to analyze existing sets of distributed data?


Solution

  • Short answer: yes.

    You are using the workers to do work on local machines only. Nothing wrong with that. Instead of using

    sc.textFile()
    

    to read data from HDFS, you would use

    java.io.File 
    

    calls to read the data locally.
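
    As a minimal sketch of that local-read part (the directory path and file extension here are placeholders, not from the question), counting lines that contain "Message" in one machine's logs with plain Scala I/O could look like this:

        import java.io.File
        import scala.io.Source

        // Hypothetical log directory; adjust to wherever this machine keeps its logs.
        val logDir   = new File("/var/log/myapp")
        val logFiles = Option(logDir.listFiles()).getOrElse(Array.empty[File])
          .filter(_.getName.endsWith(".log"))

        // Count the lines containing "Message", closing each file when done.
        val messageLines = logFiles.map { f =>
          val src = Source.fromFile(f)
          try src.getLines().count(_.contains("Message")) finally src.close()
        }.sum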

    Now there is one additional step you need: create a custom RDD. Why? Because you need to override getPreferredLocations() to set the correct machine name for each split.

    override def getPreferredLocations(split: Partition): Seq[String] =
    

    Then each split needs to include (see the sketch after this list):

    • the machine name (which is used by the getPreferredLocations() method)
    • the list of files on that machine. Note: that list could potentially be obtained via an RDBMS call, to avoid hard-coding it in a text file.
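
    Here is a hedged sketch of what such a custom RDD could look like. The Partition subclass and the RDD class names are hypothetical (only Partition, RDD, SparkContext and TaskContext come from Spark itself); each split carries a host name and that host's file list:

        import java.io.File
        import scala.io.Source

        import org.apache.spark.{Partition, SparkContext, TaskContext}
        import org.apache.spark.rdd.RDD

        // Hypothetical split: one per machine, carrying the host name and the
        // log files that live on that host.
        class LocalLogPartition(
            override val index: Int,
            val hostname: String,
            val files: Seq[String]) extends Partition

        // Hypothetical RDD whose partitions each read files local to one machine.
        class LocalLogRDD(
            sc: SparkContext,
            hostsToFiles: Seq[(String, Seq[String])])
          extends RDD[String](sc, Nil) {

          // One partition per (machine, file list) pair.
          override def getPartitions: Array[Partition] =
            hostsToFiles.zipWithIndex.map { case ((host, files), i) =>
              new LocalLogPartition(i, host, files): Partition
            }.toArray

          // Ask the scheduler to run each partition on the machine that owns the files.
          override def getPreferredLocations(split: Partition): Seq[String] =
            Seq(split.asInstanceOf[LocalLogPartition].hostname)

          // Runs on the worker: read that machine's files line by line.
          override def compute(split: Partition, context: TaskContext): Iterator[String] = {
            val part = split.asInstanceOf[LocalLogPartition]
            part.files.iterator.flatMap(path => Source.fromFile(new File(path)).getLines())
          }
        }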

    So in the end you have a distributed processing system: you can use all the powerful transformations available on RDDs over the data read from the local machines, and you can distribute and operate on that data across the entire cluster, even though it was originally read from individual machines.
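
    To tie it back to the original question, a driver-side usage example might look like the following (the host names and paths are made up, and sc is assumed to be an existing SparkContext):

        // Hypothetical hosts and paths; in practice this mapping could come from an RDBMS.
        val hostsToFiles = Seq(
          "worker1.example.com" -> Seq("/var/log/myapp/app.log"),
          "worker2.example.com" -> Seq("/var/log/myapp/app.log")
        )

        val logs = new LocalLogRDD(sc, hostsToFiles)

        // Count lines containing "Message" across every machine's local logs.
        val messageCount = logs.filter(_.contains("Message")).count()

        // Sum the fifth whitespace-separated column of every line.
        val fifthColumnTotal = logs.map(_.split("\\s+")(4).toDouble).sum()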