
Integration of CSV files with Flume vs. Spark


I have a project to integrate CSV files from our partners' servers into our Hadoop cluster.

From my research, both Flume and Spark can do this.

I know that Spark is preferred when you need to perform data transformations.

My question is: what is the difference between Flume and Spark in terms of integration logic?
And is there a performance difference between them when importing CSV files?


Solution

  • Flume is a constantly running process that watches paths or executes functions on files. It is more comparable to Logstash or Fluentd, because it is driven by config files rather than code you write; you only deploy and tune it.
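
    As a minimal sketch of such a config file (the agent and component names like a1, src1, ch1, and the paths are placeholders, not from this answer):

        a1.sources = src1
        a1.channels = ch1
        a1.sinks = sink1

        # Spooling Directory source: watches a local directory for new CSV files
        a1.sources.src1.type = spooldir
        a1.sources.src1.spoolDir = /data/incoming/csv
        a1.sources.src1.channels = ch1

        # File channel durably buffers events between source and sink
        a1.channels.ch1.type = file

        # HDFS sink writes the events into the cluster, partitioned by date
        a1.sinks.sink1.type = hdfs
        a1.sinks.sink1.hdfs.path = /landing/csv/%Y-%m-%d
        a1.sinks.sink1.hdfs.fileType = DataStream
        a1.sinks.sink1.hdfs.useLocalTimeStamp = true
        a1.sinks.sink1.channel = ch1

    Started with the stock launcher, e.g. flume-ng agent --name a1 --conf-file csv-agent.properties, the agent keeps running and ships every file dropped into the spool directory.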

    Preferably, you would parse the CSV files as you read them, convert them to a more self-describing format such as Avro, and then put that into HDFS. See the Morphlines processors for Flume.
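
    Roughly, a morphline file for that CSV-to-Avro step could look like this (the column names and schema file path are made up for illustration):

        morphlines : [
          {
            id : csvToAvro
            importCommands : ["org.kitesdk.**"]
            commands : [
              # Parse each CSV line into named fields
              { readCSV { separator : ",", columns : [id, name, amount], trim : true } }
              # Convert the record to Avro using a predefined schema
              { toAvro { schemaFile : /etc/flume/record.avsc } }
              # Serialize the Avro record so the HDFS sink can write it out
              { writeAvroToByteArray { format : containerlessBinary } }
            ]
          }
        ]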

    With Spark, on the other hand, you would have to write all of that code yourself, end to end. While Spark Streaming can do the same thing, you generally would not run it the same way as Flume; instead, you run it within YARN or another cluster scheduler, where you have no control over which server it runs on, because at the end of the day you should only care whether there are resource constraints.
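
    For comparison, a minimal sketch of the equivalent hand-written Spark batch job in Scala (the paths are assumptions, and the Avro output assumes Spark 2.4+ with the spark-avro module on the classpath):

        import org.apache.spark.sql.SparkSession

        object CsvToHdfs {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("csv-to-hdfs")
              .getOrCreate()

            // Read the partner CSV files, using the header row as column names
            val df = spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("hdfs:///landing/csv/")

            // Write out in a self-describing format
            df.write
              .format("avro")
              .save("hdfs:///warehouse/partners_avro/")

            spark.stop()
          }
        }

    You would then package this and submit it with spark-submit --master yarn, which is the YARN deployment mentioned above.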

    Other alternatives still exist, such as Apache NiFi or StreamSets, which allow more visual pipeline building rather than writing code.