
Change Hadoop version using spark-ec2


Is it possible to change the Hadoop version when the cluster is created by spark-ec2?

I tried

spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 launch my-spark-cluster

then I log in with

spark-ec2 -k spark -i ~/.ssh/spark.pem login my-spark-cluster

and found that the Hadoop version is 1.0.4.

I want to use a 2.x version of Hadoop. What's the best way to configure this?


Solution

  • Hadoop 2.0

    The spark-ec2 script doesn't support modifying an existing cluster, but you can create a new Spark cluster with Hadoop 2.

    See this excerpt from the script's --help:

      --hadoop-major-version=HADOOP_MAJOR_VERSION
                        Major version of Hadoop (default: 1)
    

    So for example:

    spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 --hadoop-major-version=2 launch my-spark-cluster
    

    ...will create a cluster using the current version of Spark and Hadoop 2.
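
    If you also want to pin the Spark version instead of taking the script's default, the spark-ec2 script has a --spark-version option. A hedged sketch; check your copy's --help for the exact flags it supports:

    spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 \
      --spark-version=1.4.0 --hadoop-major-version=2 launch my-spark-cluster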


    If you use Spark v. 1.3.1 or Spark v. 1.4.0 and create a standalone cluster this way, you will get Hadoop v. 2.0.0 MR1 (from the Cloudera Hadoop Platform 4.2.0 distribution).
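
    To double-check which Hadoop version a launched cluster actually runs, you can log in and ask Hadoop directly. A small sketch; the /root/ephemeral-hdfs path is where the spark-ec2 AMIs typically place Hadoop, so adjust it if your layout differs:

    spark-ec2 -k spark -i ~/.ssh/spark.pem login my-spark-cluster
    # then, on the master node:
    /root/ephemeral-hdfs/bin/hadoop version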


    There are some caveats, but I have successfully used a few clusters of Spark 1.2.0 and 1.3.1 created with Hadoop 2.0.0, using some Hadoop 2-specific features. (Spark 1.2.0 needed a few tweaks, which I have put in my forks of Spark and spark-ec2, but that's another story.)


  • Hadoop 2.4, 2.6

    If you need Hadoop 2.4 or Hadoop 2.6, then I would currently (as of June 2015) recommend creating a standalone cluster manually - it's easier than you probably think.
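
    For instance, a minimal manual setup could look roughly like the sketch below, assuming a Spark build that bundles the Hadoop 2.6 client libraries and SSH access to your EC2 instances (the download URL and hostname are illustrative):

    # on every node: fetch a Spark build prebuilt against Hadoop 2.6
    wget http://archive.apache.org/dist/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
    tar xzf spark-1.4.0-bin-hadoop2.6.tgz
    cd spark-1.4.0-bin-hadoop2.6

    # on the master node: start the standalone master
    ./sbin/start-master.sh

    # on each worker node: register the worker with the master
    ./sbin/start-slave.sh spark://<master-hostname>:7077

    This gives you a standalone Spark cluster linked against the Hadoop 2.6 client libraries; if you also need HDFS or YARN, you would install Hadoop itself separately.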