
How to reindex data from one Elasticsearch cluster to another with elasticsearch-hadoop in Spark


I have two separate Elasticsearch clusters, and I want to reindex data from the first cluster into the second. However, it seems I can only set up one Elasticsearch cluster in the SparkContext configuration, for example:

val sparkConf = new SparkConf().setAppName("EsReIndex")
sparkConf.set("es.nodes", "node1.cluster1:9200")

So how can I move data between two Elasticsearch clusters with elasticsearch-hadoop in Spark, inside the same application?


Solution

  • You don't need to configure the node address inside the SparkConf for this.

    When you read and write DataFrames with the elasticsearch-hadoop data source, you can pass the node address as an option on each reader and writer, as follows:

    val df = sqlContext.read
                      .format("org.elasticsearch.spark.sql")
                      .option("es.nodes", "node1.cluster1:9200")
                      .load("your_index/your_type")
    
    df.write
        .format("org.elasticsearch.spark.sql")   // without this, the writer defaults to Parquet
        .option("es.nodes", "node2.cluster2:9200")
        .save("your_new_index/your_new_type")
    

    This should work with Spark 1.6.x and the corresponding elasticsearch-hadoop connector.
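
    Since the connector reads every `es.*` setting from the reader/writer options rather than from the SparkConf, one tidy pattern is to keep each cluster's settings in its own map and pass them with `.options(...)`. A minimal sketch, assuming placeholder node addresses and index names (`es.nodes.wan.only` is shown as an optional extra for clusters whose data nodes are not directly routable from Spark):

    ```scala
    // Per-cluster connection settings; nothing cluster-specific lives in SparkConf.
    // Keys are standard elasticsearch-hadoop configuration options.
    val sourceEs: Map[String, String] = Map(
      "es.nodes"          -> "node1.cluster1:9200",
      "es.nodes.wan.only" -> "true"  // optional: talk only to the listed nodes
    )

    val targetEs: Map[String, String] = Map(
      "es.nodes"          -> "node2.cluster2:9200",
      "es.nodes.wan.only" -> "true"
    )

    // Usage with a Spark 1.6.x SQLContext (sketch):
    // val df = sqlContext.read
    //   .format("org.elasticsearch.spark.sql")
    //   .options(sourceEs)
    //   .load("your_index/your_type")
    //
    // df.write
    //   .format("org.elasticsearch.spark.sql")
    //   .options(targetEs)
    //   .save("your_new_index/your_new_type")
    ```

    Keeping the two maps separate makes it obvious that the read and the write target different clusters, and lets you add per-cluster credentials or SSL options later without touching the other side.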