apache-spark · apache-spark-sql · vertica

How to configure a multi-node connection for a Spark DataFrame?


I have set up Vertica on a cluster with 5 nodes. I am using the code below to write a DataFrame to a Vertica table:

    Map<String, String> opts = new HashMap<>();
    opts.put("table", tableName);
    opts.put("db", verticaDB);
    opts.put("dbschema", dashboardSchema);

    opts.put("user", verticaUserName);
    opts.put("password", options.verticaPassword);

    opts.put("host", verticaHost);
    opts.put("hdfs_url",hdfs url);
    opts.put("web_hdfs_url",web_hdfs_url);
    String SPARK_VERTICA_SOURCE = "com.vertica.spark.datasource.DefaultSource";
    dataFrame.write().format(SPARK_VERTICA_SOURCE).options(opts)
                     .mode(saveMode).save();

The above code works fine, but it connects to a single master node of Vertica.

I tried to pass the host as a connection URL for multiple cluster nodes:

    master_node_ip:5433/schema?ConnectionLoadBalance=1&BackupServerNode=node2_ip,node3_ip

I am new to Spark. How can I use load balancing to connect to Vertica from Spark?

Thanks in advance.


Solution

  • If you connect to Vertica that way, ConnectionLoadBalance has exactly this effect: you send the connection request to master_node_ip (a strange name, as Vertica has no master node). To put it in a simplified way: the node in the cluster that receives the connect request "asks" all nodes in the cluster which one currently has the lowest load in number of connections. That node then responds to the connection request, and you are connected with that one.

    If you want more than that, your client (Spark in this case) will have to instantiate, for example, as many threads as you have Vertica nodes; each thread connects to a different Vertica node with ConnectionLoadBalance=False, so that each remains connected exactly where it "wanted" to. Both approaches are sketched below.
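
    For illustration, here is a minimal plain-JDBC sketch of the load-balancing behaviour described above (the Spark connector ultimately opens JDBC connections to Vertica). The host names, database name, and credentials are placeholders, and it assumes the Vertica JDBC driver is on the classpath:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.SQLException;
        import java.util.Properties;

        public class VerticaLoadBalanceDemo {
            public static void main(String[] args) throws SQLException {
                Properties props = new Properties();
                props.setProperty("user", "dbadmin");      // placeholder credentials
                props.setProperty("password", "secret");
                // Ask the node contacted first to hand the connection over to
                // the node with the currently lowest number of connections.
                props.setProperty("ConnectionLoadBalance", "true");
                // Nodes to try if master_node_ip itself is unreachable.
                props.setProperty("BackupServerNode", "node2_ip:5433,node3_ip:5433");

                // Vertica JDBC URL: jdbc:vertica://<host>:<port>/<database>
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:vertica://master_node_ip:5433/verticaDB", props)) {
                    System.out.println("Connected via " + conn.getMetaData().getURL());
                }
            }
        }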
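
    And a minimal sketch of the one-thread-per-node approach, with load balancing switched off so each thread stays pinned to its node; again, node addresses, database name, and credentials are placeholders:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.util.Properties;

        public class PerNodeConnections {
            public static void main(String[] args) throws InterruptedException {
                // One entry per Vertica node (placeholder addresses).
                String[] nodes = {"node1_ip", "node2_ip", "node3_ip", "node4_ip", "node5_ip"};
                Thread[] workers = new Thread[nodes.length];

                for (int i = 0; i < nodes.length; i++) {
                    final String node = nodes[i];
                    workers[i] = new Thread(() -> {
                        Properties props = new Properties();
                        props.setProperty("user", "dbadmin");    // placeholder credentials
                        props.setProperty("password", "secret");
                        // Disable the redirect so this thread remains on its own node.
                        props.setProperty("ConnectionLoadBalance", "false");
                        try (Connection conn = DriverManager.getConnection(
                                "jdbc:vertica://" + node + ":5433/verticaDB", props)) {
                            // ... run this thread's share of the work against its node ...
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
                    workers[i].start();
                }
                for (Thread t : workers) {
                    t.join();   // wait for all per-node workers to finish
                }
            }
        }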

    Hope this helps - Marco